rfilmyer commented 2 years ago

As mentioned in #4, many MB entity attributes could be better represented as enums, rather than Strings.

As part of this pull request, I want to try and convert as many things as is feasible. Since this is a breaking change for MB's API, it might be best to do as many as possible in a single PR.

Conceptually, we should be able to grab the values from some source of truth in MB backend code and just write corresponding variants in rust code. However, it turns out that the values are only stored in the MB database, not in source code.

Enums to create:

Aliases

[ ] Alias Types
Areas
[x] Area Type (Double-checking previous work)
Artists
[x] Artist Type (Double-checking previous work)
[x] Artist Gender (Double-checking previous work)
Events
[x] Event Type
Instruments
[ ] Instrument ⚠ Over 1,000 variants in DB
[x] Instrument Type
Labels
[x] Label Type
Places
[x] Place Type (Double-checking previous work)
Releases
[ ] Release Status (Double-checking previous work)
[x] Release Quality (Double-checking previous work)
[x] Release Packaging
[ ] Language ⚠ Over 7,000 variants in DB
[ ] Script ⚠150 variants in DB
Release Groups
[x] Primary Type
[x] Secondary Type
Series
[x] Series Type
Works
[x] Work Attributes
[x] Work Types
[x] Work Musical Keys
[ ] Other Attribute Subtypes (Raga/Mode) ⚠ Over 8,000 variants in DB

rfilmyer commented 2 years ago

A solution to some issues in #4 (specifically language) is likely to be infeasible if we want to encode every language in the DB. Even scripts (Latin, Devanagari, Arabic) will likely be pretty large at over 100 scripts.

oknozor commented 2 years ago

@rfilmyer no rush but when this is ready can you rename your commits to respect the conventional commits specification ? I use cocogitto to create release automatically.

rfilmyer commented 2 years ago

Note for posterity: I ended up writing this script to generate the enums from the table (I actually used a Jupyter notebook, but same difference)

#!/usr/bin/env python
# coding: utf-8

import re
import os
import textwrap

import pandas as pd

table_path = r"C:\Users\rdoge\Downloads\mbdump\release_group_secondary_type"
table_name = os.path.basename(table_path)
print(table_name)

table_name_tokens = table_name.split("_")
human_readable_table_name = " ".join(table_name_tokens)
print(human_readable_table_name)

table = pd.read_csv(table_path, header=None, sep="\t", index_col=0, names=["id", "name", 2, "sort_ranking", "description", "uuid"], na_values=r"\N", dtype={"description": "string", "sort_ranking": "Int64"})
table

table["sort_ranking"] = table["sort_ranking"].fillna(0)

sorted_table = table.sort_values(by=["sort_ranking", "name"])
sorted_table

def to_camel_case(identifier: str) -> str:
    split = re.split("[\W_]", identifier)
    recapitalized = [word.capitalize() for word in split]
    combined = "".join(recapitalized)
    return combined

enum_docstring_wrapper = textwrap.TextWrapper(width=100, initial_indent="/// ", subsequent_indent="/// ")
variant_docstring_wrapper = textwrap.TextWrapper(width=100, initial_indent="    /// ", subsequent_indent="    /// ")

# Enum docstring
enum_docstring_lines = []
enum_docstring_lines += [f"The type of a MusicBrainz {table_name_tokens[0]} entity."]
enum_docstring_lines += [f"Note that this enum is `non_exhaustive`; " 
                         f"The list of {table_name_tokens[0]} types is subject to change " 
                        "and these changes are only reflected in the DB, not in actual MB code."]
enum_docstring_lines += [f"Variants are derived from the `{table_name}` table in the MusicBrainz database."]
for line in enum_docstring_lines:
    for wrapped_line in enum_docstring_wrapper.wrap(line):
        print(wrapped_line)
print("#[non_exhaustive]")
print("#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]")
print(f"pub enum {to_camel_case(table_name)} {{")
for row in sorted_table.itertuples():
    if not pd.isna(row.description):
        description_lines = variant_docstring_wrapper.wrap(row.description)
        for line in description_lines:
            print(line)
    camel_case_name = to_camel_case(row.name)
    if row.name != camel_case_name:
        print(f"    #[serde(rename = \"{row.name}\")]")
    print(f"    {camel_case_name},")
print(f"    /// Any {table_name} that does not yet have a corresponding variant in this enum.")
print(f"    /// If you ever see a `{to_camel_case(table_name)}::Unrecognized{to_camel_case(table_name)}` in the wild, let us know and file an issue/pull request!")
print("    #[serde(other)]")
print(f"    Unrecognized{to_camel_case(table_name)},")
print("}")

rfilmyer commented 2 years ago

@oknozor this should be ready for another CI run. commits reformatted as well.

rfilmyer commented 2 years ago

I think this is just waiting for the final CI run but the PR is complete.

Aliases are hard (the valid list is unique to each entity type, so it would either mean creating a master Alias enum with variants that are only valid for some use cases, or creating separate enums for each type of instance.), so I'm leaving that out of scope of this PR.

Language/Script and Work Raga/Mode have an extremely large number of variants, so likely would not be a good fit for an enum. There are over 7,000 valid languages for MusicBrainz.

oknozor commented 2 years ago

This looks great but you did not use a fallback variant with an inner string. I am ok with this as long as the enums contains all possible variant currently in musicbrainz database. Is it the case ? I don't know if it's possible but it would be nice to query the database on CI runs to know if we are missing some enum variant.

rfilmyer commented 2 years ago

Morning!

As far as I can tell, it is unrealistic to check for new variants in CI. To get my list of variants, I had to download a recent copy of the MusicBrainz database, roughly 4GB for the mbdump.tar.bz2 blob, and inspect the individual tables (dumped as tab-delimited files) for variants. Presumably, checking for up-to-date variants in CI would require something similar.

rfilmyer commented 2 years ago

And serde's fallthrough in #[serde(other)] only allows a unit variant. Rewriting it a different way would likely require a substantial amount of boilerplate code.

rfilmyer commented 2 years ago

But yes, those enum lists were derived from the most recent version of the MusicBrainz database, so they should be complete until the next schema change.

oknozor commented 2 years ago

Indeed that seems unrealistic, I though maybe there was an HTTP endpoint to query those data. I am merging this then. Thanks for the hard work @rfilmyer !

yvanzo commented 2 years ago

I though maybe there was an HTTP endpoint to query those data.

Which data would you like to get through ws/2?

oknozor commented 2 years ago

Hey @yvanzo actually it would be nice to have the list of all possible values for SeriesType, AreaType, ArtistType, EventType, InstrumentType, LabelType, ReleasePackaging, ReleaseGroupPrimaryType, ReleaseGroupSecondaryType, SeriesType and maybe some others values.

The idea for musicbrainz_rs is to be able to check for unknown values in the CI/tests and update the api bindings accordingly (without the need to download the whole DB), making sure our enum are exhaustive and staying up to date with the REST api. We could even generate the enum variant with some procmacros (but I am not sure we really want this) .

I would totally understand if this is not something you want in musicbrainz API but I think exposing data like artist type and instrument type would make the api data easier to discover. Also this would allow auto-completion on those fields in front-ends.

yvanzo commented 2 years ago

I see. Good suggestion. Created https://tickets.metabrainz.org/browse/MBS-12093

oknozor commented 2 years ago

Thank you @yvanzo I am moving this to a dedicated issue : #72

oknozor / musicbrainz_rs

Adding Enums #70

Enums to create:

Aliases

Areas

Artists

Events

Instruments

Labels

Places

Releases

Release Groups

Series

Works