rosettatype / hyperglot

Hyperglot: a database and tools for detecting language support in fonts
http://hyperglot.rosettatype.com
GNU General Public License v3.0

Rust port #106

Closed. ctrlcctrlv closed this issue 10 months ago.

ctrlcctrlv commented 1 year ago

I'm very far along with porting this library to Rust, since I need its data in MFEKpreview, and I already have a savefile-compatible .bin.

#[derive(Deserialize_enum_str, Serialize_enum_str, Clone, Debug, PartialEq, Eq, Hash, Savefile)]
#[serde(rename_all = "snake_case")]
#[repr(isize)]
pub enum Status {
    Deprecated = -0xff,
    Historical = -0xf,
    Local = -0x7,
    Transliteration = 0,
    Primary = 1,
    Secondary = 2,
}

#[derive(Deserialize_enum_str, Serialize_enum_str, Clone, Debug, PartialEq, Eq, Hash, Savefile)]
#[serde(rename_all = "snake_case")]
#[repr(isize)]
pub enum Validity {
    Verified = 0,
    Todo = -0xff,
    Draft,
    Preliminary,
}

#[derive(Deserialize_enum_str, Serialize_enum_str, Clone, Debug, PartialEq, Eq, Hash, Savefile)]
pub enum Source {
    Wikipedia,
    Omniglot,
    Unicode,
    #[serde(other)]
    Arbitrary(String),
}

#[derive(Clone, Debug, PartialEq, Eq, Hash, rkyv::Archive)]
pub enum Year {
    One(NaiveDate),
    Range(NaiveDate, NaiveDate)
}

impl From<Year> for Vec<NaiveDate> {
    fn from(y: Year) -> Vec<NaiveDate> {
        match y {
            Year::One(y) => vec![y],
            Year::Range(s, e) => vec![s, e]
        }
    }
}

impl From<Vec<NaiveDate>> for Year {
    fn from(f: Vec<NaiveDate>) -> Self {
        match f.len() {
            1 => Self::One(f[0]),
            2 => Self::Range(f[0], f[1]),
            _ => panic!("Invalid serialized Year")
        }
    }
}

#[derive(Clone, Debug, Serialize, Deserialize, Savefile)]
struct LanguageData {
    name: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    preferred_name: Option<String>,
    #[serde(skip_serializing_if = "Vec::is_empty", default)]
    orthographies: Vec<OrthographyData>,
    #[serde(skip_serializing_if = "HashSet::is_empty", default)]
    #[serde(deserialize_with = "coerce_source")]
    source: HashSet<Source>,
    #[serde(skip_serializing_if = "Option::is_none")]
    speakers: Option<u128>,
    #[serde(skip_serializing_if = "Option::is_none")]
    speakers_date: Option<Year>,
    #[serde(skip_serializing_if = "HashSet::is_empty", default)]
    includes: HashSet<String>,
    validity: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    note: Option<String>,
}

#[derive(Clone, Debug, Serialize, Deserialize, Savefile)]
struct OrthographyData {
    #[serde(skip_serializing_if = "Option::is_none")]
    autonym: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    auxiliary: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    base: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    inherit: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    marks: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    numerals: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    preferred_as_group: Option<bool>,
    #[serde(skip_serializing_if = "Option::is_none")]
    punctuation: Option<String>,
    script: Option<String>,
    status: Option<Status>,
    #[serde(skip_serializing_if = "Vec::is_empty", default)]
    design_requirements: Vec<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    design_alternates: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    note: Option<String>,
}

#[derive(Clone, Debug, Serialize, Deserialize, Savefile)]
struct Hyperglot(BTreeMap<String, LanguageData>);

Are upstream developers interested in this?

Conversion is nothing fancy:

use std::{fs::File, io::BufReader, path::Path};

use savefile::save_file;

fn main() {
    let hyperglot_yaml = Path::new("hyperglot.yaml");
    let hyperglot_yaml_file = File::open(hyperglot_yaml).unwrap();
    let hyperglot_yaml_reader = BufReader::new(hyperglot_yaml_file);

    // Deserialize the upstream YAML and write it back out as a savefile .bin.
    let hyperglot_yaml_deserialized: Hyperglot = serde_yaml::from_reader(hyperglot_yaml_reader).unwrap();
    save_file("hyperglot.bin", 0, &hyperglot_yaml_deserialized).unwrap();

    // Sanity check: the typed deserialization kept as many languages as the raw YAML mapping contains.
    let hyperglot_yaml_file = File::open(hyperglot_yaml).unwrap();
    let mut hyperglot_yaml_reader = BufReader::new(hyperglot_yaml_file);
    let raw: serde_yaml::Value = serde_yaml::from_reader(&mut hyperglot_yaml_reader).unwrap();
    assert!(hyperglot_yaml_deserialized.0.len() == raw.as_mapping().unwrap().len());
}
ctrlcctrlv commented 1 year ago

Also, now that I've done this, some of the data types make no sense, and I'll be changing them for the Rust port. Basically there will be two groups of structs: one that stays close to the upstream hyperglot data, and one that is (my idea of) a logical Rust struct, with conversion routines between them. Only the final structs get savefile support, for use in the hyperglot.rs crate by consumers like MFEKpreview etc.
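
As a rough illustration of the split (hypothetical names and fields, not final code), the logical side could be built from the raw LanguageData above, reusing the Validity enum:

// Hypothetical sketch: a "logical" struct derived from the raw LanguageData
// defined above, with the free-form validity string parsed into the enum.
#[derive(Clone, Debug)]
pub struct Language {
    pub name: String,
    pub validity: Validity,      // parsed out of the raw `validity: String`
    pub speakers: Option<u128>,
}

impl TryFrom<LanguageData> for Language {
    type Error = String;

    fn try_from(raw: LanguageData) -> Result<Self, Self::Error> {
        let validity = match raw.validity.as_str() {
            "verified" => Validity::Verified,
            "preliminary" => Validity::Preliminary,
            "draft" => Validity::Draft,
            "todo" => Validity::Todo,
            other => return Err(format!("unknown validity: {}", other)),
        };
        Ok(Language { name: raw.name, validity, speakers: raw.speakers })
    }
}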

For example, base: Option<String> / marks / auxiliary all feel like they should have the type Vec<char>, not a space-separated String.
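
A naive sketch of that reading (hypothetical helper; kontur's reply below explains why a single token can hold more than one codepoint):

// Hypothetical: turn a space-separated field like "a b c ɓ ɗ" into Vec<char>.
// Note: this keeps only the first codepoint of each token, so unencoded
// base + mark combinations (discussed later in the thread) would be truncated.
fn chars_of(field: Option<&str>) -> Vec<char> {
    field
        .map(|s| s.split_whitespace().filter_map(|t| t.chars().next()).collect())
        .unwrap_or_default()
}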

In a few places I think contributors unintentionally created maps where we didn't want maps, because YAML makes it easy to shoot yourself in the foot. E.g. some source values contain an unquoted colon, which turns them into maps, not strings (a standalone sketch of the pitfall follows the fix below).

I wrote this to fix it for my case:

pub fn coerce_source<'de, D>(de: D) -> Result<HashSet<Source>, D::Error>
    where D: Deserializer<'de>
{
    let ret: HashSet<Source> = HashSet::<Source>::deserialize(de)?;
    // Collapse free-form source strings that start with "Unicode" into Source::Unicode.
    ret.into_iter().map(|so| -> Result<Source, D::Error> {
        match so {
            Source::Arbitrary(ref s) => if s.starts_with("Unicode") {
                Ok(Source::Unicode)
            } else {
                Ok(so)
            },
            _ => Ok(so)
        }
    }).collect()
}
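
For the record, the pitfall itself is easy to reproduce with serde_yaml (a standalone sketch with made-up data, not hyperglot code or data):

fn main() {
    // An unquoted scalar containing ": " is parsed as a mapping, not a string.
    let unquoted: serde_yaml::Value =
        serde_yaml::from_str("source:\n  - Unicode: https://www.unicode.org\n").unwrap();
    let quoted: serde_yaml::Value =
        serde_yaml::from_str("source:\n  - \"Unicode: https://www.unicode.org\"\n").unwrap();
    assert!(unquoted["source"][0].is_mapping());
    assert!(quoted["source"][0].is_string());
}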

The includes array feels like a kludge and isn't actually used; for example, fas doesn't exist, so I'll probably drop it?

ctrlcctrlv commented 1 year ago

Oh the speakers_date field also feels mistyped.

Fixing that was not fun:

impl<'de> Deserialize<'de> for Year {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        use serde::de;
        // Accept either a single year ("2012") or a hyphen/en-dash range ("2011–2012").
        let s = String::deserialize(deserializer)?;
        let parts: Vec<&str> = s.split(|c| c == '-' || c == '–').collect();
        if parts.len() == 1 {
            let year: i32 = parts[0].trim().parse()
                .map_err(de::Error::custom)?;
            let date = NaiveDate::from_ymd(year, 1, 1);
            Ok(Year::One(date))
        } else if parts.len() == 2 {
            let start_year: i32 = parts[0].trim().parse()
                .map_err(de::Error::custom)?;
            let end_year: i32 = parts[1].trim().parse()
                .map_err(de::Error::custom)?;
            let start_date = NaiveDate::from_ymd(start_year, 1, 1);
            let end_date = NaiveDate::from_ymd(end_year, 12, 31);
            Ok(Year::Range(start_date, end_date))
        } else {
            Err(de::Error::custom("Invalid year format"))
        }
    }
}
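
With that impl, the two accepted formats deserialize as expected (a quick check, assuming serde_yaml; the scalars are quoted so YAML treats them as strings):

fn main() {
    // A single year and an en-dash range, both as YAML string scalars.
    let one: Year = serde_yaml::from_str("\"2012\"").unwrap();
    let range: Year = serde_yaml::from_str("\"2011–2012\"").unwrap();
    assert_eq!(one, Year::One(NaiveDate::from_ymd(2012, 1, 1)));
    assert_eq!(
        range,
        Year::Range(NaiveDate::from_ymd(2011, 1, 1), NaiveDate::from_ymd(2012, 12, 31))
    );
}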
kontur commented 1 year ago

Hey! Yes, sounds interesting to follow along. Always happy to improve on this side as well. Note that we are doing some decomposition and base/mark crosschecks between the different base, auxiliary and marks attributes.

In a few places I think contributors unintentionally created maps where we didn't want maps because YAML is easy to shoot yourself in the foot with. E.g. some source vars contain unquoted : making them maps, not strings.

Those should be fixed in the source! YAML was chosen particularly because it's "editor friendly", but it entails some drawbacks.

Oh the speakers_date field also feels mistyped. Fixing that was not fun

How would you suggest it should be represented in the data? Some of the sources explicitly give date ranges, since a census may extend over several years. We were debating using only the end dates of ranges, but that felt like an incorrect representation of the data sources. Either way, this is mostly informational, to give an idea of how current the speaker data is.

Re. the includes, see the readme about macrolanguages — where are you finding "fas" in the data of an includes?

For example, base: Option<String> / marks / auxiliary all feel like they should have the type Vec<char>, not a space-separated String.

One of the finer points here is that there may be unencoded base + mark combinations that are considered a required character of an orthography. We want to explicitly have a way of telling readers that such a combination is required when reading the data; from the data point of view, an unencoded base + mark combination would indeed be two chars. For the base/auxiliary/marks attributes we picked a string over a list for readability, but you are right that internally they are split on spaces and treated as a list. As mentioned above, the Python library performs some transformations on the YAML data.

ctrlcctrlv commented 1 year ago

One of the finer points here is that there may be unencoded base + mark combinations that are considered a required character of an orthography. We want to explicitly have a way of telling readers that such a combination is required when reading the data; from the data point of view, an unencoded base + mark combination would indeed be two chars. For the base/auxiliary/marks attributes we picked a string over a list for readability, but you are right that internally they are split on spaces and treated as a list. As mentioned above, the Python library performs some transformations on the YAML data.

Wow thanks for telling me that haha. I hadn't gotten to it yet but now I know it needs to be Vec<(char, Option<char>)>! Very helpful :)
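
A rough sketch of that shape (hypothetical helper, not hyperglot.rs code):

// Parse one space-separated token from base/auxiliary/marks into
// (base char, optional combining mark). Longer clusters are rejected here;
// a real implementation might want a richer type for them.
fn parse_token(token: &str) -> Option<(char, Option<char>)> {
    let mut chars = token.chars();
    match (chars.next(), chars.next(), chars.next()) {
        (Some(base), mark, None) => Some((base, mark)),
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_token("a"), Some(('a', None)));
    assert_eq!(parse_token("r\u{0303}"), Some(('r', Some('\u{0303}')))); // Hausa r + combining tilde
}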

kontur commented 1 year ago

See for example "Hausa" (there were a few such languages, I don't recall offhand): r̃ (small r with combining tilde) can be composed like this, but it is not an encoded character by itself, in the data and test. For Hausa this means the combining tilde is a required base mark, whereas other marks are auxiliary marks, since they are only required for auxiliary characters. marks lists all marks an orthography requires, but we make this distinction in the actual language support test, e.g. in the case of Hausa having to insist on the combining tilde being in the charset for base support. When saving data we also inspect the base and auxiliary and perform canonical Unicode decomposition to see which marks are present — the CLI tool for checking support has a flag to distinguish between orthography characters being encoded in the font, or orthography characters being composable from base + mark combinations (happening here).

kontur commented 1 year ago

(Actually our initial take was to insist on base + mark combinations for valid (base) support, but we got feedback from many font makers wondering why their fonts with only precomposed/encoded "composite characters" didn't validate—when they lacked the combining marks—so the default for the CLI tool's check is looser now.)
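
For what it's worth, here is a standalone sketch of that looser default check, assuming the unicode-normalization crate and a charset of codepoints the font encodes; it is not the actual hyperglot CLI logic:

use std::collections::HashSet;

use unicode_normalization::UnicodeNormalization;

// One orthography "character" counts as supported if the font encodes its
// codepoints directly, or can compose it from its canonical decomposition.
fn supported(cluster: &str, charset: &HashSet<char>) -> bool {
    let encoded = cluster.chars().all(|c| charset.contains(&c));
    let composable = cluster.nfd().all(|c| charset.contains(&c));
    encoded || composable
}

fn main() {
    // A charset with the precomposed "á" but no combining acute, plus r and combining tilde.
    let charset: HashSet<char> = ['a', 'á', 'r', '\u{0303}'].into_iter().collect();
    assert!(supported("á", &charset));         // encoded precomposed character
    assert!(supported("r\u{0303}", &charset)); // composable base + mark (Hausa r̃)
}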