Closed ctrlcctrlv closed 10 months ago
Also, now that I've done this, some of the data types make no sense, and I'll be changing them for the Rust port. Basically there will be two groups of `struct`s: one that is close to upstream hyperglot data and one that is (my idea of a) logical Rust `struct`, with conversion routines between them, and a `savefile` of only the final `struct`s for use in the hyperglot.rs crate, to be used by consumers like MFEKpreview etc.
For example, `base: Option<String>` / `marks` / `auxiliary` all feel like they should have type `Vec<char>`, not space-separated `String`.
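A minimal sketch of that two-struct idea, under my assumptions only (names like `RawOrthography`/`Orthography` are hypothetical, not hyperglot's): the raw struct mirrors the YAML's space-separated strings, and the conversion splits them out.

```rust
/// Hypothetical raw struct, shaped like the upstream YAML.
#[derive(Debug)]
struct RawOrthography {
    base: Option<String>, // e.g. "a b c"
}

/// Hypothetical "logical" struct for consumers.
#[derive(Debug, PartialEq)]
struct Orthography {
    base: Vec<char>,
}

impl From<RawOrthography> for Orthography {
    fn from(raw: RawOrthography) -> Self {
        // Split the space-separated string into individual chars.
        // (This naively takes the first char of each token; multi-char
        // entries like base + combining mark would need more care.)
        let base = raw
            .base
            .as_deref()
            .unwrap_or("")
            .split_whitespace()
            .filter_map(|s| s.chars().next())
            .collect();
        Orthography { base }
    }
}

fn main() {
    let raw = RawOrthography { base: Some("a b c".into()) };
    let logical: Orthography = raw.into();
    assert_eq!(logical.base, vec!['a', 'b', 'c']);
    println!("{:?}", logical.base);
}
```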
In a few places I think contributors unintentionally created maps where we didn't want maps, because YAML is easy to shoot yourself in the foot with. E.g. some `source` vars contain unquoted `:`, making them maps, not strings.
I wrote this to fix it for my case:
```rust
pub fn coerce_source<'de, D>(de: D) -> Result<HashSet<Source>, D::Error>
where
    D: Deserializer<'de>,
{
    let ret = HashSet::<Source>::deserialize(de)?;
    ret.into_iter()
        .map(|so| -> Result<Source, D::Error> {
            match so {
                Source::Arbitrary(ref s) if s.starts_with("Unicode") => Ok(Source::Unicode),
                _ => Ok(so),
            }
        })
        .collect()
}
```
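Stripped of the serde plumbing, the coercion rule itself can be tested in isolation (the `Source` enum here is a cut-down stand-in for illustration, not hyperglot.rs's actual type):

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq, Hash)]
enum Source {
    Unicode,
    Arbitrary(String),
}

/// Fold any free-form source string starting with "Unicode"
/// into the canonical `Source::Unicode` variant.
fn coerce(sources: HashSet<Source>) -> HashSet<Source> {
    sources
        .into_iter()
        .map(|so| match so {
            Source::Arbitrary(ref s) if s.starts_with("Unicode") => Source::Unicode,
            other => other,
        })
        .collect()
}

fn main() {
    let raw: HashSet<Source> = [
        Source::Arbitrary("Unicode 14.0".into()),
        Source::Arbitrary("Omniglot".into()),
    ]
    .into_iter()
    .collect();
    let coerced = coerce(raw);
    assert!(coerced.contains(&Source::Unicode));
    assert!(coerced.contains(&Source::Arbitrary("Omniglot".into())));
}
```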
The `includes` array feels like a kludge and isn't actually used; the example `fas` doesn't exist, so I'll probably drop it?
Oh, the `speakers_date` field also feels mistyped. Fixing that was not fun:
```rust
impl<'de> Deserialize<'de> for Year {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        use serde::de;
        let s = String::deserialize(deserializer)?;
        let parts: Vec<&str> = s.split(|c| c == '-' || c == '–').collect();
        if parts.len() == 1 {
            let year: i32 = parts[0].trim().parse().map_err(de::Error::custom)?;
            let date = NaiveDate::from_ymd(year, 1, 1);
            Ok(Year::One(date))
        } else if parts.len() == 2 {
            let start_year: i32 = parts[0].trim().parse().map_err(de::Error::custom)?;
            let end_year: i32 = parts[1].trim().parse().map_err(de::Error::custom)?;
            let start_date = NaiveDate::from_ymd(start_year, 1, 1);
            let end_date = NaiveDate::from_ymd(end_year, 12, 31);
            Ok(Year::Range(start_date, end_date))
        } else {
            Err(de::Error::custom("Invalid year format"))
        }
    }
}
```
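For what it's worth, the core of that parse, minus the serde and chrono plumbing, is just a split on the two dash characters. A sketch under that reading, returning bare year pairs rather than `NaiveDate`s:

```rust
/// Parse "1999" or "1999-2004" (hyphen or en dash) into a year range.
/// Note: a leading '-' (e.g. BCE years) would confuse this split,
/// just as in the deserializer above.
fn parse_years(s: &str) -> Result<(i32, i32), String> {
    let parts: Vec<&str> = s.split(|c| c == '-' || c == '–').collect();
    match parts.as_slice() {
        // A single year becomes a degenerate range.
        [one] => {
            let y = one.trim().parse().map_err(|e| format!("{e}"))?;
            Ok((y, y))
        }
        // An explicit range keeps both endpoints.
        [start, end] => {
            let a = start.trim().parse().map_err(|e| format!("{e}"))?;
            let b = end.trim().parse().map_err(|e| format!("{e}"))?;
            Ok((a, b))
        }
        _ => Err("Invalid year format".into()),
    }
}

fn main() {
    assert_eq!(parse_years("2011"), Ok((2011, 2011)));
    assert_eq!(parse_years("1999–2004"), Ok((1999, 2004)));
    assert!(parse_years("a-b-c").is_err());
}
```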
Hey! Yes, sounds interesting to follow along. Always happy to improve on this side as well. Note that we are doing some decomposition and base/mark crosschecks between the different `base`, `auxiliary`, and `marks` attributes.
> In a few places I think contributors unintentionally created maps where we didn't want maps because YAML is easy to shoot yourself in the foot with. E.g. some `source` vars contain unquoted `:`, making them maps, not strings.
Those should be fixed in the source! YAML was chosen particularly because it's "editor"-friendly, but it entails some drawbacks.
> Oh, the `speakers_date` field also feels mistyped. Fixing that was not fun.
How would you suggest it should be in the data? Some of the sources explicitly have date ranges, since a census may extend over several years. We were debating using only the end dates of ranges, but it felt like an incorrect representation of the data sources. Either way, this is mostly informational, to give an idea of how current the speaker date is.
Re. the `includes`, see the readme about macrolanguages — where are you finding "fas" in the data of an `includes`?
> For example, `base: Option<String>` / `marks` / `auxiliary` all feel like they should have type `Vec<char>`, not space-separated `String`.
One of the finer points here is that there may be unencoded base + mark combinations that are considered a required character of an orthography. We want to explicitly have a way of telling readers that the combination is required when reading the data; from the data point of view, such unencoded base + mark combinations would indeed be two `char`s. For the entire base/auxiliary/marks attribute, we picked string over list for readability, but you are right that internally they are split by space and treated as a list. As mentioned above, the Python library performs some transformations on the YAML data.
> One of the finer points here is that there may be unencoded base + mark combinations that are considered a required character of an orthography. We want to explicitly have a way of telling readers that the combination is required when reading the data; from the data point of view, such unencoded base + mark combinations would indeed be two `char`s. For the entire base/auxiliary/marks attribute, we picked string over list for readability, but you are right that internally they are split by space and treated as a list. As mentioned above, the Python library performs some transformations on the YAML data.
Wow, thanks for telling me that, haha. I hadn't gotten to it yet, but now I know it needs to be `Vec<(char, Option<char>)>`! Very helpful :)
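So a token like `r̃` (r + U+0303 combining tilde) would decompose into such a tuple. A hedged sketch of splitting one space-separated entry under that assumption (real combining-mark detection would consult Unicode properties, not this "at most two chars per token" shortcut):

```rust
/// Naively split one orthography token into (base, optional mark),
/// assuming at most one trailing combining mark per token.
fn to_pair(token: &str) -> (char, Option<char>) {
    let mut chars = token.chars();
    let base = chars.next().expect("empty token");
    (base, chars.next())
}

fn main() {
    // "r\u{0303}" is r followed by combining tilde: two scalar values,
    // not one precomposed character.
    assert_eq!(to_pair("r\u{0303}"), ('r', Some('\u{0303}')));
    assert_eq!(to_pair("a"), ('a', None));
}
```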
See for example "Hausa" (there were a few such languages, I don't recall offhand): `r̃` (small r with combining tilde) can be composed like this, but it is not an encoded character by itself, in the data and test. For Hausa this means combining tilde is a required base mark, whereas other marks are auxiliary marks, since they are only required for auxiliary characters. `marks` lists all marks an orthography requires, but we make this distinction in the actual language support test, e.g. in the case of Hausa having to insist on combining tilde being in the charset for base support. When saving data we also inspect the `base` and `auxiliary` and perform canonical Unicode decomposition to see which `marks` are present — the CLI tool for checking support has a flag to distinguish between orthography characters being encoded in the font, or orthography characters being composable from base + mark combinations (happening here).
(Actually our initial take was to insist on base + mark combinations for valid (base) support, but we got feedback from many font makers wondering why their fonts with only precomposed/encoded "composite characters" didn't validate—when they lacked the combining marks—so the default for the CLI tool's check is looser now.)
I'm very far along with porting this library to Rust, since I need its data in MFEKpreview, and already have a `savefile`-compatible `.bin`. Are upstream developers interested in this?

Conversion is nothing fancy: