Closed hahn-kev closed 2 months ago
Related (dupe?): https://github.com/sillsdev/languageforge-lexbox/issues/844
Summary of discussion from #844:
- The writing system data lives in the General/LanguageProject.langproj file.
- The file can be read from the repository with: hg cat -r tip General/LanguageProject.langproj
- To extract only the writing system elements: hg cat -r tip General/LanguageProject.langproj | sed -n -e '/<AnalysisWss>/,/<\/AnalysisWss>/p' -e '/<VernWss>/,/<\/VernWss>/p' -e '/<CurAnalysisWss>/,/<\/CurAnalysisWss>/p' -e '/<CurVernWss>/,/<\/CurVernWss>/p' -e '/<CurPronunWss>/,/<\/CurPronunWss>/p'
One example of this data from a project I had lying around because I was debugging a Send/Receive issue:
<AnalysisWss>
<Uni>en</Uni>
</AnalysisWss>
<CurAnalysisWss>
<Uni>en</Uni>
</CurAnalysisWss>
<CurPronunWss>
<Uni>qaa-fonipa</Uni>
</CurPronunWss>
<CurVernWss>
<Uni>qaa-fonipa qaa</Uni>
</CurVernWss>
<VernWss>
<Uni>qaa-fonipa qaa</Uni>
</VernWss>
So as you can see, CurVernWss - CurPronunWss would have left just qaa, which is the "Unknown" language. Which goes to illustrate that we won't always be able to extract the language from a FLEx project...
In #844 Tim said he thought we'd want to output the entire XML file and parse it in C#. That's feasible; I just checked on staging and found that most LanguageProject.langproj files were about 70 kilobytes in size, nearly all of that from the Anthropology Categories list that's included in most FLEx projects. (A FLEx project where that list has been omitted has a .langproj file of less than 2 kilobytes). If the project had lots of media files (whose filenames are listed in the .langproj file) then it was larger: I saw 100KB, 160KB, and one 200KB file.
It's a simple Unix one-liner (which also takes VERY little CPU time, as it's not parsing XML, just doing simple string comparisons) to focus on just the writing system data and not transmit (or XML-parse) the rest of the file, turning 70KB (or up to 200KB) into less than 1KB. So I decided to go with the sed command to return only the XML elements we're interested in.
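As a rough illustration of how the sed output could be consumed, here is a minimal Python sketch (the thread mentions parsing in C#; Python is used here only for illustration). It assumes each element sits on its own lines, as in the sample above, so that the sed output is a series of well-formed sibling elements. The function name `parse_writing_systems` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_writing_systems(sed_output: str) -> Dict[str, List[str]]:
    """Map each element name (e.g. CurVernWss) to its list of tags."""
    # The sed output has several top-level elements and no single root,
    # so wrap it in a throwaway root element before parsing.
    root = ET.fromstring("<root>" + sed_output + "</root>")
    result: Dict[str, List[str]] = {}
    for element in root:
        uni = element.find("Uni")
        text = uni.text if uni is not None and uni.text else ""
        # Tags inside <Uni> are separated by whitespace.
        result[element.tag] = text.split()
    return result


# Sample data copied from the example project above.
sample = """
<AnalysisWss>
<Uni>en</Uni>
</AnalysisWss>
<CurAnalysisWss>
<Uni>en</Uni>
</CurAnalysisWss>
<CurPronunWss>
<Uni>qaa-fonipa</Uni>
</CurPronunWss>
<CurVernWss>
<Uni>qaa-fonipa qaa</Uni>
</CurVernWss>
<VernWss>
<Uni>qaa-fonipa qaa</Uni>
</VernWss>
"""

writing_systems = parse_writing_systems(sample)
print(writing_systems["CurVernWss"])
```

Since the fragment is under 1KB, parsing it costs very little compared with parsing the whole .langproj file.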
Oh, and one large project I looked at also proves that you don't always want "CurVernWss - CurPronunWss". It has CurPronunWss being just one writing system which I'll call xyz. The CurVernWss has xyz xyz-Latn-x-majority xyz-Zxxx-x-majority-audio xyz-Latn-x-minority1 xyz-Zxxx-x-minority1-audio xyz-Latn-x-minority2 xyz-Zxxx-x-minority2-audio. But the language code for that project should indeed be xyz.
So the logic we'll want is to just look at CurVernWss and find the language that is shared between all the tags. If there is more than one, we'll want to flag that somehow. Perhaps I'll write a one-time script that goes through every project in the database, throttled to do just one per second or something, and grabs just the CurVernWss list. Then I'll be able to find out how many of them, if any, have more than one language in the vernacular list. Hopefully that number will be zero, but let's find out.
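A minimal sketch of that check, assuming the language of a tag is its first hyphen-separated subtag (as in BCP 47 language tags). The function names are hypothetical, and the sample list is the anonymized one from the large project described above.

```python
from typing import List


def primary_language(tag: str) -> str:
    """Return the first subtag of a writing system tag, lowercased."""
    return tag.split("-")[0].lower()


def distinct_languages(tags: List[str]) -> List[str]:
    """Return the distinct primary languages, in order of first appearance."""
    seen: List[str] = []
    for tag in tags:
        language = primary_language(tag)
        if language not in seen:
            seen.append(language)
    return seen


cur_vern = (
    "xyz xyz-Latn-x-majority xyz-Zxxx-x-majority-audio "
    "xyz-Latn-x-minority1 xyz-Zxxx-x-minority1-audio "
    "xyz-Latn-x-minority2 xyz-Zxxx-x-minority2-audio"
).split()

languages = distinct_languages(cur_vern)
# Flag the project for manual review unless exactly one language is shared.
needs_review = len(languages) != 1
print(languages, needs_review)
```

A list such as qaa-fonipa qaa would also yield a single language here, but that language is qaa ("Unknown"), so a caller would probably still want to treat it as unidentified.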
Preliminary results from that research (all actual language codes except en and qaa have been replaced with xyz or aaa or bbb or similar in the list below):
- [...] xyz-flex probably has xyz, xyz-Latn, and xyz-fonipa as vernacular writing systems).
- [...] xyz-flex but had writing systems xyz xyz-fonipa xyz-Latn aaa bbb ccc en ddd-Latn and so on. But xyz was the majority and matched the language identified by the project code.
- [...] en xyz qaa-fonipa-x-xyz-etic but the analysis languages list was just en. For that one, xyz would be the better choice, although in fact I could tell that this project was not about the xyz language and that would have been misidentified. But in this particular case there was nothing in the data that would have identified the language correctly. Another had aaa xyz as vernacular tags, and en aaa as analysis tags. (Also, aaa was listed in CurPronunWss, and it turned out that indeed, aaa was being used as the language tag to record, in the lexeme field, how the word was pronounced.)
- [...] xxy had a project code named xyy-flex. The name made it clear that xxy was the correct tag, and it turns out that there was already an xxy-flex project, which is likely why the xyy-flex project was created, with a project code that was close but not quite correct. The language tag in the CurVernWss list was correct, though. I've also seen xy in the vern list with xyz in the project code, and when I looked it up, xy was the correct two-letter tag for xyz. So if comparing tags to project code segments to confirm a guess, normalizing two-letter tags to their three-letter equivalents first might be a good idea.
- [...] xyz1-flex-gial, xyz2-flex-gial, ..., xyz11-flex-gial projects. It seems likely that any project code ending in flex-gial or flex-diu is a training project. When we do project cleanup, we may want to archive those in a deep-freeze type of backup (so they can be restored if anyone actually wants them) and then delete them from the live server. That's unrelated to the language-identifying task, but it's worth noting.
- [...] xyz-flex had only qaa-x-xyz and qaa-x-xyz-Latn as vernacular tags. If the vernacular language gets identified as qaa (the official "Unknown language" tag), but there's a qaa-x-xyz where there are exactly three letters after the x- and those three letters match the project code, it's worth guessing with medium confidence that that's the language. (I've also seen qaa-Latn-US-x-xyz and xyz would turn out to be the correct language tag even though it wasn't in the project code.)
- [...] aaa-bbb-flex and it turned out that bbb was the vernacular.
But all those problems were rare. By far the vast majority had vernacular lists that were either a single xyz, or a short list like xyz-fonipa-x-emic xyz-x-variatio xyz qaa-fonipa-x-enIPA.
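The patterns in the findings above could be combined roughly as follows. This is only a sketch under stated assumptions: the function names and the "high"/"medium"/"low" labels are hypothetical, the two-letter to three-letter normalization is omitted because it needs an ISO 639 lookup table, and training-project codes (flex-gial, flex-diu) are not handled.

```python
from typing import List, Optional, Tuple


def project_code_segments(project_code: str) -> List[str]:
    """Strip a trailing -flex and split the rest of the code on hyphens."""
    code = project_code.lower()
    if code.endswith("-flex"):
        code = code[: -len("-flex")]
    return code.split("-")


def private_use_language(tag: str) -> Optional[str]:
    """For tags like qaa-x-xyz, return xyz if exactly three letters follow x-."""
    subtags = tag.lower().split("-")
    if subtags[0] != "qaa" or "x" not in subtags:
        return None
    after_x = subtags[subtags.index("x") + 1:]
    if after_x and len(after_x[0]) == 3 and after_x[0].isalpha():
        return after_x[0]
    return None


def guess_language(project_code: str, vern_tags: List[str]) -> Tuple[str, str]:
    """Return (language, confidence); the confidence labels are illustrative."""
    segments = project_code_segments(project_code)
    real: List[str] = []     # primary subtags other than qaa
    private: List[str] = []  # three-letter hints found after x- in qaa tags
    for tag in vern_tags:
        primary = tag.lower().split("-")[0]
        if primary == "qaa":
            hint = private_use_language(tag)
            if hint and hint not in private:
                private.append(hint)
        elif primary not in real:
            real.append(primary)
    # A tag that also appears in the project code gets the highest confidence.
    for language in real:
        if language in segments:
            return language, "high"
    # A qaa-x-xyz hint that matches the project code is a medium-confidence guess.
    for language in private:
        if language in segments:
            return language, "medium"
    # Otherwise fall back to the first listed language, or qaa if there is none.
    if real:
        return real[0], "low"
    return "qaa", "low"


print(guess_language("xyz-flex", ["qaa-x-xyz", "qaa-x-xyz-Latn"]))
print(guess_language("aaa-bbb-flex", ["bbb", "bbb-fonipa"]))
```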
Conclusions:
- [...] -flex part at the end, and assign higher confidence to any tag also located in part of the project code.
Decision made at design meeting today: language data that we'll display will be the raw list of tags, with the default vernacular (and default analysis, if we display that) highlighted or bolded.
We've decided to add an isDefault boolean to the object we store in JSON, following the rules of the FLEx data: the first one in the vernacular/analysis list is the default one for that list. The frontend will then be able to apply styles based on whether isDefault is truthy.
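A minimal sketch of how such objects might be built. The field names tag, isActive, and isDefault come from the GraphQL query in this thread; everything else is an assumption: that isActive means the tag appears in the corresponding "Cur" list, that the default is the first tag in that "Cur" list, and the function name `build_ws_list` itself.

```python
import json
from typing import Dict, List, Union

WsEntry = Dict[str, Union[str, bool]]


def build_ws_list(all_tags: List[str], current_tags: List[str]) -> List[WsEntry]:
    """Build one JSON-ready object per writing system tag."""
    # Assumption: the first tag in the current list is the default one.
    default_tag = current_tags[0] if current_tags else None
    return [
        {
            "tag": tag,
            # Assumption: a tag is active when it is in the current list.
            "isActive": tag in current_tags,
            "isDefault": tag == default_tag,
        }
        for tag in all_tags
    ]


# VernWss and CurVernWss from the sample project earlier in the thread.
vernacular = build_ws_list(["qaa-fonipa", "qaa"], ["qaa-fonipa", "qaa"])
print(json.dumps(vernacular, indent=2))
```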
GraphQL queries for writing systems on projects will get a little verbose, but they work:
query Example {
  projects(where: {flexProjectMetadata: {writingSystems: {analysisWss: {some: {tag: {eq: "en"}}}}}}) {
    id
    code
    flexProjectMetadata {
      writingSystems {
        analysisWss {
          tag
          isActive
          isDefault
        }
      }
    }
  }
}
This would be similar to the entries count data. We would store it in the flex metadata table, and we want to make it searchable/queryable in some way.
Tasks:
Non tasks: