Pull vernacular and analysis language from flex project data

hahn-kev commented 3 months ago

This would be similar to the entries count data. We would store it in the flex metadata table, and we want to make it searchable/queryable in some way.

Tasks:

[x] pull language data from xml files in hg
[x] display language data on project page
[x] allow filtering projects by language data via graphql

Non tasks:

define UI in admin page to filter based on vernacular.

rmunn commented 2 months ago

Summary of discussion from #844:

Writing system data for FLEx projects is stored in the General/LanguageProject.langproj file.
We could get that without checking out the repo by doing hg cat -r tip General/LanguageProject.langproj
We could just grab the whole XML file (it's not that large) and parse it in .NET, ...
... Or we could extract just the parts we want so that we don't have to send too much data over the network by doing the following:

hg cat -r tip General/LanguageProject.langproj | sed -n -e '/<AnalysisWss>/,/<\/AnalysisWss>/p' -e '/<VernWss>/,/<\/VernWss>/p' -e '/<CurAnalysisWss>/,/<\/CurAnalysisWss>/p' -e '/<CurVernWss>/,/<\/CurVernWss>/p' -e '/<CurPronunWss>/,/<\/CurPronunWss>/p'

Then slap an opening and closing tag around that and parse it.
Either way, the writing system tags for each category are a space-separated list of strings.
Anything in CurPronunWss is also found in CurVernWss, so if you want only the non-pronunciation vernaculars then you'd take CurVernWss and subtract CurPronunWss.
It's unclear to me (yet) whether the CurVernWss vs. VernWss distinction is all that important. We could check against multiple current projects and find out.

One example of this data from a project I had lying around because I was debugging a Send/Receive issue:

<AnalysisWss>
    <Uni>en</Uni>
</AnalysisWss>
<CurAnalysisWss>
    <Uni>en</Uni>
</CurAnalysisWss>
<CurPronunWss>
    <Uni>qaa-fonipa</Uni>
</CurPronunWss>
<CurVernWss>
    <Uni>qaa-fonipa qaa</Uni>
</CurVernWss>
<VernWss>
    <Uni>qaa-fonipa qaa</Uni>
</VernWss>

So as you can see, CurVernWss - CurPronunWss would have left just qaa, which is the "Unknown" language. Which goes to illustrate that we won't always be able to extract the language from a FLEx project...
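For illustration, here is a minimal Python sketch of the "wrap it in a root element and parse it" step, using the example data above. The production code would be C#; the `<root>` wrapper name and the `parse_writing_systems` helper are assumptions made up for this sketch, not anything that exists in the codebase.

```python
import xml.etree.ElementTree as ET

# Example sed output, copied from the sample project above.
SED_OUTPUT = """\
<AnalysisWss>
    <Uni>en</Uni>
</AnalysisWss>
<CurAnalysisWss>
    <Uni>en</Uni>
</CurAnalysisWss>
<CurPronunWss>
    <Uni>qaa-fonipa</Uni>
</CurPronunWss>
<CurVernWss>
    <Uni>qaa-fonipa qaa</Uni>
</CurVernWss>
<VernWss>
    <Uni>qaa-fonipa qaa</Uni>
</VernWss>
"""


def parse_writing_systems(fragment: str) -> dict[str, list[str]]:
    """Wrap the fragment in a single (hypothetical) root element and
    return each category's space-separated tags as a list."""
    root = ET.fromstring(f"<root>{fragment}</root>")
    return {child.tag: (child.findtext("Uni") or "").split() for child in root}


wss = parse_writing_systems(SED_OUTPUT)

# CurVernWss minus CurPronunWss, preserving the original order.
pronunciation = set(wss["CurPronunWss"])
non_pronunciation = [t for t in wss["CurVernWss"] if t not in pronunciation]
print(non_pronunciation)  # ['qaa']
```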

rmunn commented 2 months ago

In #844 Tim said he thought we'd want to output the entire XML file and parse it in C#. That's feasible; I just checked on staging and found that most LanguageProject.langproj files were about 70 kilobytes in size, nearly all of that from the Anthropology Categories list that's included in most FLEx projects. (A FLEx project where that list has been omitted has a .langproj file of less than 2 kilobytes.) If the project had lots of media files (whose filenames are listed in the .langproj file) then it was larger: I saw 100KB, 160KB, and one 200KB file.

But the sed command is a simple Unix one-liner that takes VERY little CPU time, since it's not parsing XML, just doing simple string comparisons. It returns only the writing system data, so the rest of the file never has to be transmitted or XML-parsed, which turns 70KB (or up to 200KB) into less than 1KB. So I decided to go with the sed command to return only the XML elements we're interested in.

rmunn commented 2 months ago

Oh, and one large project I looked at also proves that you don't always want "CurVernWss - CurPronunWss". It has CurPronunWss being just one writing system which I'll call xyz. The CurVernWss has xyz xyz-Latn-x-majority xyz-Zxxx-x-majority-audio xyz-Latn-x-minority1 xyz-Zxxx-x-minority1-audio xyz-Latn-x-minority2 xyz-Zxxx-x-minority2-audio. But the language code for that project should indeed be xyz.

So the logic we'll want is to just look at CurVernWss and find the language that is shared between all the tags. If there is more than one, we'll want to flag that somehow. Perhaps I'll write a one-time script that goes through every project in the database, throttled to do just one per second or something, and grabs just the CurVernWss list. Then I'll be able to find out how many of them, if any, have more than one language in the vernacular list. Hopefully that number will be zero, but let's find out.
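A rough sketch of that "shared language" check, in Python for brevity. It assumes the language is simply the first hyphen-delimited subtag of each writing system tag; the helper names are made up for illustration.

```python
def primary_language(tag: str) -> str:
    """Return the first subtag of a writing system tag, lowercased.
    Assumption: the first hyphen-delimited part is the language code."""
    return tag.split("-", 1)[0].lower()


def distinct_languages(tags: list[str]) -> list[str]:
    """Distinct language codes across all tags, in first-seen order."""
    seen: list[str] = []
    for tag in tags:
        lang = primary_language(tag)
        if lang not in seen:
            seen.append(lang)
    return seen


# The CurVernWss list from the large project described above.
cur_vern = (
    "xyz xyz-Latn-x-majority xyz-Zxxx-x-majority-audio "
    "xyz-Latn-x-minority1 xyz-Zxxx-x-minority1-audio "
    "xyz-Latn-x-minority2 xyz-Zxxx-x-minority2-audio"
).split()

langs = distinct_languages(cur_vern)
needs_review = len(langs) > 1  # flag projects with more than one language
print(langs, needs_review)
```

For that project every tag shares the language xyz, so nothing would be flagged.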

rmunn commented 2 months ago

Preliminary results from that research (all actual language codes except en and qaa have been replaced with xyz or aaa or bbb or similar in the list below):

Most projects do indeed have just one language code, and that language code usually matches the name of the project (e.g. a project named xyz-flex probably has xyz, xyz-Latn, and xyz-fonipa as vernacular writing systems).
Some projects have multiple languages in the vernacular list; one of them, for example, was named xyz-flex but had writing systems xyz xyz-fonipa xyz-Latn aaa bbb ccc en ddd-Latn and so on. But xyz was the majority and matched the language identified by the project code.
Some projects had analysis languages included in the vernacular list; one had en xyz qaa-fonipa-x-xyz-etic but the analysis languages list was just en. For that one, xyz would be the better choice based on the data, although in fact I could tell that this project was not about the xyz language, so it would have been misidentified. In this particular case there was nothing in the data that would have identified the language correctly. Another had aaa xyz as vernacular tags, and en aaa as analysis tags. (Also, aaa was listed in CurPronunWss, and it turned out that indeed, aaa was being used as the language tag to record, in the lexeme field, how the word was pronounced.)
Some projects had language tags that did not match the project code, e.g. one project with language tag xxy had project code xyy-flex. The project name made it clear that xxy was the correct tag, and it turns out that there was already an xxy-flex project, which is likely why the xyy-flex project was created with a project code that was close but not quite correct. The language tag in the CurVernWss list was correct, though. I've also seen xy in the vern list with xyz in the project code, and when I looked it up, xy was the correct two-letter tag for xyz. So if comparing tags to project code segments to confirm a guess, normalizing two-letter codes to their three-letter equivalents first might be a good idea.
There are a number of xyz1-flex-gial, xyz2-flex-gial, ..., xyz11-flex-gial projects. It seems likely that any project code ending in flex-gial or flex-diu is a training project. When we do project cleanup, we may want to archive those in a deep-freeze type of backup (so they can be restored if anyone actually wants them) and then delete them from the live server. That's unrelated to the language-identifying task, but it's worth noting.
Some projects named xyz-flex had only qaa-x-xyz and qaa-x-xyz-Latn as vernacular tags. If the vernacular language gets identified as qaa (the official "Unknown language" tag), but there's a qaa-x-xyz where there are exactly three letters after the x- and those three letters match the project code, it's worth guessing with medium confidence that that's the language. (I've also seen qaa-Latn-US-x-xyz, where xyz turned out to be the correct language tag even though it wasn't in the project code.)
Some projects had project code aaa-bbb-flex and it turned out that bbb was the vernacular.

But all those problems were rare. By far the vast majority had vernacular lists that were either a single xyz, or a short list like xyz-fonipa-x-emic xyz-x-variatio xyz qaa-fonipa-x-enIPA.

rmunn commented 2 months ago

Conclusions:

If there's just one language in the list, there's no need to guess. If there are multiple, we might need to guess.
We may want to strip analysis langs out of the vernacular list before guessing the language.
We may want to assign confidence levels to language determinations: High, Medium, and Low, for example.
- High could be for "just one language in the list", Medium for "one language left after stripping analysis languages out", Low for "more than one language in the list and stripping analysis languages out didn't help, so we guessed based on which one was first".
We may want to split the project code into hyphen-delimited segments, discarding a -flex part at the end, and assign higher confidence to any tag also located in part of the project code.
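To make those conclusions concrete, here is one way the confidence levels might be sketched in Python. Every name here (`Confidence`, `guess_vernacular`, `project_code_segments`) is hypothetical, and treating the first subtag as the language is an assumption; how a project-code match should raise confidence is left open.

```python
from enum import Enum


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


def unique_languages(tags: list[str]) -> list[str]:
    """Distinct first subtags (assumed to be the language), in order."""
    seen: list[str] = []
    for tag in tags:
        lang = tag.split("-", 1)[0].lower()
        if lang not in seen:
            seen.append(lang)
    return seen


def guess_vernacular(vern_tags, analysis_tags):
    """Return (language, Confidence), or None for an empty list."""
    langs = unique_languages(vern_tags)
    if not langs:
        return None
    if len(langs) == 1:
        return langs[0], Confidence.HIGH  # just one language in the list
    analysis = set(unique_languages(analysis_tags))
    remaining = [lang for lang in langs if lang not in analysis]
    if len(remaining) == 1:
        return remaining[0], Confidence.MEDIUM  # one left after stripping
    # Stripping didn't settle it: guess based on which one was first.
    return (remaining or langs)[0], Confidence.LOW


def project_code_segments(code: str) -> list[str]:
    """Split a project code on hyphens, discarding a trailing -flex."""
    segments = code.lower().split("-")
    if segments and segments[-1] == "flex":
        segments.pop()
    return segments


print(guess_vernacular(["aaa", "xyz"], ["en", "aaa"]))
print(project_code_segments("aaa-bbb-flex"))
```

With the examples from the research above, aaa xyz with analysis tags en aaa comes out as xyz at Medium, and en xyz qaa-fonipa-x-xyz-etic with analysis en comes out as xyz at Low.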

rmunn commented 2 months ago

Decision made at design meeting today: language data that we'll display will be the raw list of tags, with the default vernacular (and default analysis, if we display that) highlighted or bolded.

We've decided to add an isDefault boolean to the object we store in JSON, following the rules of the FLEx data: the first one in the vernacular/analysis list is the default one for that list. The frontend will then be able to apply styles based on whether isDefault is truthy.
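As a sketch of what the stored objects might look like (Python here just for illustration; the real code would be C#): the `tag`, `isActive`, and `isDefault` field names match the ones used in the GraphQL query in this thread, but treating `isActive` as "the tag also appears in the corresponding Cur* list" is an assumption, as is the `to_ws_entries` helper name.

```python
import json


def to_ws_entries(all_tags: list[str], current_tags: list[str]) -> list[dict]:
    """Build one entry per tag. The first tag in the list is the default;
    isActive (assumed meaning) is membership in the Cur* list."""
    current = set(current_tags)
    return [
        {"tag": tag, "isActive": tag in current, "isDefault": index == 0}
        for index, tag in enumerate(all_tags)
    ]


# VernWss and CurVernWss from the sample project earlier in this thread.
vernacular = to_ws_entries(["qaa-fonipa", "qaa"], ["qaa-fonipa", "qaa"])
print(json.dumps(vernacular, indent=2))
```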

rmunn commented 2 months ago

GraphQL queries for writing systems on projects will get a little verbose, but they work:

query Example {
    projects(where: {flexProjectMetadata: {writingSystems: {analysisWss: {some: {tag: {eq: "en"}}}}}}) {
        id
        code
        flexProjectMetadata {
            writingSystems {
                analysisWss {
                    tag
                    isActive
                    isDefault
                }
            }
        }
    }
}

sillsdev / languageforge-lexbox

Pull vernacular and analysis language from flex project data #922