sillsdev / languageforge-lexbox

Lexbox, SIL linguistic data hub
MIT License

Pull vernacular and analysis language from flex project data #922

Closed hahn-kev closed 2 months ago

hahn-kev commented 3 months ago

This would be similar to the entries count data. We would store it in the flex metadata table, and we want to make it searchable/queryable in some way.

Tasks:

Non tasks:

rmunn commented 2 months ago

Related (dupe?): https://github.com/sillsdev/languageforge-lexbox/issues/844

rmunn commented 2 months ago

Summary of discussion from #844:

hg cat -r tip General/LanguageProject.langproj | sed -n -e '/<AnalysisWss>/,/<\/AnalysisWss>/p' -e '/<VernWss>/,/<\/VernWss>/p' -e '/<CurAnalysisWss>/,/<\/CurAnalysisWss>/p' -e '/<CurVernWss>/,/<\/CurVernWss>/p' -e '/<CurPronunWss>/,/<\/CurPronunWss>/p'

One example of this data from a project I had lying around because I was debugging a Send/Receive issue:

<AnalysisWss>
    <Uni>en</Uni>
</AnalysisWss>
<CurAnalysisWss>
    <Uni>en</Uni>
</CurAnalysisWss>
<CurPronunWss>
    <Uni>qaa-fonipa</Uni>
</CurPronunWss>
<CurVernWss>
    <Uni>qaa-fonipa qaa</Uni>
</CurVernWss>
<VernWss>
    <Uni>qaa-fonipa qaa</Uni>
</VernWss>

So as you can see, CurVernWss - CurPronunWss would have left just qaa, which is the "Unknown" language. That illustrates that we won't always be able to extract the language from a FLEx project...
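A minimal Python sketch of reading that sed output and computing "CurVernWss - CurPronunWss" for the example above. The function name `parse_ws_fragment` and the wrapping `<root>` element are assumptions made for illustration (the sed output has no single root element, so one is added before parsing); this is not the code Lexbox uses.

```python
import xml.etree.ElementTree as ET

# The example data from above, as returned by the sed command.
SAMPLE = """
<AnalysisWss>
    <Uni>en</Uni>
</AnalysisWss>
<CurAnalysisWss>
    <Uni>en</Uni>
</CurAnalysisWss>
<CurPronunWss>
    <Uni>qaa-fonipa</Uni>
</CurPronunWss>
<CurVernWss>
    <Uni>qaa-fonipa qaa</Uni>
</CurVernWss>
<VernWss>
    <Uni>qaa-fonipa qaa</Uni>
</VernWss>
"""


def parse_ws_fragment(fragment):
    """Map each element name (e.g. 'CurVernWss') to its list of tags."""
    # Hypothetical wrapper element: the fragment has several top-level
    # elements, which is not well-formed XML on its own.
    root = ET.fromstring(f"<root>{fragment}</root>")
    result = {}
    for element in root:
        uni = element.find("Uni")
        text = uni.text if uni is not None and uni.text else ""
        # Tags are space-separated inside <Uni>.
        result[element.tag] = text.split()
    return result


ws = parse_ws_fragment(SAMPLE)
# Order-preserving difference: vernacular tags that are not pronunciation tags.
remaining = [tag for tag in ws["CurVernWss"] if tag not in ws["CurPronunWss"]]
print(remaining)  # ['qaa']
```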

rmunn commented 2 months ago

In #844 Tim said he thought we'd want to output the entire XML file and parse it in C#. That's feasible: I just checked on staging and found that most LanguageProject.langproj files were about 70 kilobytes in size, nearly all of that from the Anthropology Categories list that's included in most FLEx projects. (A FLEx project where that list has been omitted has a .langproj file of less than 2 kilobytes.) If a project had lots of media files (whose filenames are listed in the .langproj file), the file was larger: I saw 100KB, 160KB, and one 200KB file.

But it's a simple Unix one-liner to focus on just the writing system data and not transmit (or XML-parse) the rest of the file, turning 70KB (or up to 200KB) into less than 1KB. It also takes VERY little CPU time, as it's not parsing XML, just doing simple string comparisons. So I decided to go with the sed command to return only the XML elements we're interested in.
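For anyone unfamiliar with sed range addresses, here is a rough Python sketch of what the one-liner does: keep only the lines between an opening and a closing writing-system tag, using plain substring checks and no XML parsing. The function name `extract_ws_lines` and the sample text are made up for illustration. One difference from sed is noted in the comments.

```python
WS_ELEMENTS = ("AnalysisWss", "VernWss", "CurAnalysisWss", "CurVernWss", "CurPronunWss")


def extract_ws_lines(langproj_text):
    """Return only the lines inside the writing-system elements."""
    kept = []
    open_element = None  # name of the element we are currently inside, if any
    for line in langproj_text.splitlines():
        if open_element is None:
            for name in WS_ELEMENTS:
                # "<AnalysisWss>" is not a substring of "<CurAnalysisWss>",
                # so the five names never match each other's tags.
                if f"<{name}>" in line:
                    open_element = name
                    break
        if open_element is not None:
            kept.append(line)
            # Unlike sed, this also closes a range whose end tag is on the
            # same line as its start tag.
            if f"</{open_element}>" in line:
                open_element = None
    return kept


# Made-up miniature .langproj content, only to exercise the filter.
SAMPLE_LANGPROJ = """<LangProject>
<AnthroList>
<Uni>lots of data we do not need</Uni>
</AnthroList>
<CurVernWss>
<Uni>qaa-fonipa qaa</Uni>
</CurVernWss>
</LangProject>"""

for kept_line in extract_ws_lines(SAMPLE_LANGPROJ):
    print(kept_line)
```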

rmunn commented 2 months ago

Oh, and one large project I looked at also proves that you don't always want "CurVernWss - CurPronunWss". Its CurPronunWss is just one writing system, which I'll call xyz. Its CurVernWss is xyz xyz-Latn-x-majority xyz-Zxxx-x-majority-audio xyz-Latn-x-minority1 xyz-Zxxx-x-minority1-audio xyz-Latn-x-minority2 xyz-Zxxx-x-minority2-audio. But the language code for that project should indeed be xyz.

So the logic we'll want is to just look at CurVernWss and find the language that is shared between all the tags. If there is more than one, we'll want to flag that somehow. Perhaps I'll write a one-time script that goes through every project in the database, throttled to do just one per second or something, and grabs just the CurVernWss list. Then I'll be able to find out how many of them, if any, have more than one language in the vernacular list. Hopefully that number will be zero, but let's find out.
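A sketch of that "shared language" check in Python, assuming that the language of a tag is its first hyphen-separated subtag (as in BCP 47 language tags). The function names `primary_languages` and `project_language` are hypothetical, and returning a `(language, flagged)` pair is just one possible way to flag the ambiguous cases.

```python
def primary_languages(tags):
    """Return the distinct first subtags of the given tags, in first-seen order."""
    seen = []
    for tag in tags:
        # Assumption: the language is everything before the first hyphen.
        lang = tag.split("-", 1)[0].lower()
        if lang and lang not in seen:
            seen.append(lang)
    return seen


def project_language(cur_vern_wss):
    """Return (language, flagged) for a space-separated CurVernWss value.

    flagged is True when the list has zero or several distinct languages,
    in which case no single language is returned.
    """
    langs = primary_languages(cur_vern_wss.split())
    if len(langs) == 1:
        return langs[0], False
    return None, True


big_project = (
    "xyz xyz-Latn-x-majority xyz-Zxxx-x-majority-audio "
    "xyz-Latn-x-minority1 xyz-Zxxx-x-minority1-audio "
    "xyz-Latn-x-minority2 xyz-Zxxx-x-minority2-audio"
)
print(project_language(big_project))       # ('xyz', False)
print(project_language("qaa-fonipa qaa"))  # ('qaa', False)
print(project_language("aaa bbb-fonipa"))  # (None, True)
```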

rmunn commented 2 months ago

Preliminary results from that research (all actual language codes except en and qaa have been replaced with xyz or aaa or bbb or similar in the list below):

But all those problems were rare. The vast majority had vernacular lists that were either a single xyz, or a short list like xyz-fonipa-x-emic xyz-x-variatio xyz qaa-fonipa-x-enIPA.

rmunn commented 2 months ago

Conclusions:

rmunn commented 2 months ago

Decision made at design meeting today: language data that we'll display will be the raw list of tags, with the default vernacular (and default analysis, if we display that) highlighted or bolded.

We've decided to add an isDefault boolean to the object we store in JSON, following the rules of the FLEx data: the first one in the vernacular/analysis list is the default one for that list. The frontend will then be able to apply styles based on whether isDefault is truthy.
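A rough Python sketch of building such a list for one writing-system list. The field names `tag`, `isActive` and `isDefault` match the ones in the GraphQL query later in this thread; everything else is an assumption for illustration: the function name `build_ws_list`, treating the Cur* list as the "active" tags, taking the first tag of the Cur* list as the default, and listing active tags before inactive ones.

```python
import json


def build_ws_list(all_tags, current_tags):
    """Build [{tag, isActive, isDefault}, ...] from two space-separated lists.

    all_tags:     e.g. the VernWss value (every vernacular writing system)
    current_tags: e.g. the CurVernWss value (assumed to be the active ones)
    """
    current = current_tags.split()
    # Assumption: the first tag in the current list is the default one.
    default = current[0] if current else None
    # Active tags first, in their own order, then any remaining tags.
    ordered = current + [tag for tag in all_tags.split() if tag not in current]
    return [
        {"tag": tag, "isActive": tag in current, "isDefault": tag == default}
        for tag in ordered
    ]


vernacular = build_ws_list("qaa-fonipa qaa en", "qaa-fonipa qaa")
print(json.dumps(vernacular, indent=2))
```

With a shape like this, the frontend only has to check `isDefault` on each item to decide which tag to bold.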

rmunn commented 2 months ago

GraphQL queries for writing systems on projects will get a little verbose, but they work:

query Example {
    projects(where: {flexProjectMetadata: {writingSystems: {analysisWss: {some: {tag: {eq: "en"}}}}}}) {
        id
        code
        flexProjectMetadata {
            writingSystems {
                analysisWss {
                    tag
                    isActive
                    isDefault
                }
            }
        }
    }
}