giellatekno / fst-web-interface

Frontend for language tools
https://giellatekno.github.io/fst-web-interface

Localization data and XML files discussion / temporary documentation #2

Open Phaqui opened 1 year ago

Phaqui commented 1 year ago

My current understanding of this topic is as follows:

Apache Forrest builds HTML pages from XML and XSLT. All localization data is stored in one set of XML files, and the XSLT "scripts" work like a program that weaves in data from those XML files and outputs HTML. The XSLT files therefore also contain the HTML markup that determines how the resulting document will look. The understanding so far is that when we move away from Apache Forrest, we will not need these XSLT files anymore, but they will be useful for reference.

That leaves us with something that is a big part of this project: parsing the localization data out of the XML files, so that we don't have to do all of that manually. Some questions still remain.

albbas commented 1 year ago

Maybe svelte-i18n can be used to handle the localisation?

Phaqui commented 1 year ago

Yes, I think it would be suitable. Though I am using it in the form of svelte-intl-precompile. The API is more or less the same, but with promises of faster performance. Free performance with the same API and functionality? Always say yes to that!

The main problem for me now is taking the translation files written as XML, and converting them to JSON for use in the application. The trouble is coming up with a generic conversion script, which can handle changes and/or additions to the XML files.

albbas commented 1 year ago

After having had a look at the cgi-<translation-lang>.xml files, it seems you could do something like this:

So, by extracting the messages by tool, translation-lang, and attribute-lang, you could make e.g. <attribute-lang>-<this_tool>-<translation-lang>.json files where the content becomes:

{
    "tag1-that-has-this_tool-attribute": "tag1-that-has-this_tool-attribute.text",
    ...
    ...
    "tagx-that-has-this_tool-attribute": "tagx-that-has-this_tool-attribute.text"
}

The extraction can be done every time the appropriate .xml file is committed.
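As a rough illustration of that extraction step (the `this_tool` attribute name and the sample XML here are assumptions; the real cgi-<translation-lang>.xml schema may differ), a minimal Python sketch could look like:

```python
import json
import xml.etree.ElementTree as ET


def extract_tool_messages(xml_text: str, tool: str) -> dict:
    """Collect {tag: text} for every element whose this_tool attribute matches."""
    root = ET.fromstring(xml_text)
    return {
        elem.tag: (elem.text or "").strip()
        for elem in root.iter()
        if elem.get("this_tool") == tool
    }


# Hypothetical stand-in for one cgi-<translation-lang>.xml file
sample = """
<strings>
  <analyze this_tool="analyze">Analyze a word form</analyze>
  <generate this_tool="generate">Generate word forms</generate>
  <about>About this site</about>
</strings>
"""

# One such dict per (tool, translation-lang) pair could then be dumped to
# an <attribute-lang>-<this_tool>-<translation-lang>.json file.
print(json.dumps(extract_tool_messages(sample, "analyze"), indent=2))
```

Running this per tool and per translation language would produce exactly the flat JSON files sketched above.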

Phaqui commented 1 year ago

The Python script xmlparsing/xmltojson.py takes an XML file as input and outputs a flat dictionary, where each key is a dot-separated string built from the tag and all attributes of an element, and the value is the corresponding localization string. There is some custom logic to not "fold down" elements containing HTML tags that should be part of the HTML shown on the page. This includes <code>, <a>, etc.

That was probably not a great explanation. Running it yourself will hopefully make it much clearer:

cd xmlparsing
python xmltojson.py cgi-eng.xml
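This is not the actual xmltojson.py, but a hedged sketch of the flattening idea described above: keys are dot-joined from the tag and attribute values, and elements whose children are inline HTML tags (<code>, <a>, ...) keep their markup as the value instead of being folded further.

```python
import xml.etree.ElementTree as ET

# Inline HTML tags that should stay embedded in the localization string
INLINE_HTML = {"code", "a", "em", "strong"}


def flatten(elem, prefix=""):
    """Recursively flatten an element tree into {dotted-key: text}."""
    key = ".".join(p for p in [prefix, elem.tag, *elem.attrib.values()] if p)
    children = list(elem)
    if children and all(c.tag not in INLINE_HTML for c in children):
        # Structural element: recurse and merge the children's entries
        out = {}
        for child in children:
            out.update(flatten(child, key))
        return out
    # Leaf, or element whose children are inline HTML: keep markup as the value
    value = (elem.text or "") + "".join(
        ET.tostring(c, encoding="unicode") for c in children
    )
    return {key: value.strip()}


# Hypothetical input, loosely shaped like the real localization XML
demo = ET.fromstring(
    '<l><msg tool="analyze">Hello</msg>'
    "<help>Use <code>x</code> here</help></l>"
)
print(flatten(demo))
```

The key point is the two-way branch: ordinary nested elements extend the dotted key, while elements containing inline HTML are serialized whole so the markup survives into the JSON value.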

Now, the structure of the key strings does not special-case the tool attribute, for example, but as long as all keys remain unique this way, that really should not be an issue, unless I'm missing something.

Obviously this is a work in progress, and the remaining work includes parsing the XSL files to find out which tools are available for which languages, among other things.