giellalt / giella-core

Build tools and build support files as well as developer support tools for the GiellaLT repositories.
https://giellalt.uit.no
GNU General Public License v3.0

Documentation for new languages #5

Closed aaronfay closed 3 years ago

aaronfay commented 3 years ago

Hello,

I am interested in building FSTs for a new language, is there documentation that will help someone new to get started?

kristiank commented 3 years ago

Hello Aaron, I am not sure what you mean by 'building FSTs for a new language', but I guess you mean the morphology. My MA thesis developed a system that takes inflection tables as input and uses them as a model from which it automatically creates both the FSTs for the Giella framework and source code for a morphology module for Grammatical Framework. This way the newly added language gets integrated into both frameworks.

The main idea behind the master's thesis is to enable non-technical users to create and maintain a computational morphology by using a technology-neutral representation of the morphology (e.g. inflection tables). Everything concerning the technological aspects happens behind the scenes, and all source code is derived automatically from the inflection tables.

Since the Giella framework has automated the building of a speller based on the FST, non-technical users can use the system from the master's thesis to also create a speller for their language purely by filling in a simple inflection table.

Another important aspect of the work was to enable multiple representations on the technological side, e.g. source code for both FST and Grammatical Framework. It also produces a computational description of the morphology in Lexical Markup Framework XML. This makes it possible to add future technologies with no more work than mapping the computational description to a new source code generator. For the thesis I also created a generator that produces LaTeX code for the inflection tables.
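To make the "one neutral description, several generators" idea concrete, here is a minimal Python sketch. It is not the code from the thesis: the `Paradigm` class, the `to_lexc` and `to_latex` functions, and the dot-separated feature labels are all hypothetical names chosen for illustration, and the lexc output simply lists full forms instead of modelling stems and suffixes.

```python
from dataclasses import dataclass


@dataclass
class Paradigm:
    """Technology-neutral inflection table for one lemma (hypothetical model)."""
    lemma: str
    pos: str                 # part-of-speech tag, e.g. "N"
    forms: dict[str, str]    # feature label such as "Sg" or "Pl.Gen" -> surface form


def to_lexc(p: Paradigm) -> str:
    """Generator 1: emit a full-form lexc lexicon, one entry per table cell."""
    entries, tags_seen = [], set()
    for feats, surface in p.forms.items():
        # Turn "Pl.Gen" into the tag sequence +N+Pl+Gen (tag names are assumptions).
        tags = [f"+{t}" for t in (p.pos, *feats.split("."))]
        tags_seen.update(tags)
        entries.append(f"{p.lemma}{''.join(tags)}:{surface} # ;")
    # lexc wants multi-character tags declared before the first lexicon.
    header = "Multichar_Symbols " + " ".join(sorted(tags_seen))
    return "\n".join([header, "LEXICON Root", *entries])


def to_latex(p: Paradigm) -> str:
    """Generator 2: emit the same table as a two-column LaTeX tabular."""
    rows = [f"{feats} & {surface} \\\\" for feats, surface in p.forms.items()]
    return "\n".join([r"\begin{tabular}{ll}", *rows, r"\end{tabular}"])


if __name__ == "__main__":
    cat = Paradigm("cat", "N", {"Sg": "cat", "Pl": "cats"})
    print(to_lexc(cat))
    print(to_latex(cat))
```

Both functions read the same `Paradigm` object, so supporting another technology in this sketch only means writing one more function of the same shape.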

The master's thesis is available here and includes an English summary (p. 160).

I am not affiliated with the Giella LT group, but I sympathize strongly with their efforts creating their framework. I also sympathize with other technological groups, and that is the reason behind my 'technology-neutral' approach.

aaronfay commented 3 years ago

Hi @kristiank ,

Yes, please bear with me. I am a language learner and a software engineer, but I still lack the vocabulary of linguistics and of these specific projects to speak coherently about my needs at this point.

I have the compiled normative generator and descriptive analyzer FSTs for one Indigenous language here in Canada, and would like to begin work on another (or several). Recognizing that the barrier to entry for this technology is pretty high, I would like to learn the process for getting started with generating morphological analysis tools for spell checking etc. for different languages, both so that I may contribute to Indigenous language revitalization here in Canada and so that I can teach others.

Your project does sound very interesting and I have read the English summary. Is any of the tooling you have described available for use currently? I would be interested in trying it to see if it can serve our purposes.

snomos commented 3 years ago

Hello @aaronfay, I am one of the architects behind the GiellaLT infrastructure. In addition to having a look at @kristiank 's frameworks and tools, there is of course also the possibility of working directly in the GiellaLT framework, by building lexicons and morphologies in lexc, and morphophonologies either in twolc or with xfst rewrite rules. The main reference documentation for these technologies is the book Finite State Morphology. There is also a web site with updated software you can download - the original software comes with the book on a CD. That said, you should rather use Foma or Hfst, both of which are source-code compatible with the Xerox tools described in the book.
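For readers new to this division of labour, the following toy Python sketch mimics it: continuation lexicons in the spirit of lexc build an analysis and an intermediate form, and a rewrite rule in the spirit of twolc/xfst turns the intermediate form into the surface form. It is only an illustration in plain Python, not an FST and not something Foma or Hfst would compile; the lexicon names, the `^` boundary symbol, and the English e-insertion rule are assumptions made for the example.

```python
import re

# Continuation lexicons in the spirit of lexc: each entry is
# (analysis side, intermediate side, next lexicon); None ends the word.
LEXICONS = {
    "Root": [("fox", "fox", "NounInfl"), ("cat", "cat", "NounInfl")],
    "NounInfl": [("+N+Sg", "", None), ("+N+Pl", "^s", None)],
}


def expand(lexicon="Root", upper="", lower=""):
    """Yield every (analysis, intermediate form) pair reachable from `lexicon`."""
    for up, low, nxt in LEXICONS[lexicon]:
        if nxt is None:
            yield upper + up, lower + low
        else:
            yield from expand(nxt, upper + up, lower + low)


def apply_rules(intermediate):
    """Morphophonology as ordered rewrites: e-insertion, then boundary removal."""
    # Insert "e" at a morpheme boundary between s/x/z and the suffix "s".
    form = re.sub(r"(?<=[sxz])\^(?=s)", "^e", intermediate)
    # Drop the boundary symbol so only surface letters remain.
    return form.replace("^", "")


# Map each analysis string to its surface form, as a generator FST would.
GENERATOR = {analysis: apply_rules(mid) for analysis, mid in expand()}

if __name__ == "__main__":
    for analysis, surface in GENERATOR.items():
        print(f"{analysis}\t{surface}")
```

Here `GENERATOR["fox+N+Pl"]` is `"foxes"` and `GENERATOR["cat+N+Pl"]` is `"cats"`. In the real tools the lexicon and the rules are each compiled to transducers and composed, which is what makes the result usable for both analysis and generation.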

The GiellaLT infrastructure can use all three of these technologies, and will use whatever is found on the computer. To build spellers we rely on Hfst and derived technologies. For production spellers we use divvunspell, a Rust reimplementation of hfst-ospell that is about 10x faster in actual use.

A third alternative (to using @kristiank 's framework or working directly with LexC files in the GiellaLT infra) is to build your lexicon (and possibly also your morphology) in an editing tool outputting xml, which is then transformed to lexc. @rueter is doing that for most - if not all - of the languages he is working on. Conversion from xml is supported directly in the GiellaLT framework. Both @rueter and I can give details about the xml structure if this is an interesting option.
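As a rough picture of what such a transformation involves, here is a small Python sketch that turns an XML lexicon into lexc entries. The XML layout used here (a `lexicon` element with `entry` children carrying `lemma`, `stem` and `cont` attributes) is invented for the example and is not the structure GiellaLT or @rueter actually use; the function name `xml_to_lexc` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, made up for this sketch only.
SAMPLE_XML = """
<lexicon name="Nouns">
  <entry lemma="cat" stem="cat" cont="NounInfl"/>
  <entry lemma="fox" stem="fox" cont="NounInfl"/>
</lexicon>
"""


def xml_to_lexc(xml_text: str) -> str:
    """Convert the hypothetical XML lexicon into one lexc LEXICON block."""
    root = ET.fromstring(xml_text.strip())
    lines = [f"LEXICON {root.get('name')}"]
    for entry in root.iter("entry"):
        lemma, stem, cont = entry.get("lemma"), entry.get("stem"), entry.get("cont")
        # lexc only needs an upper:lower pair when the two sides differ.
        pair = lemma if lemma == stem else f"{lemma}:{stem}"
        lines.append(f"{pair} {cont} ;")
    return "\n".join(lines)


if __name__ == "__main__":
    print(xml_to_lexc(SAMPLE_XML))
```

The point of the design is that editors only ever touch the XML, while the lexc file is regenerated from it and never edited by hand.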

To get started using the GiellaLT infra, have a look at this page. That page assumes there already is a language repository to work on. I can easily set one up for you - and this is needed in any case irrespective of the approach to development you take.

Since you are working with indigenous languages in Canada, you should know that there are a number of other people and institutions already building FSTs (analysers, spellers) and other tools for those same languages, many of them using the GiellaLT framework either directly or indirectly (indirectly is roughly how @kristiank works: developing in his own infra, then exporting to GiellaLT). The main person to contact is @aarppe , who is a professor at the University of Alberta and the head of the language technology work done there through Altlab. Actually, when I checked that site just now, it turned out there is a job opening for a software developer, so if that would be of interest, give it a try.

Finally, what languages are you planning to work on?

Hope this answers your questions, and please continue to ask. The problem is not so much a lack of documentation as that there is too much of it and it is too unorganised. And please tell me what language repo(s) you would like me to set up for you.

aaronfay commented 3 years ago

Hello @snomos

Thank you for the details. I am in touch with Antti at the UofA and we are beginning to collaborate on some projects regarding languages in Canada. I may have jumped the gun by reaching out to the wider project like this, as I am still learning how all the pieces fit together. It seems that my time might be best spent understanding how to work with foma for the time being, though if the tooling described by @kristiank is available and usable today, that would likely be sufficient. We have over 60 Indigenous languages in Canada, and unfortunately work is only being done on a few of them. Were the barrier to entry a little lower for some of these technologies, it might be easier to get more people involved, which is the motivation for my enquiry.

I thank you all again for your time. I will dig further into the resources that you have described. 🙏

Trondtr commented 3 years ago

Hello, @aaronfay , just to repeat @snomos ' point: The best infrastructure is the github setup github/giellalt/lang-xxx, and you are welcome to use that. Discussing with @aarppe is definitely a good idea. The 60 languages differ greatly from each other. If deciding between them is an issue, there are several factors you may give weight to:

kristiank commented 3 years ago

@aaronfay the code is available here, but it lacks documentation. The input file is the csv-table file here. On the first line of the table you can specify your own labels. Write to me at kristian (at) keeleleek (dot) ee and I will help you get started, so that you can see what the generated code looks like and decide which approach you want to take.

ftyers commented 3 years ago

Dear @aaronfay, I would also add that there is an IRC chat for HFST on freenode, accessible through Matrix, if you would like real-time support -- we're also familiar with the Giella infrastructure and are enthusiastic about working with indigenous languages of the Americas.