proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Python issues: Splitting long text by folia2txt and FLAT in the custom software #106

Open osherenko opened 1 year ago

osherenko commented 1 year ago

1) I've installed folia-utils and used "folia2txt -s ..." from the CLI to split a long string into sentences. Unfortunately, if I split the Old Slavonic text "Искони бе Слово и Слово бе отъ Бога. и Богъ бе слово." into sentences, I get the wrong answer: Искони бе Слово и Слово бе отъ Бога. и Богъ бе слово. If I split an English text, it works just fine.

2) Is it possible to run FLAT not as a tab in an internet browser, but as a PySide widget?

BTW, I can't import folia2html from the foliatools package in my Python script as I did with foliatools.folia2txt, foliatools.foliafreqlist, foliatools.foliatree. Nevertheless, I can run it from the CLI by "python.exe foliatools\folia2txt.py -s myannotation.xml"

proycon commented 1 year ago
  1. I've installed folia-utils and used the "folia2txt -s ..." from CLI to split a long string in sentences.

folia2txt -s is not a proper sentence splitter; it simply assumes each line of a text file is already its own sentence!

For an actual tokeniser and sentence splitter with rich FoLiA support, consider ucto: https://github.com/LanguageMachines/ucto It has no specific rules for Old Church Slavonic, but you can use the generic ruleset (named generic) or the Russian one (tokconfig-rus).
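To illustrate what "one sentence per line" means in practice, here is a minimal sketch that pre-splits a string before handing it to other tools. It is neither ucto nor what folia2txt does internally: the rule is deliberately naive (split after ., ! or ? when whitespace follows), the function name naive_split_sentences is made up for this example, and it will mis-split abbreviations and similar cases.

```python
import re


def naive_split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace.

    Hypothetical helper for illustration only. It ignores the case of the
    next word, so a sentence that starts with a lowercase letter is still
    split off.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = "Искони бе Слово и Слово бе отъ Бога. и Богъ бе слово."

# Join the sentences with newlines to get one sentence per line.
one_per_line = "\n".join(naive_split_sentences(text))
```

A real tokeniser is still the better choice for anything beyond a quick experiment, since punctuation alone is a weak signal for sentence boundaries.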

  2. Is it possible to run FLAT not as a tab in an internet browser, but as a PySide widget?

I hadn't heard of PySide until now, so I don't know. I suppose if there's a Qt widget that holds a whole web browser, then yes.
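For what it's worth, Qt does provide a browser widget, QWebEngineView, so a rough sketch of the idea could look like the following. This has not been tried against FLAT. It assumes PySide6 with the Qt WebEngine module is installed and that a FLAT server is already running and reachable; the host, the port and the helper names flat_url and show_flat_in_qt are placeholders for illustration.

```python
import sys


def flat_url(host="127.0.0.1", port=8000):
    """Build the address of a running FLAT server (host and port are assumptions)."""
    return f"http://{host}:{port}/"


def show_flat_in_qt(url):
    """Open the given URL in a Qt window instead of a browser tab."""
    # Imported lazily so this module still loads where PySide6 is missing.
    from PySide6.QtCore import QUrl
    from PySide6.QtWebEngineWidgets import QWebEngineView
    from PySide6.QtWidgets import QApplication

    app = QApplication(sys.argv)
    view = QWebEngineView()
    view.setUrl(QUrl(url))
    view.resize(1200, 800)
    view.show()
    # Blocks until the window is closed and returns the Qt exit code.
    return app.exec()
```

Calling show_flat_in_qt(flat_url()) would open the window. FLAT itself would still run as a web server in the background; the widget only replaces the browser tab.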

BTW, I can't import folia2html from the foliatools package in my Python script as I did with foliatools.folia2txt, foliatools.foliafreqlist, foliatools.foliatree. Nevertheless, I can run it from the CLI by "python.exe foliatools\folia2txt.py -s myannotation.xml"

Hmm.. I see.. that should probably be improved, yes.