Open pirolen opened 3 years ago
Those lists are part of the uctodata repository, and referred to from the individual configuration files (like tokconfig-nld). Contributions there are welcome of course! (you can just send a pull request)
Thanks! I have a long list of idiosyncratic abbreviations specific to critical editions. I guess most of them are not useful for general inclusion, e.g. 'Corp.Inscr. Graec.' or 'Abh. der Sächs. Ges.d.Wiss.' or S.A. (=Sonnenaufgang).
Sure, I will suggest generic ones if they emerge.
Bigger context of the question: for 'Abbreviations with multiple periods' in FoLiA documents, I should make sure that
I was wondering how to best use ucto to this end.
The language is mostly German and Latin, so I set the config to German and (as a hopeful fallback) French.
On 10. May 2021, at 13:35, Maarten van Gompel @.***> wrote:
Those lists are part of the uctodata repository, and referred to from the individual configuration files (like tokconfig-nld). Contributions there are welcome of course! (you can just send a pull request)
How is "z.B." supposed to be treated by ucto?
It is not part of the list in https://github.com/LanguageMachines/uctodata/blob/master/config/tokconfig-deu,
and is tokenized with the classes "ABBREVIATION-KNOWN" plus "INITIAL".
Isn't it supposed to be treated as a unit, as an instance of 'Abbreviations with multiple periods'? Maybe @kosloot can tell why, I don't know ucto that well yet.
<w xml:id="FA-b1_3_1_mwtext_ostpreuss_pp109_277_006_abpproc.text.div1.p4.s.4.w.43" class="ABBREVIATION-KNOWN" set="tokconfig-deu" textclass="OCR">
<t class="OCR">z.</t>
</w>
<w xml:id="FA-b1_3_1_mwtext_ostpreuss_pp109_277_006_abpproc.text.div1.p4.s.4.w.44" class="INITIAL" set="tokconfig-deu" textclass="OCR">
<t class="OCR">B.</t>
</w>
That all depends on the ucto rules for German.
In this case z.B. is NOT a known abbreviation. But 'z' is. So 'z.' is tagged as an abbreviation.
There is also a rule that interprets an uppercase 'B.' as an initial:
#retain initials
INITIAL=^(?:\p{Lt}|\p{Lu})\.$
Hence the split into 2 words.
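The interaction of the two rules can be mimicked with a small Python sketch. It is not ucto's implementation: `KNOWN_ABBREVIATIONS`, `is_initial` and `classify` are hypothetical names, `str.isupper()`/`str.istitle()` stand in for `\p{Lu}`/`\p{Lt}` (which Python's `re` module does not support), and it assumes the two parts reach the rules as separate tokens.

```python
# Hypothetical stand-in for the known-abbreviation list: 'z' is in it, 'z.B' is not.
KNOWN_ABBREVIATIONS = {"z"}


def is_initial(token: str) -> bool:
    # Approximates the INITIAL rule: exactly one uppercase or titlecase
    # letter followed by a period.
    return (
        len(token) == 2
        and token[1] == "."
        and (token[0].isupper() or token[0].istitle())
    )


def classify(token: str) -> str:
    # A known abbreviation followed by a period wins over the INITIAL rule.
    if token.endswith(".") and token[:-1] in KNOWN_ABBREVIATIONS:
        return "ABBREVIATION-KNOWN"
    if is_initial(token):
        return "INITIAL"
    return "WORD"


labels = [(tok, classify(tok)) for tok in ["z.", "B."]]
print(labels)
```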
Adding z.B. to the abbreviation list should be enough. My advice is to get a copy of uctodata (from Git), make the additions you would like to see, and install them on your system ('make install').
If you are satisfied, a pull request is welcome.
To make life easier, I separated the German abbreviations into a separate file you can edit: deu.abr
OK, for the pull request I would only add generic abbreviations, right? (Attached an illustration.)
The multi-element abbreviations like d.h. or z.B. need to be in the list as d\.h and z\.B?
Ok, I am a bit surprised now, as I tested it myself with the current setup:
ucto -Ldeu -v
ucto: inputfile =
ucto: outputfile =
ucto: textcat configured from: /home/sloot/usr/local/share/ucto/textcat.cfg
ucto: configured for languages: [deu]
ucto> Dass ist z.B. falsch . <utt>
Dass WORD BEGINOFSENTENCE NEWPARAGRAPH
ist WORD
z.B. ABBREVIATION
falsch WORD
. PUNCTUATION ENDOFSENTENCE
And:
ucto> d.h. dass es gut geht?
d.h. ABBREVIATION BEGINOFSENTENCE
dass WORD
es WORD
gut WORD
geht WORD NOSPACE
? PUNCTUATION ENDOFSENTENCE
So it should work out of the box for those examples.
So there must be some other problem here?? I would need more context.
Despite that: more abbreviations are welcome. (you may also send me a file)
Remember that abbreviations that can be mistaken for real words must be marked.
Like :
dass\.1
and NOT dass
I ran the LaMachine command-line ucto (not python-ucto) with --uselanguages=deu,fra when I got the above tokenization (https://github.com/proycon/python-ucto/issues/9#issuecomment-836635858).
In the untokenized text there is a space between the parts of the abbreviation: "wie z. B. im Kreise".
AH! Well it cannot be solved by Ucto in that case. A space is the major (and only unchangeable) separator between tokens in Ucto.
Sorry for that limitation.
Any type of space (e.g. short (h)space too) counts as a separator?
I assume so. But determining spaces is a hell :{ Ucto uses the ICU:u_isspace() function to do so. see: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a48dd198b451e691cf81eb41831474ddc
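As a rough analogy only: Python's `str.isspace()` is not ICU's `u_isspace()`, but both are driven by Unicode character properties, so a quick check in Python gives an idea of which "short" spaces are likely to act as separators. The character selection below is illustrative, and whether ucto treats each of them identically is an assumption, not something confirmed in this thread.

```python
import unicodedata

# A few space-like characters that can turn up in OCR or typeset text.
candidates = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "THIN SPACE": "\u2009",
    "NARROW NO-BREAK SPACE": "\u202f",
    "ZERO WIDTH SPACE": "\u200b",
}

for name, ch in candidates.items():
    print(
        f"U+{ord(ch):04X} {name}: "
        f"category={unicodedata.category(ch)}, isspace={ch.isspace()}"
    )

# With Python's notion of whitespace, a thin space splits 'z. B.' just like a normal space.
parts = "z.\u2009B.".split()
print(parts)
```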
OK, good to know.
Back to the deu.abr list: "d.h." is not in this list. How come the command line ucto covers it?
That is covered by the ABBREVIATION rule in the 'tokconfig-deu' file
# Abbreviations with multiple periods
ABBREVIATION=^(?:[{(\[\<]?)(\p{L}{1,3}(?:\.\p{L}{1,3})+\.?)(?:\Z|[,:;})\]>])
This regexp says something along the lines of: dot-separated sequences of 1-3 letters are considered an abbreviation, even when placed between brackets like '{ }', '[ ]' or '( )', and optionally ending with a ',', ':' or ';'.
This will catch 'z.B.' and 'd.h.', and also '(z.B.' and '{d.h.}', but NOT 'z. B.' or 'd. h.'
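The behaviour described here can be tried out with a Python approximation of the rule. Two assumptions: the dots and square brackets are written escaped (the regexp as quoted in this thread may show them unescaped), and `[^\W\d_]` stands in for `\p{L}`, which Python's `re` module does not support. This is a sketch for experimenting, not ucto's ICU-based matcher.

```python
import re

# The rule with escaped dots and brackets (an assumption, see above).
PATTERN = r"^(?:[{(\[<]?)(\p{L}{1,3}(?:\.\p{L}{1,3})+\.?)(?:\Z|[,:;})\]>])"

# Python's re has no \p{L}; [^\W\d_] matches "word characters that are
# neither digits nor underscore", i.e. roughly Unicode letters.
ABBREVIATION = re.compile(PATTERN.replace(r"\p{L}", r"[^\W\d_]"))


def is_multi_period_abbreviation(token: str) -> bool:
    return ABBREVIATION.match(token) is not None


for token in ["z.B.", "d.h.", "(z.B.", "{d.h.}", "v.Chr.", "z.", "B.", "Jhs."]:
    print(token, is_multi_period_abbreviation(token))
```

Note that 'Jhs.' is not matched: the rule needs at least two dot-separated parts, so single-part abbreviations still have to come from the abbreviation list.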
I suggested some items for inclusion in the German abbrev list: https://github.com/LanguageMachines/uctodata/compare/master...pirolen:patch-1
(several are actually of Latin origin...)
They seem OK to me, so I merged them.
Adding z.B. to the abbreviation list should be enough. My advice is to get a copy of uctodata (from Git), make the additions you would like to see, and install them on your system ('make install').
I'd like to supply custom abbreviations for python-ucto in my dev LaMachine. Would it work to add them to lmdev/src/uctodata/config/deu.abr and some action to refresh the tool?
Simply adding them to deu.abr should work, yes, but those changes may be overwritten again on a LaMachine update. Alternatively you could make your own ucto configuration (copy tokconfig-deu) and refer to your own abbreviation file. No need to refresh the tool; the data will be loaded dynamically when the tokeniser binding instantiates.
Thanks!
The copy should still be called tokconfig-deu (and saved somewhere)? Or renamed as e.g. my-tokconfig-deu?
Would it (also) work to refer to my own abbrev list? e.g.:
[ABBREVIATIONS]
%include my-deu
Yet another Q: (Where) Would it be possible to address phrases such as "der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird"? I added Jhs, Chr and v.Chr to deu.abr. In tokconfig-deu, the line of #NUMBER-ORDINAL is commented out. If I uncomment it, the sentence still gets split after '1.' and after 'Jhs.' .
(I would be happy to add more abbrevs to https://github.com/LanguageMachines/uctodata/blob/master/config/deu.abr, but some are really domain-specific and I am not sure how much of that you'd like to have.)
Well: tokconfig-deu will be overwritten on a LaMachine update too. So:
The copy should be still called tokconfig-deu (and saved somewhere)? Or renamed as e.g. my-tokconfig-deu? Would it (also) work to refer to my own abbrev list? e.g.: [ABBREVIATIONS] %include my-deu
Yes that's the way to do it. And you should run Ucto using the '-c' option to refer to your own config:
ucto -c my-tokconfig-deu ...
Your other question needs some more thinking. What exactly would you like to come out? Keeping 1. Jhs. together as one token?
Thanks! For the current use case I only need sentence segmentation, and so that the sentence does not get chopped up due to these abbreviations (the entire sentence here is: "Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war.".)
In fact, it would be nice if there would be an option to access all sentences without tokenization. To achieve this, after getting the sentences from the wrapper by tokenizer.sentences(), I am simply re-joining the punctuation marks with the tokens, using string.punctuation... :-o
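A minimal sketch of that re-joining step, assuming the tokens of one sentence are available as a plain list of strings (how they are obtained from the wrapper is left out here). `rejoin`, `OPENING` and `CLOSING` are hypothetical names, and the rule is deliberately crude: opening brackets attach to the token on their right, every other ASCII punctuation mark attaches to the token on its left.

```python
import string

# Opening brackets attach to the following token.
OPENING = set("([{")
# All other ASCII punctuation attaches to the preceding token.
CLOSING = set(string.punctuation) - OPENING


def rejoin(tokens):
    """Join tokens with single spaces, except around stand-alone punctuation."""
    out = ""
    glue_next = True  # no space before the very first token
    for tok in tokens:
        if tok in CLOSING or glue_next:
            out += tok
        else:
            out += " " + tok
        glue_next = tok in OPENING
    return out


print(rejoin(["Siehe", "Seite", "5", ".", "Alles", "Gute"]))
print(rejoin(["geht", "(", "z.B.", ")", "?"]))
```

Quotation marks and dashes are treated like closing punctuation in this sketch, so text containing them would need extra handling.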
OK, so your problem is that this utterance is split into 3 sentences. Hmm. That might not be that easy... In fact, quite hard.
In fact, it would be nice if there would be an option to access all sentences without tokenization.
Well for NON-FoLiA files there is the undocumented --split option which does this. But still would give 3 sentences on this input:
$ more piro
Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war.
$ ucto -Ldeu piro --split
Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. <utt>
Jhs. <utt>
v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war. <utt>
Without --split:
$ ucto -Ldeu piro
Im Talmud erwähnter charismatischer Wundermann , der dort in die Zeit des 1 . <utt> Jhs . <utt> v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war . <utt>
But even if we fix sentence splitting to get just one sentence, this won't work for FoLiA files (not implemented at all).
@pirolen A quick fix might be this:
In your my-tokconfig-deu file replace:
#retain digits, including those starting with initial period (.22), and negative numbers
NUMBER=-?(?:[\.,]?\p{N}+)
by
#retain digits, including those starting with initial period (.22) or ending with a period (1.), and also negative numbers
NUMBER=-?(?:[\.,]?\p{N}+)(?:[\.])?
AND: be sure to add Jhs. to your abbreviation list.
This seems to do the trick:
$ ucto -c my-deu piro
die Zeit des 1. Jhs. v.Chr. eingeordnet <utt>
I'm not sure whether this will disturb tokenization otherwise; at first glance all seems OK.
One thing that might bite you: A sentence ending on such a number will no longer be detected as such. So: "Siehe Seite 5. Alles Gute" will be 1 sentence.
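The effect of the changed rule, including the side effect on sentence-final numbers, can be seen in a small Python comparison. Assumptions: `\d` stands in for `\p{N}` (Python's `re` has no `\p{N}`), and the rule is taken to cover a whole token, which is why `fullmatch` is used; how ucto applies the rule internally is not shown in this thread.

```python
import re

# Original rule: optional sign, optional leading period/comma, digits.
OLD_NUMBER = re.compile(r"-?(?:[\.,]?\d+)")
# Modified rule: additionally allows one trailing period, as in '1.'.
NEW_NUMBER = re.compile(r"-?(?:[\.,]?\d+)(?:[\.])?")

for token in ["1.", "5.", ".22", "-3", "1"]:
    old = OLD_NUMBER.fullmatch(token) is not None
    new = NEW_NUMBER.fullmatch(token) is not None
    print(f"{token!r}: old={old} new={new}")
```

With the new rule '5.' is a number token as a whole, so the period in "Siehe Seite 5." is no longer available as a sentence end.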
Thanks!! I need to stay in FoLiA.
For me the renamed/customised config file does not work, neither with ucto -c on the CLI as in your example, nor with python-ucto in a script as usual, i.e.
configurationfile = "my-tokconfig-deu"
tokenizer = ucto.Tokenizer(configurationfile)
In both cases I get:
ucto:Unable to open configfile:
ucto:Cannot read Tokenizer settingsfile my-tokconfig-deu
ucto:Unsupported language? (Did you install the uctodata package?)
:-(
Hmm, works for me on the command-line. So maybe a LaMachine oddity?
ucto -c my-tokconfig.deu
ucto: inputfile =
ucto: outputfile =
ucto: textcat configured from: /home/sloot/usr/local/share/ucto/textcat.cfg
ucto: configured from file: my-tokconfig.deu
ucto> Siehe Seite 5. Alles Gute
Siehe Seite 5. Alles Gute <utt>
Or is there something I don't know about the extension convention? In LaMachine there is no .deu extension, if I see it correctly.
No, the extension doesn't matter. The abbreviation file should have the extension .abr, though.
Using the custom config file with the command line ucto in LaMachine works if:
the custom abbreviation list is referenced from the config file (e.g. 'my-tokconfig.deu') using the full path, e.g. [ABBREVIATIONS] %include /home/ubuntu/lama/src/uctodata/config/my-deu.abr
and the '-c' option is used to refer to 'my-tokconfig.deu' with its full filepath
and the --uselanguages option is not specified.
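For python-ucto, the same "full path" precaution could be sketched as below. `resolve_config` and `make_tokenizer` are hypothetical helper names; `ucto.Tokenizer(configurationfile)` is the call already used earlier in this thread. Whether python-ucto resolves absolute paths the same way as the command-line tool is an assumption here and has not been confirmed above.

```python
from pathlib import Path


def resolve_config(path):
    """Return the absolute path of a config file, or raise if it does not exist."""
    full = Path(path).expanduser().resolve()
    if not full.is_file():
        raise FileNotFoundError(f"ucto config not found: {full}")
    return str(full)


def make_tokenizer(path):
    # python-ucto is imported here so that resolve_config() can be used
    # (and tested) on machines where it is not installed.
    import ucto

    return ucto.Tokenizer(resolve_config(path))
```

The %include line inside the config file would still need the full path to the .abr file, as listed above; this helper only checks the config file itself.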
Is lower/uppercasing retained, i.e. 'Vgl' and 'vgl' are different for ucto?
Yes, these are different. You can change that in the ABBREVIATION-KNOWN rule, using 'ignore case' ((?i)) in the REGEXP.
That would render ALL abbreviations case insensitive.
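What the (?i) flag does, and why it affects every entry at once, can be illustrated in Python, whose `re` module supports the same inline flag. The alternation below is a made-up stand-in with two entries; it is not the actual ABBREVIATION-KNOWN rule that ucto builds from the abbreviation list.

```python
import re

# Hypothetical stand-in for a known-abbreviation rule with two entries.
case_sensitive = re.compile(r"^(vgl|usw)\.?$")
# Same pattern with the inline 'ignore case' flag at the start.
case_insensitive = re.compile(r"(?i)^(vgl|usw)\.?$")

for token in ["vgl.", "Vgl.", "USW."]:
    print(
        token,
        case_sensitive.match(token) is not None,
        case_insensitive.match(token) is not None,
    )
```

To accept only selected capitalised variants, listing both forms (e.g. 'Vgl' and 'vgl') in the abbreviation file would avoid switching the whole rule to case-insensitive matching.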
What is the best way to supply a list of known abbreviations to python-ucto and ucto in LaMachine?