proycon / python-ucto

This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).

Question: Abbreviations list #9

Open pirolen opened 3 years ago

pirolen commented 3 years ago

What is the best way to supply a list of known abbreviations to python-ucto and ucto in LaMachine?

proycon commented 3 years ago

Those lists are part of the uctodata repository, and referred to from the individual configuration files (like tokconfig-nld). Contributions there are welcome of course! (you can just send a pull request)

pirolen commented 3 years ago

Thanks! I have a long list of idiosyncratic and critical-edition-specific abbreviations. I guess most of them are not useful to include in general, e.g. 'Corp.Inscr. Graec.' or 'Abh. der Sächs. Ges.d.Wiss.' or S.A. (=Sonnenaufgang).

Sure, I will suggest generic ones if they emerge.

Bigger context of the question: for 'Abbreviations with multiple periods’ in FoLiA documents, I should make sure that

I was wondering how to best use ucto to this end.

The language is mostly German and Latin, so I set the config to German and (as a hopeful fallback) French.

pirolen commented 3 years ago

How is "z.B." supposed to be treated by ucto? It is not part of the list in https://github.com/LanguageMachines/uctodata/blob/master/config/tokconfig-deu,
and is tokenized with the classes "ABBREVIATION-KNOWN" plus "INITIAL".

Isn't it supposed to be treated as a unit and as an instance of 'Abbreviations with multiple periods'? Maybe @kosloot can tell why; I don't know ucto that well yet.

<w xml:id="FA-b1_3_1_mwtext_ostpreuss_pp109_277_006_abpproc.text.div1.p4.s.4.w.43" class="ABBREVIATION-KNOWN" set="tokconfig-deu" textclass="OCR">
  <t class="OCR">z.</t>
</w>
<w xml:id="FA-b1_3_1_mwtext_ostpreuss_pp109_277_006_abpproc.text.div1.p4.s.4.w.44" class="INITIAL" set="tokconfig-deu" textclass="OCR">
  <t class="OCR">B.</t>
</w>

kosloot commented 3 years ago

That all depends on the ucto rules for German.

In this case z.B. is NOT a known abbreviation. But 'z' is. So 'z.' is tagged as an abbreviation.

Also, there is a rule to interpret an uppercase B. as an initial:

#retain initials
INITIAL=^(?:\p{Lt}|\p{Lu})\.$

Hence the split into 2 words.
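
To see which tokens that INITIAL rule accepts, here is a minimal Python sketch. It is only an approximation under stated assumptions: Python's built-in re module has no \p{Lt} or \p{Lu}, so the check uses unicodedata instead, and the helper name looks_like_initial is hypothetical (it is not part of ucto or python-ucto).

```python
import unicodedata


def looks_like_initial(token: str) -> bool:
    """Approximate the INITIAL rule: one titlecase or uppercase letter plus a period."""
    if len(token) != 2 or token[1] != ".":
        return False
    # \p{Lt} is the Unicode category "Lt", \p{Lu} is "Lu"
    return unicodedata.category(token[0]) in ("Lt", "Lu")


if __name__ == "__main__":
    for tok in ["B.", "z.", "Ä.", "z.B."]:
        print(tok, looks_like_initial(tok))
```

With this check, 'B.' qualifies as an initial while 'z.' does not, which is consistent with 'z.' being picked up by the known-abbreviation list instead.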

Adding z.B. to the abbreviation list should be enough. My advice is to get a copy of uctodata (from Git), make the additions you would like to see, and install them on your system ('make install').

If you are satisfied, a pull request is welcome.

kosloot commented 3 years ago

To make life easier, I separated the German abbreviations into a separate file you can edit: deu.abr

pirolen commented 3 years ago

OK, for the pull request I would only add generic abbreviations, right? (Attached an illustration.)

The multi-element abbreviations like d.h. or z.B. need to be in the list as d\.h and z\.B?

Screenshot 2021-05-12 at 12 36 28

kosloot commented 3 years ago

Ok, I am a bit surprised now, as I tested it myself with the current setup:

ucto -Ldeu -v
ucto: inputfile = 
ucto: outputfile = 
ucto: textcat configured from: /home/sloot/usr/local/share/ucto/textcat.cfg
ucto: configured for languages: [deu]
ucto> Dass ist z.B. falsch . <utt> 
Dass    WORD    BEGINOFSENTENCE NEWPARAGRAPH 
ist WORD    
z.B.    ABBREVIATION    
falsch  WORD    
.   PUNCTUATION ENDOFSENTENCE 

And:

ucto> d.h. dass es gut geht?
d.h.    ABBREVIATION    BEGINOFSENTENCE 
dass    WORD    
es  WORD    
gut WORD    
geht    WORD    NOSPACE 
?   PUNCTUATION ENDOFSENTENCE 

So it should work out of the box for those examples.

So there must be some other problem here?? I would need more context.

Regardless, more abbreviations are welcome (you may also send me a file). Remember that abbreviations that can be mistaken for real words must be marked, like dass\.1 and NOT dass.

pirolen commented 3 years ago

I ran the LaMachine command-line ucto (not python-ucto) with --uselanguages=deu,fra when I got the above tokenization (https://github.com/proycon/python-ucto/issues/9#issuecomment-836635858).

In the untokenized text there is a space between the parts of the abbreviation: "wie z. B. im Kreise".

kosloot commented 3 years ago

In the untokenized text there is a space between the parts of the abbreviation: "wie z. B. im Kreise".

AH! Well it cannot be solved by Ucto in that case. A space is the major (and only unchangeable) separator between tokens in Ucto.

Sorry for that limitation.

pirolen commented 3 years ago

Any type of space (e.g. short (h)space too) counts as a separator?

kosloot commented 3 years ago

I assume so. But determining what counts as a space is hell :{ Ucto uses the ICU u_isspace() function to do so. See: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a48dd198b451e691cf81eb41831474ddc
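
For a rough idea of which characters that covers, here is a small Python sketch. It is an approximation based on my reading of the ICU uchar.h notes (u_isspace is described there as the "Z" separator categories plus the whitespace ISO controls, including no-break spaces); the exact control range below and the function name approx_u_isspace are assumptions, and the result has not been checked against ucto itself.

```python
import unicodedata

# Assumption: the whitespace controls are TAB..CR, U+001C..U+001F and U+0085.
_CONTROL_SPACES = set(range(0x09, 0x0E)) | set(range(0x1C, 0x20)) | {0x85}


def approx_u_isspace(ch: str) -> bool:
    """Rough stand-in for ICU u_isspace(): Zs/Zl/Zp categories plus whitespace controls."""
    return ord(ch) in _CONTROL_SPACES or unicodedata.category(ch).startswith("Z")


if __name__ == "__main__":
    samples = {
        "SPACE": " ",
        "THIN SPACE": "\u2009",
        "NO-BREAK SPACE": "\u00a0",
        "NARROW NO-BREAK SPACE": "\u202f",
        "ZERO WIDTH SPACE": "\u200b",
    }
    for name, ch in samples.items():
        print(f"{name}: {approx_u_isspace(ch)}")
```

Under these assumptions a thin space or a narrow no-break space would count as a separator, while the zero width space (category Cf) would not.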

pirolen commented 3 years ago

OK, good to know.

Back to the deu.abr list: "d.h." is not in this list. How come the command line ucto covers it?

kosloot commented 3 years ago

That is covered by the ABBREVIATION rule in the 'tokconfig-deu' file

#Abbreviations with multiple periods
ABBREVIATION=^(?:[{(\[\<]?)(\p{L}{1,3}(?:\.\p{L}{1,3})+\.?)(?:\Z|[,:;})\]>])

This regexp says something along the lines of: dot-separated sequences of 1-3 letters are considered an abbreviation, even when placed between brackets like '{ }', '[ ]' or '( )', and optionally followed by a ',', ':' or ';'.

This will catch 'z.B.' and 'd.h.', and also '(z.B.' and '{d.h.}', but NOT 'z. B.' or 'd. h.'.
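
The behaviour described above can be tried out with a Python approximation of the rule. This is a sketch, not ucto's own matching: Python's re module has no \p{L}, so [^\W\d_] is used as a stand-in for "any letter", the dots and square brackets are written escaped, and the helper name is_multi_period_abbreviation is hypothetical.

```python
import re

# Stand-in for \p{L}: any word character that is neither a digit nor an underscore.
LETTER = r"[^\W\d_]"

ABBREVIATION = re.compile(
    r"^(?:[{(\[<]?)"                                 # optional opening bracket
    rf"({LETTER}{{1,3}}(?:\.{LETTER}{{1,3}})+\.?)"   # 1-3 letters, then one or more ".letters" groups
    r"(?:\Z|[,:;})\]>])"                             # end of token, or closing bracket / , : ;
)


def is_multi_period_abbreviation(token: str) -> bool:
    """Return True if the token looks like a dot-separated abbreviation."""
    return ABBREVIATION.match(token) is not None


if __name__ == "__main__":
    for tok in ["z.B.", "d.h.", "(z.B.", "{d.h.}", "v.Chr.", "z.", "B.", "Jhs."]:
        print(tok, is_multi_period_abbreviation(tok))
```

Note that 'z.' and 'B.' on their own do not match, because the pattern requires at least one period followed by further letters; this is why the space in 'z. B.' defeats the rule.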

pirolen commented 3 years ago

I suggested some items for inclusion in the German abbrev list: https://github.com/LanguageMachines/uctodata/compare/master...pirolen:patch-1

(several are actually of Latin origin...)

kosloot commented 3 years ago

They seem OK to me, so I merged them.

pirolen commented 3 years ago

Adding z.B. to the abbreviation list should be enough. My advice is to get a copy of uctodata (from Git), make the additions you would like to see, and install them on your system ('make install').

I'd like to supply custom abbreviations for python-ucto in my dev LaMachine. Would it work to add them to lmdev/src/uctodata/config/deu.abr and some action to refresh the tool?

proycon commented 3 years ago

Simply adding them to deu.abr should work, yes, but those changes may be overwritten again on a LaMachine update. Alternatively, you could make your own ucto configuration (copy tokconfig-deu) and refer to an abbreviation file of your own. There is no need to refresh the tool; the data will be loaded dynamically when the tokeniser binding instantiates.

pirolen commented 3 years ago

Alternatively, you could make your own ucto configuration (copy tokconfig-deu) and refer to an abbreviation file of your own.

Thanks! The copy should be still called tokconfig-deu (and saved somewhere)? Or renamed as e.g. my-tokconfig-deu? Would it (also) work to refer to my own abbrev list? e.g.: [ABBREVIATIONS] %include my-deu

Yet another Q: (Where) Would it be possible to address phrases such as "der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird"? I added Jhs, Chr and v.Chr to deu.abr. In tokconfig-deu, the line of #NUMBER-ORDINAL is commented out. If I uncomment it, the sentence still gets split after '1.' and after 'Jhs.' .

(I would be happy to add more abbrevs to https://github.com/LanguageMachines/uctodata/blob/master/config/deu.abr, but some are really domain-specific and I am not sure how much of that you'd like to have.)

kosloot commented 3 years ago

Well, tokconfig-deu will be overwritten on a LaMachine update too. So:

The copy should be still called tokconfig-deu (and saved somewhere)? Or renamed as e.g. my-tokconfig-deu? Would it (also) work to refer to my own abbrev list? e.g.: [ABBREVIATIONS] %include my-deu

Yes that's the way to do it. And you should run Ucto using the '-c' option to refer to your own config:

ucto -c my-tokconfig-deu ...

Your other question needs some more thinking. What exactly would you like to come out? Keeping 1. Jhs. together as one token?

pirolen commented 3 years ago

Thanks! For the current use case I only need sentence segmentation, and so that the sentence does not get chopped up due to these abbreviations (the entire sentence here is: "Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war.".)

In fact, it would be nice if there would be an option to access all sentences without tokenization. To achieve this, after getting the sentences from the wrapper by tokenizer.sentences(), I am simply re-joining the punctuation marks with the tokens, using string.punctuation... :-o
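
For illustration, the re-joining step described here might look roughly like the sketch below. It works on a plain list of token strings so that it runs without ucto installed; the function name rejoin_punctuation is hypothetical, and the approach is a heuristic: it glues every punctuation-only token onto the preceding token, so opening brackets or opening quotes would end up attached to the wrong side.

```python
import string


def rejoin_punctuation(tokens):
    """Glue punctuation-only tokens onto the preceding token; join the rest with spaces."""
    parts = []
    for tok in tokens:
        is_punct = all(ch in string.punctuation for ch in tok)
        if parts and is_punct:
            parts[-1] += tok      # attach ',' '.' '?' etc. to the previous word
        else:
            parts.append(tok)
    return " ".join(parts)


if __name__ == "__main__":
    tokens = "Im Talmud erwähnter charismatischer Wundermann , der dort".split()
    print(rejoin_punctuation(tokens))
```

If I remember the python-ucto README correctly, each token also exposes a nospace() flag, which would likely be a more reliable basis for reconstructing the original spacing than a punctuation heuristic.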

kosloot commented 3 years ago

OK, so your problem is that this utterance is split into 3 sentences. Hmm. That might not be that easy... In fact, it is quite hard.

In fact, it would be nice if there would be an option to access all sentences without tokenization.

Well, for NON-FoLiA files there is the undocumented --split option, which does this. But it would still give 3 sentences on this input:

$ more piro
Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war.

$ ucto -Ldeu piro --split
Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. <utt> 
Jhs. <utt> 
v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war. <utt> 

Without --split:

$ ucto -Ldeu piro 

Im Talmud erwähnter charismatischer Wundermann , der dort in die Zeit des 1 . <utt> Jhs . <utt> v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war . <utt> 

But even if we fix sentence splitting to get just one sentence, this won't work for FoLiA files (not implemented at all).

kosloot commented 3 years ago

@pirolen A quick fix might be this:

In your my-tokconfig-deu file replace:

#retain digits, including those starting with initial period (.22), and negative numbers
NUMBER=-?(?:[\.,]?\p{N}+)

by

#retain digits, including those starting with initial period (.22) or ending with a period (1.), and also negative numbers
NUMBER=-?(?:[\.,]?\p{N}+)(?:[\.])?

AND: be sure to add Jhs. to your abbreviation list.

This seems to do the trick:

$ ucto -c my-deu piro 
die Zeit des 1. Jhs. v.Chr. eingeordnet <utt> 

I'm not sure if this will disturb tokenization otherwise; at first glance all seems OK.

One thing that might bite you: A sentence ending on such a number will no longer be detected as such. So: "Siehe Seite 5. Alles Gute" will be 1 sentence.
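
The difference between the two NUMBER rules, and the caveat about 'Seite 5.', can be illustrated with a small Python sketch. It is an approximation only: \d stands in for ICU's broader \p{N}, the patterns are tested against whole tokens with fullmatch, and how ucto actually applies its rules in sequence is not modelled here. The helper name whole_token_matches is hypothetical.

```python
import re

# \d is used as a stand-in for \p{N} (an assumption; \p{N} covers more characters).
NUMBER_ORIGINAL = re.compile(r"-?(?:[\.,]?\d+)")
NUMBER_MODIFIED = re.compile(r"-?(?:[\.,]?\d+)(?:[\.])?")


def whole_token_matches(pattern, token):
    """Return True if the pattern covers the entire token."""
    return pattern.fullmatch(token) is not None


if __name__ == "__main__":
    for tok in ["1", "1.", ".22", "-5", "5."]:
        print(
            tok,
            whole_token_matches(NUMBER_ORIGINAL, tok),
            whole_token_matches(NUMBER_MODIFIED, tok),
        )
```

Only the modified pattern accepts '1.' and '5.' as a single number token, which is what keeps '1. Jhs.' together but also swallows the sentence-final period in 'Seite 5.'.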

pirolen commented 3 years ago

Thanks!! I need to stay in FoLiA.

For me the renamed/customised config file does not work, neither with ucto -c on the CLI as in your example, nor with python-ucto in a script as usual, i.e.

configurationfile = "my-tokconfig-deu"
tokenizer = ucto.Tokenizer(configurationfile)

In both cases I get:

ucto:Unable to open configfile:
ucto:Cannot read Tokenizer settingsfile my-tokconfig-deu
ucto:Unsupported language? (Did you install the uctodata package?)

:-(

kosloot commented 3 years ago

Hmm, works for me on the command-line. So maybe a LaMachine oddity?

ucto -c my-tokconfig.deu 
ucto: inputfile = 
ucto: outputfile = 
ucto: textcat configured from: /home/sloot/usr/local/share/ucto/textcat.cfg
ucto: configured from file: my-tokconfig.deu
ucto> Siehe Seite 5. Alles Gute
Siehe Seite 5. Alles Gute <utt> 

pirolen commented 3 years ago

Or am I missing something about the extension convention? In LaMachine there is no .deu extension, as far as I can see.

kosloot commented 3 years ago

No, the extension doesn't matter. The abbreviation file should have the extension .abr, though.

pirolen commented 3 years ago

Using the custom config file with the command line ucto in LaMachine works if:

pirolen commented 3 years ago

Is lower/uppercasing retained, i.e. 'Vgl' and 'vgl' are different for ucto?

kosloot commented 3 years ago

Is lower/uppercasing retained, i.e. 'Vgl' and 'vgl' are different for ucto?

Yes, these are different. You can change that in the ABBREVIATION-KNOWN rule by using 'ignore case' ((?i)) in the regexp. That would render ALL abbreviations case-insensitive.
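
The general effect of (?i) can be shown with a short Python sketch. This only illustrates the regex flag; it does not reproduce how ucto assembles its ABBREVIATION-KNOWN rule internally, and the sample entries and the helper name build_known_rule are hypothetical.

```python
import re

abbreviations = ["vgl", "Jhs", "Chr"]  # hypothetical sample entries


def build_known_rule(entries, ignore_case=False):
    """Build one regex matching any listed abbreviation followed by a period."""
    alternation = "|".join(re.escape(e) for e in entries)
    prefix = "(?i)" if ignore_case else ""   # the inline flag must come first
    return re.compile(rf"{prefix}^(?:{alternation})\.$")


if __name__ == "__main__":
    strict = build_known_rule(abbreviations)
    relaxed = build_known_rule(abbreviations, ignore_case=True)
    for tok in ["vgl.", "Vgl.", "VGL."]:
        print(tok, bool(strict.match(tok)), bool(relaxed.match(tok)))
```

As noted above, the flag applies to the whole rule, so every entry becomes case-insensitive, not just 'vgl'.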