nat1881 / cltk

The Classical Language Toolkit
http://cltk.org
MIT License

Apostrophe handling in French word tokenizer #2

Closed (diyclassics closed this issue 7 years ago)

diyclassics commented 7 years ago

The code...

from cltk.tokenize.word import WordTokenizer
word_tokenizer = WordTokenizer('french')
word_tokenizer.tokenize('Vïande au sel de la salliere N\'atouche, c\'est laide manière.')

...returns:

['Vïande', 'au', 'sel', 'de', 'la', 'salliere', 'N', "'atouche", ',', 'c', "'est", 'laide', 'manière', '.']

Looking at "N'atouche" and "c'est", it seems the apostrophe would be better kept with "N" and "c", or separated altogether. Cf. the discussion of "What is the appropriate tokenization of 'Qu'est-ce que c'est?'" on this page under "Tokenization": https://de.dariah.eu/tatom/preprocessing.html. I'll leave it to you to decide what makes the most sense for Old French (OF) and Middle French (MF) and to rewrite the word tokenizer accordingly. Either way, an explicit decision about how this word tokenizer should handle elision would be helpful.

diyclassics commented 7 years ago

BTW, it may well be that the OF/MF word tokenizer needs parameters to select between different output styles or, as with NLTK, that we need multiple tokenizers with different outputs.
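
To make the parameter idea concrete, here is a minimal sketch of a tokenizer with a switchable apostrophe policy. The function `tokenize_french`, its `apostrophe` argument, and the policy names are hypothetical and are not part of CLTK; the regular expressions are only an approximation (for example, a closing single quote after a word would be treated as an elision in "left" mode).

```python
import re

# Hypothetical patterns, one per apostrophe policy (not CLTK's API).
# \w is Unicode-aware in Python 3, so "ï" and "è" count as word characters.
_PATTERNS = {
    # apostrophe stays with the elided word: "N'atouche" -> "N'", "atouche"
    "left": re.compile(r"\w+['’]|\w+|[^\w\s]"),
    # apostrophe becomes its own token: "N'atouche" -> "N", "'", "atouche"
    "separate": re.compile(r"\w+|[^\w\s]"),
    # apostrophe goes with the following word (the output reported above)
    "right": re.compile(r"['’]\w+|\w+|[^\w\s]"),
}


def tokenize_french(text, apostrophe="left"):
    """Split text into word and punctuation tokens under the chosen policy."""
    if apostrophe not in _PATTERNS:
        raise ValueError("unknown apostrophe policy: %r" % apostrophe)
    return _PATTERNS[apostrophe].findall(text)


sentence = "Vïande au sel de la salliere N'atouche, c'est laide manière."
print(tokenize_french(sentence, apostrophe="left"))
```

With `apostrophe="left"` the elided forms come out as "N'" and "c'"; with `apostrophe="right"` the sketch reproduces the "'atouche" / "'est" split from the original report.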

nat1881 commented 7 years ago

Thank you for flagging this up! I was thinking of getting rid of all apostrophes at the word level, i.e. spelling out "N'" > "Ne", "c'" > "ce", etc. Elision is pretty consistent before vowels, and I think spelling it out will help with lemmatization later.
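
A rough sketch of what that spelling-out step could look like, applied before tokenization. The helper `expand_elisions` and the `ELISIONS` table are hypothetical: only the "n'" and "c'" entries come from this thread, the other two are illustrative assumptions, and ambiguous clitics such as "l'" are deliberately left out.

```python
import re

# "n" and "c" are taken from the comment above; "d" and "qu" are
# illustrative assumptions, not a vetted list for Old/Middle French.
ELISIONS = {"n": "ne", "c": "ce", "d": "de", "qu": "que"}

# Match an elided clitic at the start of a word, followed by an apostrophe
# and another word character. Longer clitics are tried first.
_ELISION_RE = re.compile(
    r"\b(" + "|".join(sorted(ELISIONS, key=len, reverse=True)) + r")['’](?=\w)",
    re.IGNORECASE,
)


def expand_elisions(text):
    """Replace elided clitics with their assumed full form plus a space."""

    def _expand(match):
        clitic = match.group(1)
        full = ELISIONS[clitic.lower()]
        # Preserve an initial capital, e.g. "N'" -> "Ne"
        if clitic[0].isupper():
            full = full.capitalize()
        return full + " "

    return _ELISION_RE.sub(_expand, text)


print(expand_elisions("N'atouche, c'est laide manière."))
```

Because the pattern only fires at a word boundary, a form like "Tresqu'" is left untouched by this sketch and would need its own handling.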

nat1881 commented 7 years ago

The updated version spells out contracted forms (very messily, but it works in isolation).

mlj commented 7 years ago

I think we should do what the Syntactic Reference Corpus of Medieval French did in these cases; there is no reason to do things differently unless we have a good linguistic and/or computational reason for it. From a very cursory inspection of their Roland, it looks like they have done what @diyclassics suggests above, e.g.:

Tresqu'
en
la
mer
cunquist
la
tere
altaigne

N'
i
ad
castel
ki
*
devant
lui
remaigne

mlj commented 7 years ago

(Your lemmatiser still ought to figure out that Tresqu' is tresque etc., but you'll have to do it as part of the lemmatiser.)
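
One way the lemmatiser side could be sketched: keep the token as "Tresqu'" and generate candidate full forms only at lookup time. The function `elision_candidates` is hypothetical, and restoring a final "e" by default is a simplifying assumption; other vowels can be passed in where the elided vowel is ambiguous.

```python
APOSTROPHES = ("'", "’")


def elision_candidates(token, vowels=("e",)):
    """Return lower-cased candidate forms for a lemma lookup.

    Tokens ending in an apostrophe get each candidate vowel restored;
    all other tokens are simply lower-cased.
    """
    if not token.endswith(APOSTROPHES):
        return [token.lower()]
    stem = token[:-1].lower()
    return [stem + vowel for vowel in vowels]


print(elision_candidates("Tresqu'"))
print(elision_candidates("l'", vowels=("e", "a")))
```

This keeps the tokenizer output aligned with the corpus convention while leaving the choice among candidates to the lemmatiser.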

nat1881 commented 7 years ago

Makes sense - changes forthcoming.