thebugcreator / ul-pmb

This is part of the French PMB project that deals with text tokenisation and semantic tagging.

Difficult cases for French tokenisation #1

Closed bguil closed 2 years ago

bguil commented 2 years ago

Here are the cases where I'm not sure about how to tokenise. Should we split or not at the * position? @maxamb: your opinion?

M. + Proper name

Dates

Appositions

Il y a

siyanapavlova commented 2 years ago

I would say that, if we follow the (unofficial) PMB Manual and the Towards Universal Semantic Tagging paper, words that should receive different semantic tags should also be separate tokens.

M. + Proper Name

Dates

Appositions

This made me wonder whether, e.g., Buckingham Palace would also be one token. Probably yes (see 23/1396, where "Great Pyramid" is also one token). But then again, in French it is palais de Buckingham, which sounds more like "the palace Buckingham / the palace that is named Buckingham" than "the building that is named Buckingham Palace".

Il y a

I don't know how to treat "il y a", so I'll leave this to the native speakers for now :)

thebugcreator commented 2 years ago

I have created a potato tokeniser for the easy cases. For these difficult cases I'll follow the discussion.

thebugcreator commented 2 years ago

I was looking for a list of French titles to implement and I found this website. Is it useful, or should I look for information elsewhere? Also, how many titles should we cover for PROPN? I was about to pick only the general ones {"Monsieur", "Madame", "Mademoiselle"} and their abbreviations {"M.", "Mme", "Mlle"}.

bguil commented 2 years ago

For rules that use lexical information, that information should be treated as a parameter that can be adapted later on.

For French titles, we can start with the list from the website and adapt it later if needed.
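One way to keep the title list as a parameter is sketched below. This is only an illustration, not the project's tokeniser: the function name `split_words`, the `DEFAULT_TITLES` set, and the punctuation pattern are all assumptions made for the example. It only prevents the full stop of an abbreviated title such as "M." from being split off; whether the title is then attached to the following proper name is a separate decision that the sketch does not take.

```python
import re

# Hypothetical starting list (the general titles mentioned in this thread).
# It is passed as a parameter so it can be replaced by a longer list later.
DEFAULT_TITLES = frozenset(
    {"Monsieur", "Madame", "Mademoiselle", "M.", "Mme", "Mlle"}
)


def split_words(text, titles=DEFAULT_TITLES):
    """Split on whitespace and detach trailing punctuation,
    except when the whole word is a known title such as "M."."""
    tokens = []
    for word in text.split():
        if word in titles:
            # Known title: keep it intact, including its full stop.
            tokens.append(word)
            continue
        match = re.fullmatch(r"(.*?)([.,;:!?]+)", word)
        if match and match.group(1):
            tokens.append(match.group(1))
            # Emit each trailing punctuation character as its own token.
            tokens.extend(match.group(2))
        else:
            tokens.append(word)
    return tokens
```

Calling `split_words("M. Dupont arrive.")` keeps "M." together, while passing `titles=frozenset()` falls back to splitting the full stop off like any other punctuation.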

bguil commented 2 years ago

I've decided on the tokenisation of the remaining cases, following @siyanapavlova's comment. For the last case (il y a), I've chosen:

  1. If it is used as a frozen expression, as in il y a 30 ans [30 years ago], then it is one token: il_y_a
  2. If it is used as a translation of there is, then it is "compositional" and contains 3 tokens: il y a (note that in this case, we can have negation (il n'y a pas) or another tense (il y avait)).

In the 163 sentences, we have 2 examples of each case:

Yunus a fondé la banque Grameen il_y_a 30 ans .
Il y a aussi des touristes français .
Il y a un biscuit sous la table .
Marilyn_Monroe est morte il_y_a 33 ans .
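
The two cases could be told apart with a simple heuristic, sketched below. This is an assumption made for illustration, not the rule used in the project: it merges "il y a" only when it is directly followed by a digit string and a time unit. The name `merge_il_y_a` and the `TIME_UNITS` list are hypothetical, and the list is a parameter so it can be adapted later.

```python
# Hypothetical list of time units; meant to be extended or replaced.
TIME_UNITS = frozenset({
    "an", "ans", "mois", "jour", "jours", "semaine", "semaines",
    "heure", "heures", "minute", "minutes",
})


def merge_il_y_a(tokens, time_units=TIME_UNITS):
    """Merge 'il y a' into one token when followed by <digits> <time unit>."""
    out = []
    i = 0
    while i < len(tokens):
        window = [t.lower() for t in tokens[i:i + 3]]
        is_il_y_a = window == ["il", "y", "a"]
        followed_by_duration = (
            i + 4 < len(tokens)
            and tokens[i + 3].isdigit()
            and tokens[i + 4].lower() in time_units
        )
        if is_il_y_a and followed_by_duration:
            # Frozen expression: join the three words with underscores.
            out.append("_".join(tokens[i:i + 3]))
            i += 3
        else:
            # Compositional use (or any other token): leave untouched.
            out.append(tokens[i])
            i += 1
    return out


frozen = merge_il_y_a("Yunus a fondé la banque Grameen il y a 30 ans .".split())
compositional = merge_il_y_a("Il y a un biscuit sous la table .".split())
```

The heuristic misses durations written in words (il y a un an, il y a longtemps), so it would need a richer pattern or manual checking for those.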