Closed bguil closed 2 years ago
I would say that, if we are to follow the (unofficial) PMB Manual and the Towards Universal Semantic Tagging paper, then words that should receive different semantic tags should also be separate tokens.
This made me wonder whether, e.g., Buckingham Palace would also be one token. Probably yes (see 23/1396, where "Great Pyramid" is also one token). But then again, in French it is palais de Buckingham, which sounds more like "the palace Buckingham / the palace that is named Buckingham" than "the building that is named Buckingham Palace".
I don't know how to treat "il y a"; I leave this to the native speakers for now :)
I have created a potato tokeniser for the easy cases. For these difficult cases I'll follow the discussion.
I was looking for a list of French titles to implement and I found this website. Is it useful, or should I look for information elsewhere?
Also, how far should we go in covering titles for PROPN? I was about to pick only the general ones, {"Monsieur", "Madame", "Mademoiselle"}, and their abbreviations, {"M.", "Mme", "Mlle"}.
For the rules which use lexical information, this information should be treated as a parameter that can be adapted later on.
For French titles, we can start with the list from the website and adapt it later if needed.
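To make the "lexical information as a parameter" idea concrete, here is a minimal sketch, not the actual tokeniser from this thread. The names `DEFAULT_TITLES` and `tokenise` are hypothetical, and the starter list is simply the six forms mentioned above. The only thing it does with the list is keep an abbreviated title such as "M." from being split at its period; whether the title is then joined to the following name is left open.

```python
import re

# Hypothetical starter list (the six forms mentioned in this thread).
# It is passed as a parameter so it can be replaced or extended later.
DEFAULT_TITLES = frozenset(
    {"Monsieur", "Madame", "Mademoiselle", "M.", "Mme", "Mlle"}
)


def tokenise(sentence, titles=DEFAULT_TITLES):
    """Split on whitespace, then detach trailing ., ! or ? from each chunk,
    unless the whole chunk is a known title (so "M." stays intact)."""
    tokens = []
    for chunk in sentence.split():
        if chunk in titles:
            tokens.append(chunk)
            continue
        match = re.fullmatch(r"(.+?)([.!?]+)", chunk)
        if match:
            # Word followed by final punctuation: emit both as tokens.
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(chunk)
    return tokens
```

Calling `tokenise("M. Zhao est mort à Pékin.")` keeps "M." as one token, while passing `titles=frozenset()` shows the effect of the list by splitting it into "M" and ".".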
I've decided on the tokenisation of the remaining cases, following @siyanapavlova's comment. For the last case (il y a), I've chosen:
il_y_a
il y a
(note that in this case, we can have negation (il n'y a pas) or another tense (il y avait)). In the 163 sentences, we have 2 examples of each case:
Yunus a fondé la banque Grameen il_y_a 30 ans .
Il y a aussi des touristes français .
Il y a un biscuit sous la table .
Marilyn_Monroe est morte il_y_a 33 ans .
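In the four examples above, the joined form il_y_a only appears directly before a number and a time unit. The sketch below turns that observation into a rule. This heuristic is my assumption, not a decision recorded in this thread, and `join_il_y_a` and `TIME_UNITS` are hypothetical names. It only recognises digit quantities, so a phrase like "il y a un an" would be left unjoined.

```python
# Hypothetical, non-exhaustive list of time units; treat it as a parameter.
TIME_UNITS = frozenset(
    {"an", "ans", "mois", "semaine", "semaines", "jour", "jours",
     "heure", "heures", "minute", "minutes"}
)


def join_il_y_a(tokens, time_units=TIME_UNITS):
    """Merge "il y a" into one token when it is followed by a digit
    quantity and a time unit; leave every other "il y a" untouched."""
    out = []
    i = 0
    while i < len(tokens):
        window = [t.lower() for t in tokens[i:i + 3]]
        is_temporal = (
            window == ["il", "y", "a"]
            and i + 4 < len(tokens)              # room for number + unit
            and tokens[i + 3].isdigit()
            and tokens[i + 4].lower() in time_units
        )
        if is_temporal:
            out.append("_".join(tokens[i:i + 3]))  # keeps original casing
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Run on the whitespace-split version of "Yunus a fondé la banque Grameen il y a 30 ans .", this yields the il_y_a token, while "Il y a un biscuit sous la table ." is returned unchanged.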
Here are the cases where I'm not sure about how to tokenise. Should we split or not at the * position? @maxamb: your opinion?

M. + Proper name
* Curtis .
* Brown .
* Zhao est mort à Pékin .

Dates
* 1902 .
* août .

Appositions
* Scott a découvert l' île * Scott en décembre_1902 .
* Cortland .

Il y a
* y * a 30 ans .
* y * a 33 ans .