ianmackinnon / inflect

Human language inflection data

Interpretation of repeated infinitives? #1

Open mitchblank opened 4 years ago

mitchblank commented 4 years ago

The french-verb-conjugation.csv file has places where the same infinitive (column 1) appears multiple times in the table:

$ cut -d, -f1 french-verb-conjugation.csv | grep . | sort | uniq -c | awk '$1>1' | wc -l
      14

There seem to be multiple sources of this.

First, there are 4 cases where 100% identical lines appear in the CSV file:

% sort < french-verb-conjugation.csv | uniq -c | sort -nr | grep -v ' 1 ' | cut -d, -f1
   2 tomber
   2 recroître
   2 dédoubler
   2 croître

Those are easy to ignore.

The remaining ones are cases where the same infinitive appears, but the conjugated forms in the other columns carry a prefix. For example, there is a normal entry for pouvoir, but also another line that is the entry I would expect for repouvoir:

% grep '^pouvoir,' french-verb-conjugation.csv | cut -d, -f1-9
pouvoir,pouvant,pouvant,pu,avoir,peux,peux,peut,pouvons
pouvoir,repouvant,repouvant,repu,avoir,repeux,repeux,repeut,repouvons

There doesn't seem to be a normal entry for repouvoir in the CSV, so it seems that the prefix is just stripped from the infinitive form?

That pattern seems to hold for four of the other duplicated infinitives (including moudre, which appears as an infinitive 3 times).

I am far from a native French speaker, but this looks strange to me. Other verbs with prefixes have the infinitive prefixed as well (there are over 300 re- infinitives in the file, for example) so I don't know why these repeat the infinitive.
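
In case it helps anyone triage these, here is a rough Python sketch of how the repeated infinitives could be sorted into the groups described above. The function names (`common_prefix`, `classify_duplicates`) and the rule itself are my own assumptions, not anything from the repo: a row counts as "prefixed" when every cell after column 1 is either identical to the shortest row's cell (like the auxiliary) or is that cell with one consistent prefix in front. The sample data is just the two pouvoir lines quoted above, cut to 9 columns.

```python
import csv
import io
from collections import defaultdict

# The two rows quoted above (first 9 columns only).
SAMPLE = """\
pouvoir,pouvant,pouvant,pu,avoir,peux,peux,peut,pouvons
pouvoir,repouvant,repouvant,repu,avoir,repeux,repeux,repeut,repouvons
"""


def common_prefix(base, other):
    """Return p if every cell of `other` (after column 1) equals the matching
    cell of `base` or is p + that cell; return None if no single p fits."""
    if len(base) != len(other):
        return None
    prefix = None
    for b, o in zip(base[1:], other[1:]):
        if o == b:
            continue  # unchanged cell, e.g. the auxiliary column ("avoir")
        if not b or not o.endswith(b):
            return None
        p = o[: len(o) - len(b)]
        if prefix is None:
            prefix = p
        elif p != prefix:
            return None
    return prefix


def classify_duplicates(rows):
    """Label each infinitive that appears on more than one row as
    'identical', 'prefixed' (with the prefixes found) or 'different'."""
    groups = defaultdict(list)
    for row in rows:
        if row and row[0]:
            groups[row[0]].append(tuple(row))

    report = {}
    for infinitive, entries in groups.items():
        if len(entries) < 2:
            continue
        distinct = list(dict.fromkeys(entries))  # drop 100% identical lines
        if len(distinct) == 1:
            report[infinitive] = ("identical", [])
            continue
        # Assumption: the row with the fewest characters is the unprefixed one.
        base = min(distinct, key=lambda r: sum(len(c) for c in r))
        prefixes = [common_prefix(base, r) for r in distinct if r != base]
        if all(prefixes):
            report[infinitive] = ("prefixed", prefixes)
        else:
            report[infinitive] = ("different", [])
    return report


report = classify_duplicates(csv.reader(io.StringIO(SAMPLE)))
print(report)  # {'pouvoir': ('prefixed', ['re'])}
```

To run it against the real file, the `io.StringIO(SAMPLE)` part could be swapped for `open("french-verb-conjugation.csv", newline="", encoding="utf-8")`. I have not checked the file's encoding or whether any cells need special handling, so treat the labels as a first pass rather than a verdict.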

Then there are 6 other cases where an infinitive appears twice with actually different data:

W1Real commented 1 month ago

I wouldn't put much trust in this dataset; there is no info on how he generated it. There is a dataset out there created by a trusted French institute, but I can't recall the source right now. It would just need some reformatting.