kevinlu1248 / pyate

PYthon Automated Term Extraction
https://kevinlu1248.github.io/pyate/
MIT License

Adding more languages #45

Closed · oterrier closed this issue 3 years ago

oterrier commented 3 years ago

Hi, and thanks a lot for this great package. I am thinking of adding support for more languages (for example fr, es, de, it, ar, pt, etc.). I have been looking at the Opus Wikipedia corpus available here: you can easily download a zip containing a huge list of sentences extracted from Wikipedia. The sentences are located in an XML file, for example:

<s id="1">L'algèbre générale, ou algèbre abstraite, est la branche des mathématiques qui porte principalement sur l'étude des structures algébriques et de leurs relations.</s>
<s id="2">Elle maintient son activité dans les deux Irlandes (État libre d'Irlande, indépendant, et Irlande du Nord, britannique), mais concentre son action sur les intérêts britanniques, surtout en Irlande du Nord.</s>
<s id="3">Il a formé toute une génération de linguistes français, parmi lesquels Émile Benveniste, Marcel Cohen, Georges Dumézil, André Martinet, Aurélien Sauvageot, Lucien Tesnière, Joseph Vendryes, ainsi que le japonisant Charles Haguenauer.</s>
<s id="4">En conséquence, Meillet présente Parry à Matija Murko, savant originaire de Slovénie qui avait longuement écrit sur la tradition héroïque épique dans les Balkans, surtout en Bosnie-Herzégovine.</s>

But I was wondering whether the sentences are too short to be considered 'paragraphs'; apparently the paragraphs used in the English corpus are much longer. Do you think it is worth using this corpus? Maybe I could group a set of sentences (10?) together to build fake paragraphs. What would be your advice here?
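For illustration, the grouping could be as simple as the rough sketch below; the file name fr.xml, the helper name, and the sentence count are placeholders, and the real Opus XML may need extra cleanup.

import xml.etree.ElementTree as ET

def xml_to_paragraphs(xml_path, sentences_per_paragraph=10):
    """Join every N consecutive <s> sentences into one pseudo-paragraph."""
    root = ET.parse(xml_path).getroot()
    # Collect the text of every <s> element in document order.
    sentences = [s.text.strip() for s in root.iter("s") if s.text]
    return [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]

paragraphs = xml_to_paragraphs("fr.xml", sentences_per_paragraph=10)
print(len(paragraphs), "pseudo-paragraphs")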

Best regards

Olivier Terrier

kevinlu1248 commented 3 years ago

Hey there, thanks for reaching out, and I'm glad this package helped you out. I was also considering adding more languages, and that corpus seems simple and effective. The average paragraph length in the English corpus is 4309 characters (run pyate.TermExtraction.get_general_domain().str.len().mean()), and a quick count of the corpus you sent shows its sentences are about 200 characters each, so I think around 20 sentences per paragraph could work fairly well.

Another issue I noticed while checking this out is that the paragraph lengths in the current corpus aren't very consistent either, as you can see by running

>>> pyate.TermExtraction.get_general_domain().str.len()
0       1290
1       1596
2      34818
3      19387
4       1451
       ...  
295    12545
296     5739
297    16695
298     3691
299     2001

but maybe I will open a separate issue for that.
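As a side note, the spread is easy to quantify with plain pandas (describe() is standard pandas, nothing pyate-specific):

import pyate

# Summary statistics (count, mean, std, min/max, quartiles) of the character
# lengths of the paragraphs in the bundled English general-domain corpus.
lengths = pyate.TermExtraction.get_general_domain().str.len()
print(lengths.describe())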

oterrier commented 3 years ago

OK, perfect. I've just converted the French corpus to CSV, taking a window of 20 sentences as you suggested, and I have the file default_general_domain.fr.csv ready (attached). Can you check that it works as expected? If it is OK I can then convert other languages too.
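For reference, here is a rough, self-contained sketch of that conversion, assuming the French CSV should mirror the layout of the shipped default_general_domain.en.csv with a single text column per row; the column name used below is a guess, so match whatever the English file actually uses.

import xml.etree.ElementTree as ET
import pandas as pd

N = 20  # sentences per pseudo-paragraph, as suggested above
sentences = [s.text.strip() for s in ET.parse("fr.xml").getroot().iter("s") if s.text]
paragraphs = [" ".join(sentences[i:i + N]) for i in range(0, len(sentences), N)]

# "SECTION_TEXT" is a placeholder column name; copy the header from the English CSV.
pd.DataFrame({"SECTION_TEXT": paragraphs}).to_csv("default_general_domain.fr.csv", index=False)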

Best regards

default_general_domain.fr.zip

Olivier

kevinlu1248 commented 3 years ago

Hey @oterrier, that looks good! It would be awesome if you could convert some of the other commonly used languages as well. Once you're done, could you place the CSVs with the other files and submit a pull request?

I was also wondering whether we should set up some sort of CLI for installing additional languages, since they are about 5-10 megabytes each (I can imagine the German one is even larger). I'm thinking of something similar to the python -m spacy download ... type of interface. I think we can consider this once we get there. Thanks once again for your help!
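To make the idea concrete, here is a hypothetical sketch of what a python -m pyate download fr style command could look like; none of this exists in pyate today, and the module name, URL template, and install location are all placeholders.

import argparse
import urllib.request
from pathlib import Path

# Placeholder URL template; a real implementation would point at wherever the
# per-language CSVs end up being hosted.
LANG_CSV_URL = "https://example.com/pyate/default_general_domain.{lang}.csv"

def download(lang, dest_dir):
    """Fetch the general-domain CSV for `lang` into `dest_dir` and return its path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "default_general_domain.{}.csv".format(lang)
    urllib.request.urlretrieve(LANG_CSV_URL.format(lang=lang), dest)
    return dest

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download an extra general-domain corpus")
    parser.add_argument("lang", help="language code, e.g. fr, es, de")
    args = parser.parse_args()
    print("Saved", download(args.lang, Path.home() / ".pyate"))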

oterrier commented 3 years ago

#46 Done!