Hey there, thanks for reaching out, and I'm glad this package helped you out. I was also considering adding more languages, and that corpus looks like a simple and effective choice. The average character length of the English corpus is 4309 (run pyate.TermExtraction.get_general_domain().str.len().mean() to check), and a quick count of the corpus you sent shows its sentences are about 200 characters each, so grouping around 20 sentences per paragraph (roughly 4000 characters) should work fairly well.
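A minimal sketch of that sizing check, assuming get_general_domain() returns a pandas Series of paragraph strings (the .str.len() call above suggests it does):
import pyate

# mean character length of the bundled English paragraphs (~4309 per the note above)
lengths = pyate.TermExtraction.get_general_domain().str.len()
# at roughly 200 characters per French Wikipedia sentence, that implies ~20 sentences per paragraph
print(lengths.mean(), round(lengths.mean() / 200))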
Another issue I noticed while checking this out is that even the English corpus's paragraph lengths aren't very consistent, as you can see by running:
>>> pyate.TermExtraction.get_general_domain().str.len()
0 1290
1 1596
2 34818
3 19387
4 1451
...
295 12545
296 5739
297 16695
298 3691
299 2001
but maybe I will open another issue for that.
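For what it's worth, a quick way to quantify that spread, under the same assumption that get_general_domain() returns a Series of paragraph strings:
>>> pyate.TermExtraction.get_general_domain().str.len().describe()  # count, mean, std, min, quartiles, max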
OK, perfect. I've just converted the French corpus to csv, taking a window of 20 sentences as you suggested. The file default_general_domain.fr.csv is ready (attached: default_general_domain.fr.zip). Can you check that it works as expected? If it is OK, I can then convert the other languages too.
Best regards,
Olivier
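A minimal sketch of one way to sanity-check the attached csv against the English corpus; it assumes the file holds a single column of paragraph strings, which is a guess about the layout rather than a confirmed format:
import pandas as pd
import pyate

# load the attached French csv and the bundled English corpus
fr = pd.read_csv("default_general_domain.fr.csv").iloc[:, 0]
en = pyate.TermExtraction.get_general_domain()

# compare paragraph counts and mean character lengths
print("fr:", len(fr), "paragraphs,", fr.str.len().mean(), "chars on average")
print("en:", len(en), "paragraphs,", en.str.len().mean(), "chars on average")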
Hey @oterrier, that looks good! It would be awesome if you could convert some of the other commonly used languages as well. Once you're done, could you place the csv files alongside the other data files and submit a pull request?
I was also wondering whether we should set up some sort of CLI for installing additional languages, since they are about 5-10 megabytes each (I imagine the German one is even larger). I'm thinking of something similar to the python -m spacy download ... type of interface, but we can consider that once we get there. Thanks once again for your help!
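A hypothetical sketch of what such a download command could look like; the subcommand name, download URL, and destination path are all invented for illustration and are not part of pyate's actual interface:
import argparse
import pathlib
import urllib.request

BASE_URL = "https://example.com/pyate-corpora"  # placeholder host, not a real endpoint

def main() -> None:
    parser = argparse.ArgumentParser(prog="python -m pyate")
    sub = parser.add_subparsers(dest="command", required=True)
    dl = sub.add_parser("download", help="fetch a general-domain corpus")
    dl.add_argument("lang", help="language code, e.g. fr, es, de")
    args = parser.parse_args()

    if args.command == "download":
        filename = f"default_general_domain.{args.lang}.csv"
        # store next to the package data files (destination is a guess)
        dest = pathlib.Path(__file__).parent / filename
        urllib.request.urlretrieve(f"{BASE_URL}/{filename}", str(dest))
        print(f"saved {filename} to {dest}")

if __name__ == "__main__":
    main()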
Hi, and thanks a lot for this great package. I am thinking of adding support for more languages (for example: fr, es, de, it, ar, pt, etc.). I have been looking at the Opus Wikipedia corpus available here. You can easily download a zip containing a huge list of sentences extracted from Wikipedia; the sentences are located in an xml file.
But I was wondering if the sentences aren't too short to be considered 'paragraphs'; the paragraphs used in the English corpus are apparently much longer. Do you think it is worth using this corpus? Maybe I could group a set of sentences (10?) together to build fake paragraphs. What would be your advice here?
Best regards
Olivier Terrier
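A rough sketch of the grouping discussed above, assuming the Opus XML wraps each sentence in an <s> element and that the csv needs a single text column; both the tag name and the "SECTION" header are assumptions to verify against the actual Opus file and the English default_general_domain csv:
import csv
import xml.etree.ElementTree as ET

WINDOW = 20  # sentences per pseudo-paragraph, as suggested above

def xml_to_csv(xml_path: str, csv_path: str, window: int = WINDOW) -> None:
    root = ET.parse(xml_path).getroot()
    # collect non-empty sentences from <s> elements (tag name is an assumption)
    sentences = [
        " ".join(s.itertext()).strip()
        for s in root.iter("s")
        if " ".join(s.itertext()).strip()
    ]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["SECTION"])  # header name is a guess
        # join each window of sentences into one pseudo-paragraph row
        for i in range(0, len(sentences), window):
            writer.writerow([" ".join(sentences[i:i + window])])

xml_to_csv("fr.xml", "default_general_domain.fr.csv")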