Est_republicaine Corpus not found

gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition

GNU Lesser General Public License v3.0

Est_republicaine Corpus not found #110

Open BaderEddineB opened 4 years ago

BaderEddineB commented 4 years ago

Hello, I'm trying to download the est_republicain corpus to train the French language model with KenLM. When I click on the link, I get this error: "nginx error! The page you are looking for is not found". Any idea where I can get this corpus? Thanks.

svenha commented 4 years ago

This seems to be a problem with https://cnrtl.fr/ . I just emailed them a bug report.

BaderEddineB commented 4 years ago

OK, thanks. I just found another download link: https://repository.ortolang.fr/api/content/export?&path=/est_republicain/4/&filename=est_republicain&scope=YW5vbnltb3Vz3 . Is this corpus the same as the one from cnrtl.fr?

BaderEddineB commented 4 years ago

[screenshot: est_repeb2]

svenha commented 4 years ago

Someone from cnrtl.fr answered my question. The new official website for this corpus is https://www.ortolang.fr/market/corpora/est_republicain (version 4, from 2020-07-22, is the latest).

BaderEddineB commented 4 years ago

Thank you very much, it looks a bit like the one I found (see the screenshot above). But when I run

xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Annee*/*.xml | perl -pe 's/^ +//g; s/^(.+)/$1\n/g; chomp' > est_republicain.txt

to extract the titles and paragraphs into the text file "est_republicain.txt", the extraction does not go well.

Here is an example of the resulting "est_republicain.txt" file: [screenshot: Capturekk]

Is this normal? What is the problem?
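For anyone who wants to check what the XPath expression is meant to select, here is a minimal Python sketch of the same idea: take every "p" or "head" element inside a "div" with type="article", ignore namespaces, and keep only the element's own text nodes. The function names and the sample document are made up for illustration and are not part of zamia-speech; the whitespace normalisation is also my own choice and differs from the perl step. Note that ElementTree returns already-decoded text, so character references do not show up in its output.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix, mirroring XPath local-name()."""
    return tag.rsplit("}", 1)[-1]


def direct_text(elem):
    """Yield the element's own text nodes (what XPath text() selects)."""
    if elem.text:
        yield elem.text
    for child in elem:
        # Text that follows a child element still belongs to the parent.
        if child.tail:
            yield child.tail


def extract_articles(xml_source: str):
    """Return one cleaned line per <p>/<head> found inside article divs."""
    root = ET.fromstring(xml_source)
    seen = set()  # avoids duplicates if article divs are nested
    lines = []
    for div in root.iter():
        if local_name(div.tag) != "div" or div.get("type") != "article":
            continue
        for node in div.iter():
            if local_name(node.tag) not in ("p", "head") or id(node) in seen:
                continue
            seen.add(id(node))
            # Collapse runs of whitespace (an assumption, not the perl logic).
            text = " ".join(" ".join(direct_text(node)).split())
            if text:
                lines.append(text)
    return lines


# Hypothetical sample, not taken from the corpus.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<div type="article"><head>Titre &#xE9;t&#xE9;</head>'
    "<p>Premier <hi>gras</hi> paragraphe.</p></div>"
    '<div type="other"><p>ignored</p></div>'
    "</body></text></TEI>"
)

for line in extract_articles(SAMPLE):
    print(line)
```

To run it over real files, read each XML file and pass its contents to `extract_articles`; the directory layout is whatever your download of the corpus uses.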

pguyot commented 2 years ago

The file format might have changed. The idea is to extract the text only, and what you get is nearly what we need. You still need to replace all SGML entities.

See https://serverfault.com/questions/440805/how-can-i-easily-convert-html-special-entities-from-a-standard-input-stream-in-l
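As one possible way to do the entity replacement, here is a small Python sketch that uses the standard library's `html.unescape`, which decodes named references such as `&amp;` and numeric ones such as `&#xE9;`. The function name `decode_stream` and the sample line are my own and only illustrative; I am assuming the leftover entities in "est_republicain.txt" are of this kind.

```python
import html
import io
import sys


def decode_stream(src, dst) -> None:
    """Copy src to dst line by line, decoding character references."""
    for line in src:
        dst.write(html.unescape(line))


# Demonstration on a made-up line; in a pipeline you would call
# decode_stream(sys.stdin, sys.stdout) instead.
sample = io.StringIO("L&#xE9;gislatives &amp; r&#233;gionales\n")
decode_stream(sample, sys.stdout)
```

With the last two lines replaced by `decode_stream(sys.stdin, sys.stdout)`, the script can be placed at the end of the xmllint/perl pipeline, before the redirect to "est_republicain.txt".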