Open BaderEddineB opened 4 years ago
This seems to be a problem with https://cnrtl.fr/ . I just mailed them a bug report.
OK, thanks. I just found another download link: ( https://repository.ortolang.fr/api/content/export?&path=/est_republicain/4/&filename=est_republicain&scope=YW5vbnltb3Vz3 ). Is it the same corpus as the one on cnrtl.fr?
Someone from cnrtl.fr answered my question. The official new website for this corpus is https://www.ortolang.fr/market/corpora/est_republicain . Version 4, from 2020-07-22, is the latest.
Thank you very much. It looks a bit like the one I found (the pictures above). But when I run `xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Year*/*.xml | perl -pe 's/^ +//g; s/^(.+)/$1\n/g; chomp' > est_republicain.txt` to extract the titles and paragraphs into the text file est_republicain.txt, the extraction does not go well.
Here is an example of the resulting est_republicain.txt file:
Is this normal? What is the problem?
The file format might have changed. The idea is to extract the text only, and what you get is nearly what we need. You still need to replace all SGML entities with the characters they stand for.
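For the entity replacement step, here is a minimal sketch in Python. It assumes the entities left in the extracted text are standard named or numeric character references (such as `&amp;` or `&#233;`), which `html.unescape` from the standard library can decode. The function names `replace_entities` and `clean_file` are hypothetical, chosen for illustration only, and this is not necessarily how the original recipe handles it.

```python
import html


def replace_entities(line: str) -> str:
    """Decode character references (e.g. &amp;, &#233;, &eacute;) into plain characters.

    Assumption: the extracted text only contains standard named or
    numeric references that html.unescape knows about.
    """
    return html.unescape(line)


def clean_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path line by line, decoding entities on the way."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(replace_entities(line))
```

Calling `clean_file("est_republicain.txt", "est_republicain.clean.txt")` would write a decoded copy next to the extracted file (the output file name is only an example). Processing line by line keeps memory use low on a corpus of this size.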
Hello, I'm trying to download the est_republicain corpus to train a French language model using KenLM. When I click on the link, it gives me this error: "nginx error! The page you are looking for is not found". Any idea where I can get this corpus? Thanks.