gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

Suggestion for extracting CNRTL Est Républicain Corpus #99

Open tattorba87 opened 4 years ago

tattorba87 commented 4 years ago

Instead of using:

xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g ; chomp' > est_republicain.txt

this seems to work better:

xmlstarlet sel -t -v '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g ; chomp' > est_republicain.txt

xmllint was replacing several French characters with their hex format, whereas xmlstarlet doesn't seem to have this issue.

tattorba87 commented 4 years ago

Or even better:

xmlstarlet sel -t -m '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' -n --var linebreak -n --break -v "translate(., \$linebreak, '')" Annee*/*.xml | perl -pe 's/^ +//g ; s/^ (.+)/$1\n/g; s/ +/ /g' > est_republicain.txt