Closed by hugolpz 3 years ago
Project | Lists |
---|---|
Google/corpuscrawler | 1001 |
UNILEX¹ | 1000 |
[1]: Once `ca-valencia.txt` is accepted.
So at least one list is still missing from UNILEX. Once it is identified, please edit the script and run `bash from-corpuscrawler.sh`, then git commit and push. Or tell me which languages to handle and I will do it. 👍🏼 ⚡
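A minimal sketch for spotting the missing list, assuming the corpuscrawler word-count files have already been downloaded into a local directory. The function name `missing_lists` and the directory `corpuscrawler-wordcounts/` are hypothetical; `./data/frequency` is the UNILEX folder used by the script in this PR.

```bash
# Hypothetical helper: print the *.txt file names that exist in the first
# directory but not in the second.
# Usage: missing_lists <corpuscrawler_dir> <unilex_dir>
missing_lists() (
  source_dir=$1
  target_dir=$2
  work_dir=$(mktemp -d) || exit 1
  # comm needs both inputs sorted with the same collation, hence LC_ALL=C.
  (cd "$source_dir" && ls -1) | grep '\.txt$' | LC_ALL=C sort > "$work_dir/source"
  (cd "$target_dir" && ls -1) | grep '\.txt$' | LC_ALL=C sort > "$work_dir/target"
  # -23 drops lines unique to the second file and lines common to both,
  # leaving only the names that are missing from the target directory.
  LC_ALL=C comm -23 "$work_dir/source" "$work_dir/target"
  rm -rf "$work_dir"
)

# Example (paths are assumptions):
# missing_lists ./corpuscrawler-wordcounts ./data/frequency
```

Each name it prints is a candidate value for `targetFilename` (without the `.txt` suffix).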
@brawer, I think this pull request is pretty safe. Could you review it?
@brawer, I plan to delete my branch with the Valencia data. Please accept or decline this PR before then.
The code to regenerate the Valencia UNILEX data from the available Google corpuscrawler data is:
```bash
# Run: $ bash add-from-corpuscrawler.sh
# Adapt: change the value of `targetFilename`; see the list at github.com/google/corpuscrawler
targetFilename='ca-valencia'

# Download the word counts (one "count<TAB>form" row per line)
curl -fO "http://www.gstatic.com/i18n/corpora/wordcounts/${targetFilename}.txt"
# head "${targetFilename}.txt"

# Compute 'Corpus-Size' as the sum of the count column
corpusSize=$(awk -F '\t' '{ sum += $1 } END { print sum }' "${targetFilename}.txt")

# Write the UNILEX header
printf 'Form\tFrequency\n\n# SPDX-License-Identifier: Unicode-DFS-2016\n# Corpus-Size: %s\n\n' "${corpusSize}" > tmp.txt
# head tmp.txt

# Swap columns so each row reads "form<TAB>count"
awk -F '\t' 'BEGIN { OFS = FS } { print $2, $1 }' "${targetFilename}.txt" >> tmp.txt
# head tmp.txt

# Move the result into the frequency folder
mv tmp.txt "./data/frequency/${targetFilename}.txt"
# head "./data/frequency/${targetFilename}.txt"
```
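Before committing, it may be worth checking that the declared `Corpus-Size` matches the sum of the frequency column. This is only a sketch: `check_corpus_size` is a hypothetical helper, and it assumes the file layout the script above writes (a column header on line 1, `#` comment lines, then tab-separated `form<TAB>count` rows).

```bash
# Hypothetical helper: compare the "# Corpus-Size:" header of a generated
# file against the sum of its frequency column.
# Usage: check_corpus_size <file>
check_corpus_size() (
  awk -F '\t' '
    # Remember the declared size from the header comment.
    /^# Corpus-Size: / { declared = $0; sub(/^# Corpus-Size: /, "", declared); next }
    # Data rows have exactly two tab-separated fields; line 1 is the column header.
    NR > 1 && NF == 2 { total += $2 }
    END {
      if (declared + 0 == total) { printf "OK %d\n", total; exit 0 }
      printf "MISMATCH declared=%s computed=%d\n", declared, total
      exit 1
    }
  ' "$1"
)

# Example:
# check_corpus_size ./data/frequency/ca-valencia.txt
```

The function exits non-zero on a mismatch, so it can gate the `git commit` step.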
Looks good, but can you sign the Contributor License Agreement?
Done. The script will help integrate other frequency lists. I suspect you already have such a script, but I couldn't find it, so mine could be a useful addition ;)
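To integrate several lists in one go, the conversion step could be wrapped in a function. This is a sketch rather than part of the PR: `to_unilex` and the second list name in the loop are hypothetical, and it assumes the input files were already downloaded and contain `count<TAB>form` rows, as the script in this PR expects.

```bash
# Hypothetical helper: convert one corpuscrawler word-count file
# ("count<TAB>form" rows) to the UNILEX layout, written to stdout.
# Usage: to_unilex <wordcounts_file>
to_unilex() (
  input_file=$1
  # Corpus size is the sum of the count column.
  corpus_size=$(awk -F '\t' '{ sum += $1 } END { print sum }' "$input_file") || exit 1
  # Header: column names, blank line, license and size comments, blank line.
  printf 'Form\tFrequency\n\n# SPDX-License-Identifier: Unicode-DFS-2016\n# Corpus-Size: %s\n\n' "$corpus_size"
  # Data rows with the columns swapped to "form<TAB>count".
  awk -F '\t' 'BEGIN { OFS = FS } { print $2, $1 }' "$input_file"
)

# Example batch run over already-downloaded files (list names are placeholders):
# for name in ca-valencia another-list; do
#   to_unilex "${name}.txt" > "./data/frequency/${name}.txt"
# done
```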
See commits. @brawer