add ca-valencia and tiny update `add-from-corpuscrawler.sh`

hugolpz commented 3 years ago

See commits. @brawer

CLAassistant commented 3 years ago

All committers have signed the CLA.

hugolpz commented 3 years ago

| Project | Lists |
|---|---|
| Google/corpuscrawler | 1001 |
| UNILEX¹ | 1000 |

¹ Once ca-valencia.txt is accepted.

So at least one list is still missing from UNILEX. Once it is identified, please edit the script, run `bash add-from-corpuscrawler.sh`, then `git commit` and push. Or tell me which languages to handle and I will do it. 👍🏼 ⚡
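One possible way to identify the missing list is to compare two plain-text lists of language codes, one per line. This is only a sketch: the function name `missing_lists` and both input files are hypothetical, and how the two lists are obtained (for example from a directory listing) is left open.

```shell
# Hypothetical helper: print codes that appear in the first list
# but not in the second (exact, whole-line comparison).
missing_lists() {
  grep -Fxv -f "$2" "$1"
}

# Demo with made-up inputs; real lists would come from the two projects.
demo_dir=$(mktemp -d)
printf '%s\n' ca ca-valencia fr > "${demo_dir}/corpuscrawler.txt"
printf '%s\n' ca fr > "${demo_dir}/unilex.txt"
missing_lists "${demo_dir}/corpuscrawler.txt" "${demo_dir}/unilex.txt"
```

Note that `grep` exits with a non-zero status when nothing is missing, so the exit code alone should not be read as an error.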

hugolpz commented 3 years ago

@brawer, I think this pull request is pretty safe. Could you review it?

hugolpz commented 3 years ago

@brawer, I plan to delete my branch with the Valencia data. Please accept or decline this PR before that.

The code to regenerate the Valencia UNILEX data from the available corpuscrawler data is:

# Run:  $ bash add-from-corpuscrawler.sh

# Adapt: change the value of `targetFilename`; see the list at github.com/google/corpuscrawler
targetFilename='ca-valencia'

# Download the word counts (tab-separated: frequency, then form)
curl -O "http://www.gstatic.com/i18n/corpora/wordcounts/${targetFilename}.txt"
# head "${targetFilename}.txt"
# Count 'Corpus-Size' by summing the frequency column
corpusSize=$(awk -F '\t' '{ sum += $1 } END { print sum }' "${targetFilename}.txt")
# Add UNILEX header
echo $'Form Frequency\n\n# SPDX-License-Identifier: Unicode-DFS-2016\n# Corpus-Size: '"${corpusSize}"$'\n' > tmp.txt
# head tmp.txt
# Swap columns so each row reads: form, frequency
awk -F '\t' 'BEGIN { OFS = FS } { print $2, $1 }' "${targetFilename}.txt" >> tmp.txt
# head tmp.txt
# Format and dispatch to frequency folders (3)
mv tmp.txt "./data/frequency/${targetFilename}.txt"
# head "./data/frequency/${targetFilename}.txt"
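
A quick sanity check on the generated file could look like the sketch below. It assumes the layout the script above produces (a `# Corpus-Size:` header line followed by tab-separated form and frequency rows). The function name `check_corpus_size` is hypothetical and not part of the repository.

```shell
# Hypothetical check: does the declared Corpus-Size equal the sum of
# the frequency column (second tab-separated field) in the file?
check_corpus_size() {
  awk -F '\t' '
    /^# Corpus-Size:/ {
      declared = $0
      sub(/^# Corpus-Size:[ ]*/, "", declared)
      next
    }
    # Only count rows with exactly two fields and a numeric frequency.
    NF == 2 && $2 ~ /^[0-9]+$/ { sum += $2 }
    END {
      if (declared + 0 == sum + 0) {
        print "ok " (sum + 0)
      } else {
        print "mismatch: header=" declared " sum=" (sum + 0)
        exit 1
      }
    }
  ' "$1"
}

# Demo on a small made-up file in the same layout.
demo_file=$(mktemp)
printf 'Form Frequency\n\n# SPDX-License-Identifier: Unicode-DFS-2016\n# Corpus-Size: 15\n\n' > "$demo_file"
printf 'de\t10\nla\t5\n' >> "$demo_file"
check_corpus_size "$demo_file"
```

On a real run the argument would be `./data/frequency/${targetFilename}.txt`.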

brawer commented 3 years ago

Looks good, but can you sign the Contributor License Agreement?

hugolpz commented 3 years ago

Done. The script will help to integrate other frequency lists. I suspect you already have such a script, but I didn't find it, so mine could be a positive addition ;)

unicode-org / unilex

add ca-valencia and tiny update `add-from-corpuscrawler.sh` #13