Open hugolpz opened 1 year ago
@pamputt, which file format would you prefer to replace your list of Lingualibre languages (lili Qid) that feeds the bot? JSON, CSV, TSV?
I would prefer to save both the Qid and the number of recordings, so that every language with over 50k records gets its SPARQL query split into several smaller queries.
The less complicated the better, so TSV is the best (or CSV), but definitely not JSON.
EDIT: Ouch. The default Blazegraph API, as used by the Lingualibre.org endpoint, only supports XML and JSON. So the best option is to return JSON and use jq to format it. (I'm on it.)
Sparql2Data has built-in data validation and only saves the response if it is valid. Given the query LL-LanguagesRecordsData.sparql, one can use Sparql2Data as a module with a command such as:
bash script.sh -q ./path/to/LL-LanguagesRecordsData.sparql -s lingualibre -f json
# Output response in ./data/LL-LanguagesRecordsData.json
We can borrow the core code from Sparql2Data and integrate it into the Lingua-Libre-Bot code:
# Read the SPARQL query from the file named in ${sparql}
query=$(cat "${sparql}")
# echo "QUERY= ${query}" | head -n 5
# Send the SPARQL query to the Lingualibre endpoint with curl
response=$(curl -G --data-urlencode query="${query}" "https://lingualibre.org/sparql?format=json")
echo "RESPONSE: ${response}" | head -n 20
# First cleanup: keep only the bindings' values and strip the entity URL prefix
clean=$(echo "${response}" | jq '.results.bindings' | jq 'map(map_values(.value))' | sed -E "s/https?:\/\/.*\/entity\///g" )
## IF the response is valid, THEN print it to a local file, ELSE print an error message.
firstline=$(echo "${clean}" | head -n 1)
if [[ ${firstline:0:1} == "[" ]]; then
  echo "${clean}" > "./data/list_languages.json"
else
  echo "XHR response appears invalid, was NOT printed to ./data/list_languages.json"
fi
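Side note: the first-character check only tells us that the output starts with "[". Below is a minimal sketch of a stricter variant, assuming jq is installed with regex support. The function names clean_bindings and is_json_array are made up for illustration, and the sample response is hand-written in the SPARQL JSON results shape, not real endpoint output.

```bash
#!/bin/bash
# Hypothetical helpers, not part of Sparql2Data or Lingua-Libre-Bot.

# Reduce a SPARQL JSON response (stdin) to a flat list of objects,
# stripping the ".../entity/" prefix from every value.
clean_bindings() {
  jq '.results.bindings
      | map(map_values(.value | sub("^https?://.*/entity/"; "")))'
}

# Succeed only if stdin parses as JSON and its top-level value is an array.
is_json_array() {
  jq -e 'type == "array"' > /dev/null 2>&1
}

# Hand-written sample response (assumption, for illustration only).
sample='{"results":{"bindings":[{"language":{"type":"uri","value":"https://lingualibre.org/entity/Q21"},"records":{"type":"literal","value":"255005"}}]}}'

if clean=$(printf '%s' "${sample}" | clean_bindings) \
   && printf '%s' "${clean}" | is_json_array; then
  # In the real script this is where the file would be written.
  printf '%s\n' "${clean}"
else
  echo "Response appears invalid, nothing was saved" >&2
fi
```

Doing the prefix removal inside jq instead of sed keeps the whole cleanup in one tool, and the validity test then checks the parsed JSON type rather than a single character.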
Sparql2Data is also published as a GitHub Pages site with nightly builds, so its data files can be queried like an API.
response=$(curl -G https://hugolpz.github.io/Sparql2Data/data/LL-LanguagesActive.json)
echo "${response}"
jq is a well-known tool to process and reformat JSON data, e.g.:
curl -G https://hugolpz.github.io/Sparql2Data/data/LL-LanguagesActive.json | \
jq --raw-output '.[] | .language+" "+.records+" "+.languageLabel'
Or
curl -G https://hugolpz.github.io/Sparql2Data/data/LL-LanguagesActive.json | \
jq --raw-output '.[] | [.language, .records, .languageLabel] | @tsv'
Output:
...
Q25 33462 Esperanto
Q336 56756 Odia
Q307 62224 Bengali
Q298 92252 Polish
Q21 255005 French
Then you will have to load the file (cat?) and loop over that data, which provides several values per language, such as Qid, number of records, iso, ..., to do what you want.
#!/bin/bash
# USAGE: bash loop.sh file.tsv
filepath="$1"
# Columns: Qid, number of records, language label
while IFS=$'\t' read -r llqid records label; do
  # Run a command with the columns as parameters
  echo "Running command with parameters: $llqid $records"
  if (( records >= 50000 )); then
    : # yearly python run $llqid
  else
    : # minimal python run $llqid
  fi
done < "$filepath"
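To make the 50k split easier to test without running the bot, the decision can be pulled out into a small function. This is only a sketch: the names query_mode and plan_runs and the mode labels "yearly" and "minimal" are made up here, and only the 50000 threshold comes from the discussion above.

```bash
#!/bin/bash
# Hypothetical helpers for illustration, not existing bot code.
threshold=50000

# Print the query mode for a given number of records.
query_mode() {
  local records="$1"
  # Reject anything that is not a plain non-negative integer.
  if ! [[ "$records" =~ ^[0-9]+$ ]]; then
    echo "invalid"
    return 1
  fi
  # 10# forces base 10 so a leading zero is not read as octal.
  if (( 10#$records >= threshold )); then
    echo "yearly"
  else
    echo "minimal"
  fi
}

# Read "qid<TAB>records<TAB>label" rows on stdin, print "qid<TAB>mode".
# Rows whose record count is not a number are skipped.
plan_runs() {
  local llqid records label mode
  while IFS=$'\t' read -r llqid records label; do
    mode=$(query_mode "$records") || continue
    printf '%s\t%s\n' "$llqid" "$mode"
  done
}

# Example with two rows taken from the output listed earlier.
printf 'Q25\t33462\tEsperanto\nQ21\t255005\tFrench\n' | plan_runs
```

The plan printed by plan_runs could then be fed to whatever actually launches the Python runs, which keeps the threshold logic separate from the bot invocation.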
@pamputt, I see the way ahead. https://github.com/lingua-libre/Lingua-Libre-Bot/pull/22 has been merged, so I can move forward and refine your file into a documented bash script.
SPARQL2JSON:
JSON:
To be reused in:
Processing
Example :
CSV
For CSV: convert the JSON to CSV, or download CSV directly.
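For the JSON to CSV conversion, a minimal sketch with jq's @csv filter, assuming the flat list format shown earlier (objects with language, records and languageLabel keys). The function name json_to_csv and the sample row are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: convert the flat JSON list (stdin) to CSV with a header row.
json_to_csv() {
  jq --raw-output '
    (["language", "records", "languageLabel"] | @csv),
    (.[] | [.language, .records, .languageLabel] | @csv)
  '
}

# Hand-written sample in the same shape as LL-LanguagesActive.json (assumption).
sample='[{"language":"Q21","records":"255005","languageLabel":"French"}]'
printf '%s' "${sample}" | json_to_csv
```

Since @csv quotes every string field, labels that contain commas stay in a single column.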