Convert formats to feed alternative usage

hugolpz commented 1 year ago

SPARQL2JSON:

script.sh

JSON:

To be reused in:

Processing

Example :

# For Lingua Libre Bot
cat LL-LanguagesRecordsData.json | jq --raw-output '.[] | [.language, .records, .languageLabel] | @tsv'
# For Operations
cat LL-LanguagesActive.json | jq --raw-output '.[] | [.language, .records, .languageLabel] | @tsv'

CSV

For CSV: convert the JSON to CSV, or download the CSV directly.
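As a rough sketch of the JSON-to-CSV route, jq's `@csv` filter can emit one CSV row per object. This assumes the JSON is an array of flat objects with the `language`, `records` and `languageLabel` keys used in the jq examples above; the function name `json_to_csv` is only illustrative.

```shell
#!/bin/bash
# Hypothetical helper (not part of Sparql2Data): JSON array on stdin -> CSV on stdout.
# Assumes flat objects with the keys language, records and languageLabel.
json_to_csv() {
  jq --raw-output '
    ["language", "records", "languageLabel"],
    (.[] | [.language, .records, .languageLabel])
    | @csv'
}

# Usage: json_to_csv < LL-LanguagesActive.json > LL-LanguagesActive.csv
```

The first row is a header; string values are double-quoted by `@csv`, which keeps labels containing commas intact.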

hugolpz commented 1 year ago

@pamputt, which file format would you prefer to replace your list of Lingua Libre languages (LiLi Qids) that feeds the bot: JSON, CSV, or TSV?

I would prefer to save both the Qid and the number of recordings, so that the SPARQL queries for languages with over 50k records can be split into several smaller queries.

pamputt commented 1 year ago

The less complicated the better, so TSV is the best (or CSV), but definitely not JSON.

hugolpz commented 1 year ago

EDIT: Ouch. The default Blazegraph API, as used by the Lingualibre.org endpoint, only supports XML and JSON. So the best option is to return JSON and use jq to format it. (I'm on it)

JSON via sparql2data

sparql2data has built-in data validation and only saves the response if it is valid. Given the query LL-LanguagesRecordsData.sparql, one can use Sparql2Data as a module with a command such as:

bash script.sh -q ./path/to/LL-LanguagesRecordsData.sparql -s lingualibre -f json
# Output response in ./data/LL-LanguagesRecordsData.json

JSON via Lingualibre API direct call

We can borrow the core code from sparql2data and integrate it into the Lingua-Libre-Bot code:

# Sparql query
query=$(cat "${sparql}")
# echo "QUERY= ${query}" | head -n 5

# Send the SPARQL query to the Lingua Libre endpoint with curl
response=$(curl -G --data-urlencode "query=${query}" "https://lingualibre.org/sparql?format=json")
echo "RESPONSE: ${response}" | head -n 20

# First cleanup
clean=$(echo "${response}" | jq '.results.bindings' | jq 'map(map_values(.value))' | sed -E "s/https?:\/\/.*\/entity\///g" )

## IF is valid response, THEN print to local file, ELSE error message.
firstline=$(echo "${clean}" | head -n 1)
if [[ ${firstline:0:1} == "[" ]]; then
    echo "${clean}" > "./data/list_languages.json"
else
    echo "XHR response appears invalid, was NOT printed to ./data/list_languages.json"
fi
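The first-character check above can be made stricter by letting jq parse the whole response. The following is only a sketch, assuming jq is installed; `is_json_array` is a hypothetical helper name, not something sparql2data provides.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if stdin is valid JSON whose
# top-level value is an array.
# With --exit-status, jq returns non-zero when the filter's last output
# is false or null; it also fails when the input is not parseable JSON.
is_json_array() {
  jq --exit-status 'type == "array"' >/dev/null 2>&1
}

# Usage, with ${clean} as produced by the cleanup step:
# if echo "${clean}" | is_json_array; then
#   echo "${clean}" > "./data/list_languages.json"
# fi
```

Compared with testing for a leading `[`, this also rejects truncated responses that start like an array but are not complete JSON.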

JSON via Github Sparql2Json

Sparql2Data is also configured as a GitHub Pages site with nightly builds, which can be queried as an API.

response=$(curl -G https://hugolpz.github.io/Sparql2Data/data/LL-LanguagesActive.json)
echo "${response}"

TSV from JSON via JQ

jq is a well-known tool for processing and reformatting JSON data, e.g.:

curl -G https://hugolpz.github.io/Sparql2Data/data/LL-LanguagesActive.json | \
    jq --raw-output '.[] | .language+"   "+.records+"   "+.languageLabel'

Or

curl -G https://hugolpz.github.io/Sparql2Data/data/LL-LanguagesActive.json | \
    jq --raw-output '.[] | [.language, .records, .languageLabel] | @tsv'

Output:

...
Q25     33462   Esperanto
Q336    56756   Odia
Q307    62224   Bengali
Q298    92252   Polish
Q21     255005  French

Loop

You will then have to load the file (with cat?) and loop over the data, which provides several values per language, such as the Qid, the number of records, the ISO code, and so on, to do what you want.

#!/bin/bash
# USAGE: bash loop.sh file.tsv

filepath="$1"
# Columns are tab-separated: Qid, number of records, language label
while IFS=$'\t' read -r llqid records label; do
  # Run a command with the first two columns as parameters
  echo "Running command with parameters: $llqid $records"
  if (( records >= 50000 )); then
    : # yearly python run $llqid
  else
    : # minimal python run $llqid
  fi
done < "$filepath"
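To make the branching testable on its own, the threshold decision can be pulled into a small function. This is a sketch under assumptions: the 50000 cut-off comes from the discussion above, while the function names `run_mode` and `plan_runs` are hypothetical, and the `yearly`/`minimal` labels merely echo the placeholder comments in the loop; they are not actual bot options.

```shell
#!/bin/bash
# Hypothetical helper: print which kind of run a language needs,
# given its number of records (an integer).
run_mode() {
  local records="$1"
  if (( records >= 50000 )); then
    echo "yearly"    # at or above the threshold
  else
    echo "minimal"   # below the threshold
  fi
}

# Print "Qid<TAB>mode" for every row of a Qid/records/label TSV read from stdin.
plan_runs() {
  local llqid records label
  while IFS=$'\t' read -r llqid records label; do
    printf '%s\t%s\n' "$llqid" "$(run_mode "$records")"
  done
}

# Usage: plan_runs < file.tsv
```

With the sample output above, Esperanto (33462 records) would be planned as `minimal` and French (255005 records) as `yearly`.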

hugolpz commented 1 year ago

@pamputt, I see the way ahead. https://github.com/lingua-libre/Lingua-Libre-Bot/pull/22 has been merged, so I can move forward and refine your file into a documented bash script.

hugolpz / Sparql2Data