opendatasicilia / tansignari

"T'ansignari e t'appeddiri"
http://tansignari.opendatasicilia.it
Creative Commons Attribution 4.0 International

Tables in web pages: extracting the author and the number of rows #228

Closed pigreco closed 1 year ago

pigreco commented 1 year ago

On these web pages, under the Notable Fixes section, there are several tables, one below the other, each with various rows and columns. Every table is characterized by its number of rows, an author, and whoever funded the bug fixes. An example:

image

How can I extract, for each table, the number of rows, the author, and the funder?

An example of the desired output:

| Bugs fixed by | These bugfixes were funded by | Number of rows |
| --- | --- | --- |
| Even Rouault | QGIS.ORG (through donations and sustaining memberships) | 15 |
| Alessandro Pasotti | QGIS.ORG (through donations and sustaining memberships) | 18 |

aborruso commented 1 year ago

You can start exploring the page with VisiData:

vd https://www.qgis.org/en/site/forusers/visualchangelog326/index.html#notable-fixes

As soon as I can, I'll come back with a more thought-out, tailored solution.

pigreco commented 1 year ago

vd https://www.qgis.org/en/site/forusers/visualchangelog326/index.html#notable-fixes

I know VisiData, and that command too: spectacular! Only the author and the sponsor are missing.

image

aborruso commented 1 year ago

Hi @pigreco, for this kind of task you need to learn how to write XPath queries or CSS selectors.

Then you need to look at the structure of the page and work out whether some element helps you tell the part you care about from everything else.

The part you care about is inside a section tag with id="notable-fixes".

The XPath query that selects that part is //section[@id="notable-fixes"], which means: find a section tag anywhere in the page whose id has the value notable-fixes. You can also test these queries in the browser.

image

Another useful feature of this HTML structure is that each user has a subsection whose id is built from the user's name (for example bug-fixes-by-even-rouault).

image
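
To make that structure concrete, here is a minimal sketch that lists those subsection ids using only grep and sed on a tiny, made-up HTML fragment. It is a crude text match standing in for the XPath query, it assumes every id attribute has the form id="bug-fixes-by-...", and the function name list_fix_ids is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the ids of the per-developer subsections.
# Crude regex match, not a real HTML parser; assumes ids of the form
# id="bug-fixes-by-<name>" as seen in the page described above.
list_fix_ids() {
  grep -o 'id="bug-fixes-by-[^"]*"' | sed -e 's/^id="//' -e 's/"$//'
}

# Made-up sample fragment, only for illustration
sample='<section id="notable-fixes">
<section id="bug-fixes-by-even-rouault"><h3>Bug fixes by Even Rouault</h3></section>
<section id="bug-fixes-by-alessandro-pasotti"><h3>Bug fixes by Alessandro Pasotti</h3></section>
</section>'

printf '%s\n' "$sample" | list_fix_ids
```

On a real page a proper XPath tool remains the more robust choice, since a text match like this breaks as soon as the markup changes shape.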

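I wrote a bash script that, broadly speaking, does the following: it collects the subsection ids, builds one JSON line per person, and converts those lines to CSV. The intermediate and final outputs look like this: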

bug-fixes-by-even-rouault
bug-fixes-by-alessandro-pasotti
bug-fixes-by-alex-bruy
bug-fixes-by-sandro-santilli
bug-fixes-by-nyall-dawson
{"nome":"Even Rouault   ","numeroRighe":"16","funded":"These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships)"}
nome,numeroRighe,funded
Even Rouault,15,These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships)
Alessandro Pasotti,18,These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships)
Alex Bruy,11,These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships)
Sandro Santilli,11,These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships)
Nyall Dawson,38,These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships)

The tools I use are scrape (for the XPath queries), Miller, and xq.


#!/bin/bash

set -x
set -e
set -u
set -o pipefail

folder="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

URL="https://www.qgis.org/en/site/forusers/visualchangelog326/index.html#notable-fixes"

# download the page
curl -kL "$URL" >"$folder"/tmp.html

# extract the people's ids
scrape <"$folder"/tmp.html -be '//section[@id="notable-fixes"]/section' | xq -r '.html.body.section[]."@id"' >"$folder"/toto-id.txt

if [ -f "$folder"/toto.jsonl ]; then
  rm "$folder"/toto.jsonl
fi

# for each user, extract the data
while read -r id; do
  nome=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]/h3/a[1]/text()' | sed -r 's/^.+by *//')
  numeroRighe=$(scrape <"$folder"/tmp.html -be '//section[@id="'"$id"'"]/table/tbody/tr' | xq '.html.body.tr|length')
  funded=$(scrape <"$folder"/tmp.html -be '//section[@id="'"$id"'"]//p[contains(.,"funded")]' | xq -r '(.html.body.p."#text")+""+(.html.body.p.a."#text")')
  echo '{"nome":"'"$nome"'","numeroRighe":"'"$numeroRighe"'","funded":"'"$funded"'"}' >>"$folder"/toto.jsonl
done <"$folder"/toto-id.txt

mlr --j2c clean-whitespace "$folder"/toto.jsonl >"$folder"/toto.csv
pigreco commented 1 year ago

List of web pages to download the data from:

https://www.qgis.org/en/site/forusers/visualchangelog36/index.html
https://www.qgis.org/en/site/forusers/visualchangelog38/index.html
https://www.qgis.org/en/site/forusers/visualchangelog310/index.html
https://www.qgis.org/en/site/forusers/visualchangelog312/index.html
https://www.qgis.org/en/site/forusers/visualchangelog314/index.html
https://www.qgis.org/en/site/forusers/visualchangelog316/index.html
https://www.qgis.org/en/site/forusers/visualchangelog318/index.html
https://www.qgis.org/en/site/forusers/visualchangelog320/index.html
https://www.qgis.org/en/site/forusers/visualchangelog322/index.html
https://www.qgis.org/en/site/forusers/visualchangelog324/index.html
https://www.qgis.org/en/site/forusers/visualchangelog326/index.html
https://www.qgis.org/en/site/forusers/visualchangelog328/index.html

pigreco commented 1 year ago

There are not many web pages to extract data from, so I proceed manually: I change the URL and run the script each time; then I append all the CSV files with cat (cat *.csv > unico.csv) and manually remove the extra header lines.
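
The manual header cleanup can be avoided. A minimal sketch with awk, assuming every CSV file starts with the same single header line; the function name merge_csv is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: concatenate CSV files, keeping only the first header.
# Assumes all files share the same one-line header.
merge_csv() {
  # FNR is the line number within the current file, NR the global one:
  # drop line 1 of every file except the very first line read.
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}
```

It could be called as merge_csv qgis36.csv qgis38.csv > unico.csv, where the input file names are only placeholders.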

Here is the result (from QGIS 3.6 to QGIS 3.26):

| date | version | developer | nroBugFixes | funded |
| --- | --- | --- | --- | --- |
| 2019-02-22 | QGIS 3.6 | Alessandro Pasotti | 28 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-02-22 | QGIS 3.6 | Alexander Bruy | 27 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-02-22 | QGIS 3.6 | Even Rouault | 6 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-02-22 | QGIS 3.6 | Hugo Mercier | 9 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-02-22 | QGIS 3.6 | Julien Cabieces | 9 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-02-22 | QGIS 3.6 | Jürgen Fischer | 20 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-02-22 | QGIS 3.6 | Loïc Bartoletti | 5 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-02-22 | QGIS 3.6 | Martin Dobias | 8 | This feature was funded byQGIS user group Germany |
| 2019-02-22 | QGIS 3.6 | Nyall Dawson | 20 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-02-22 | QGIS 3.6 | Peter Petrik | 8 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-02-22 | QGIS 3.6 | Victor Olaya | 10 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-06-21 | QGIS 3.8 | Alessandro Pasotti | 33 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-06-21 | QGIS 3.8 | Alexander Bruy | 15 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-06-21 | QGIS 3.8 | Denis Rouzaud | 1 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-06-21 | QGIS 3.8 | Even Rouault | 9 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-06-21 | QGIS 3.8 | Loïc Bartoletti | 4 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-06-21 | QGIS 3.8 | Peter Petrik | 7 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-06-21 | QGIS 3.8 | Victor Olaya | 10 | This feature was funded byQGIS.ORG donors and sponsors |
| 2019-10-25 | QGIS 3.10 | Alessandro Pasotti | 40 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2019-10-25 | QGIS 3.10 | Alexander Bruy | 19 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2019-10-25 | QGIS 3.10 | Even Rouault | 13 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2019-10-25 | QGIS 3.10 | Matthias Kuhn | 4 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2019-10-25 | QGIS 3.10 | Nyall Dawson | 74 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2019-10-25 | QGIS 3.10 | Paul Blottiere | 5 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2019-10-25 | QGIS 3.10 | Peter Petrik | 8 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2019-10-25 | QGIS 3.10 | Sandro Santilli | 9 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Alessandro Pasotti | 30 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Alexander Bruy | 4 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Bertrand Rix | 9 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Denis Rouzaud | 7 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Even Rouault | 9 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Julien Cabieces | 9 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Loïc Bartoletti | 11 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Nyall Dawson | 22 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Paul Blottiere | 7 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Sandro Santilli | 5 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Sebastien Peillet | 7 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-02-21 | QGIS 3.12 | Stephen Knox | 1 | |
| 2020-06-19 | QGIS 3.14 | Alessandro Pasotti | 31 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-06-19 | QGIS 3.14 | Alexander Bruy | 15 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-06-19 | QGIS 3.14 | Audun Ellertsen | 2 | This feature was funded byKongsberg Digital |
| 2020-06-19 | QGIS 3.14 | Bertrand Rix | 4 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-06-19 | QGIS 3.14 | Denis Rouzaud | 6 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-06-19 | QGIS 3.14 | Even Rouault | 17 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-06-19 | QGIS 3.14 | Julien Cabieces | 13 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-06-19 | QGIS 3.14 | Loïc Bartoletti | 5 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-06-19 | QGIS 3.14 | Nyall Dawson | 66 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-06-19 | QGIS 3.14 | Paul Blottiere | 8 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-06-19 | QGIS 3.14 | Sebastien Peillet | 6 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-10-23 | QGIS 3.16 | Alessandro Pasotti | 44 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-10-23 | QGIS 3.16 | Denis Rouzaud | 8 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-10-23 | QGIS 3.16 | Even Rouault | 20 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-10-23 | QGIS 3.16 | Julien Cabieces | 23 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-10-23 | QGIS 3.16 | Matthias Kuhn | 4 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-10-23 | QGIS 3.16 | Nyall Dawson | 83 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-10-23 | QGIS 3.16 | Olivier Dalang | 1 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-10-23 | QGIS 3.16 | Paul Blottiere | 11 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2020-10-23 | QGIS 3.16 | Peter Petrik | 48 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-02-22 | QGIS 3.18 | Alessandro Pasotti | 23 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-02-22 | QGIS 3.18 | Even Rouault | 11 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-02-22 | QGIS 3.18 | Julien Cabieces | 9 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-02-22 | QGIS 3.18 | Nyall Dawson | 31 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-02-22 | QGIS 3.18 | Peter Petrik | 14 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-06-21 | QGIS 3.20 | Alessandro Pasotti | 29 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-06-21 | QGIS 3.20 | Denis Rouzaud | 9 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-06-21 | QGIS 3.20 | Even Rouault | 14 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-06-21 | QGIS 3.20 | Julien Cabieces | 8 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-06-21 | QGIS 3.20 | Loïc Bartoletti | 7 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-06-21 | QGIS 3.20 | Nyall Dawson | 46 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-06-21 | QGIS 3.20 | Paul Blottiere | 7 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-06-21 | QGIS 3.20 | Peter Petrik | 6 | This feature was funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-10-22 | QGIS 3.22 | Alessandro Pasotti | 26 | These bug fixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-10-22 | QGIS 3.22 | Denis Rouzaud | 1 | These bug fixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-10-22 | QGIS 3.22 | Even Rouault | 15 | These bug fixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-10-22 | QGIS 3.22 | Julien Cabieces | 11 | These bug fixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-10-22 | QGIS 3.22 | Loïc Bartoletti | 9 | These bug fixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-10-22 | QGIS 3.22 | Nyall Dawson | 24 | These bug fixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-10-22 | QGIS 3.22 | Peter Petrik | 8 | These bug fixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2021-10-22 | QGIS 3.22 | Sandro Santilli | 10 | These bug fixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-02-18 | QGIS 3.24 | Alessandro Pasotti | 27 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-02-18 | QGIS 3.24 | Alexander Bruy | 21 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-02-18 | QGIS 3.24 | Damiano Lombardi | 1 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-02-18 | QGIS 3.24 | Denis Rouzaud | 3 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-02-18 | QGIS 3.24 | Even Rouault | 8 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-02-18 | QGIS 3.24 | Matthias Kuhn | 1 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-02-18 | QGIS 3.24 | Nyall Dawson | 29 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-02-18 | QGIS 3.24 | Paul Blottiere | 5 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-02-18 | QGIS 3.24 | Sandro Santilli | 7 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-06-18 | QGIS 3.26 | Alessandro Pasotti | 18 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-06-18 | QGIS 3.26 | Alexander Bruy | 11 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-06-18 | QGIS 3.26 | Even Rouault | 15 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-06-18 | QGIS 3.26 | Nyall Dawson | 38 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |
| 2022-06-18 | QGIS 3.26 | Sandro Santilli | 11 | These bugfixes were funded byQGIS.ORG (through donations and sustaining memberships) |

pigreco commented 1 year ago

Statistics (from QGIS 3.6 to QGIS 3.26): 1450 bugs fixed in total.

| name | nroBugFixes | % | nroVersion |
| --- | --- | --- | --- |
| Nyall Dawson | 433 | 29.9% | 10 |
| Alessandro Pasotti | 329 | 22.7% | 11 |
| Even Rouault | 137 | 9.4% | 11 |
| Alexander Bruy | 112 | 7.7% | 7 |
| Peter Petrik | 99 | 6.8% | 7 |
| Julien Cabieces | 82 | 5.7% | 7 |
| Paul Blottiere | 43 | 3.0% | 6 |
| Sandro Santilli | 42 | 2.9% | 5 |
| Loïc Bartoletti | 41 | 2.8% | 6 |
| Denis Rouzaud | 35 | 2.4% | 7 |
| Jürgen Fischer | 20 | 1.4% | 1 |
| Victor Olaya | 20 | 1.4% | 2 |
| Bertrand Rix | 13 | 0.9% | 2 |
| Sebastien Peillet | 13 | 0.9% | 2 |
| Hugo Mercier | 9 | 0.6% | 1 |
| Matthias Kuhn | 9 | 0.6% | 3 |
| Martin Dobias | 8 | 0.6% | 1 |
| Audun Ellertsen | 2 | 0.1% | 1 |
| Damiano Lombardi | 1 | 0.1% | 1 |
| Olivier Dalang | 1 | 0.1% | 1 |
| Stephen Knox | 1 | 0.1% | 1 |
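
Totals of this kind can be derived from the combined CSV. A minimal sketch with awk, assuming the column order date,version,developer,nroBugFixes,funded used in the result above and no commas inside any field; the function name bugfix_stats is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: per-developer totals, share of all fixes, and number
# of versions, read from a CSV on stdin.
# Assumes columns date,version,developer,nroBugFixes,funded, no quoted commas.
bugfix_stats() {
  LC_ALL=C awk -F, '
    NR == 1 { next }                      # skip the header line
    {
      total += $4                         # grand total of fixes
      fixes[$3] += $4                     # fixes per developer
      versions[$3]++                      # one CSV row per developer and version
    }
    END {
      for (dev in fixes)
        printf "%s,%d,%.1f%%,%d\n", dev, fixes[dev], 100 * fixes[dev] / total, versions[dev]
    }
  ' | sort -t, -k2,2nr                    # most fixes first
}
```

It could be run as bugfix_stats < toto.csv, producing one line per developer with name, total fixes, percentage, and number of versions.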

image

| date | version | number |
| --- | --- | --- |
| 2019-02-22 | QGIS 3.6 | 150 |
| 2019-06-21 | QGIS 3.8 | 79 |
| 2019-10-25 | QGIS 3.10 | 172 |
| 2020-02-21 | QGIS 3.12 | 121 |
| 2020-06-19 | QGIS 3.14 | 173 |
| 2020-10-23 | QGIS 3.16 | 242 |
| 2021-02-22 | QGIS 3.18 | 88 |
| 2021-06-21 | QGIS 3.20 | 126 |
| 2021-10-22 | QGIS 3.22 | 104 |
| 2022-02-18 | QGIS 3.24 | 102 |
| 2022-06-18 | QGIS 3.26 | 93 |

image

pigreco commented 1 year ago

@aborruso thank you very much for the thorough explanation. Everything looks easy when you explain it; I envy you, because these are tools I would love to be able to use, but it takes a lot of experience and creativity to know what to look for and what to filter.

Thanks a lot for the time you put into this.

pigreco commented 1 year ago

@aborruso I can't get the for loop to work over a set of links: this script doesn't work, or rather it only extracts the data from the first link.

#!/bin/bash

set -x
set -e
set -u
set -o pipefail

folder="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

LINK="https://www.qgis.org/en/site/forusers/visualchangelog36/index.html#notable-fixes
https://www.qgis.org/en/site/forusers/visualchangelog38/index.html#notable-fixes"

# loop over the web pages
for lista in $LINK;do
# download the page
curl -kL "$lista" >"$folder"/tmp.html

# extract the people's ids
scrape <"$folder"/tmp.html -be '//section[@id="notable-fixes"]/section' | xq -r '.html.body.section[]."@id"' >"$folder"/toto-id.txt

# if [ -f "$folder"/toto.jsonl ]; then
#  rm "$folder"/toto.jsonl
# fi

# for each user, extract the data
while read id; do
  versione=$LINK
  nome=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]/h3/a[1]/text()' | sed -r 's/^.+by *//')
  numeroRighe=$(scrape <"$folder"/tmp.html -be '//section[@id="'"$id"'"]/table/tbody/tr' | xq '.html.body.tr|length')
  funded=$(scrape <"$folder"/tmp.html -be '//section[@id="'"$id"'"]//p[contains(.,"funded")]' | xq -r '(.html.body.p."#text")+""+(.html.body.p.a."#text")')
  echo '{"versione":""'"$versione"'",nome":"'"$nome"'","numeroRighe":"'"$numeroRighe"'","funded":"'"$funded"'"}' >>"$folder"/toto.jsonl
done <"$folder"/toto-id.txt
done
pigreco commented 1 year ago

@aborruso the loop works now, but I get an error from Miller:

+ mlr --j2c clean-whitespace /mnt/c/Users/pigre/Desktop/featureQGIS/toto.jsonl
mlr: Unable to parse JSON data: Line 1 column 5: Unexpected `h` in object

The script:

#!/bin/bash

set -x
set -e
set -u
set -o pipefail

folder="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

LINK="https://www.qgis.org/en/site/forusers/visualchangelog36/index.html#notable-fixes
https://www.qgis.org/en/site/forusers/visualchangelog38/index.html#notable-fixes"

# loop over the web pages

for lista in $LINK
do

# download the page
    curl -kL "$lista" >"$folder"/tmp.html

# extract the people's ids
    scrape <"$folder"/tmp.html -be '//section[@id="notable-fixes"]/section' | xq -r '.html.body.section[]."@id"' >"$folder"/toto-id.txt

# for each user, extract the data
    while read id; do
        versione=$lista
        nome=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]/h3/a[1]/text()' | sed -r 's/^.+by *//')
        numeroRighe=$(scrape <"$folder"/tmp.html -be '//section[@id="'"$id"'"]/table/tbody/tr' | xq '.html.body.tr|length')
        funded=$(scrape <"$folder"/tmp.html -be '//section[@id="'"$id"'"]//p[contains(.,"funded")]' | xq -r '(.html.body.p."#text")+""+(.html.body.p.a."#text")')
        echo '{"versione":""'"$versione"'",nome":"'"$nome"'","numeroRighe":"'"$numeroRighe"'","funded":"'"$funded"'"}' >>"$folder"/toto.jsonl
    done <"$folder"/toto-id.txt

    if [ -f "$folder"/toto-id.txt ]; then
    rm "$folder"/toto-id.txt
    fi
done

mlr --j2c clean-whitespace "$folder"/toto.jsonl >>"$folder"/toto.csv
pigreco commented 1 year ago

@aborruso I found the errors, they are in this line:

echo '{"versione":""'"$versione"'",nome":"'"$nome"'","numeroRighe":"'"$numeroRighe"'","funded":"'"$funded"'"}' >>"$folder"/toto.jsonl

Some double quotes (") are misplaced: there is an extra one after "versione": and one is missing before nome.

Now it works!

echo '{"versione":"'"$versione"'","nome":"'"$nome"'","numeroRighe":"'"$numeroRighe"'","funded":"'"$funded"'"}' >>"$folder"/toto.jsonl
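
Hand-placed quotes are easy to get wrong, and a value that itself contains a double quote or a backslash would still break the line. A minimal sketch that keeps the quoting in one place, using two hypothetical helpers, json_escape and json_line; it only escapes backslashes and double quotes, which is an assumption about what the scraped values can contain.

```shell
#!/bin/bash
# Hypothetical helper: escape backslashes and double quotes for a JSON string.
# Control characters (tabs, newlines) are NOT handled in this sketch.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}   # backslash -> two backslashes
  s=${s//\"/\\\"}   # double quote -> backslash + double quote
  printf '%s' "$s"
}

# Hypothetical helper: build one JSON object from "key value key value ..."
json_line() {
  local out="" sep=""
  while [ "$#" -ge 2 ]; do
    out+="$sep\"$(json_escape "$1")\":\"$(json_escape "$2")\""
    sep=","
    shift 2
  done
  printf '{%s}\n' "$out"
}

json_line versione "QGIS 3.6" nome "Even Rouault" numeroRighe "6"
```

Inside the loop the echo line could then become json_line versione "$versione" nome "$nome" numeroRighe "$numeroRighe" funded "$funded" >>"$folder"/toto.jsonl.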
pigreco commented 1 year ago

This script loops over all the web pages and creates a single CSV file:

#!/bin/bash

set -x
set -e
set -u
set -o pipefail

folder="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# build a variable holding the list of URLs
LINK="https://www.qgis.org/en/site/forusers/visualchangelog36/index.html
https://www.qgis.org/en/site/forusers/visualchangelog38/index.html"

# loop over the web pages

for lista in $LINK
do

# download the page
    curl -kL "$lista" >"$folder"/tmp.html

# extract the people's ids
    scrape <"$folder"/tmp.html -be '//section[@id="notable-fixes"]/section' | xq -r '.html.body.section[]."@id"' >"$folder"/toto-id.txt

# for each user, extract the data
    while read -r id; do
        version=`echo "$lista" | sed -e 's/[^0-9]//g' | sed -e 's/^/QGIS /' | sed -e 's/QGIS 3/QGIS 3./'`
        developer=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]/h3/a[1]/text()' | sed -r 's/^.+by *//')
        nroBugsFixes=$(scrape <"$folder"/tmp.html -be '//section[@id="'"$id"'"]/table/tbody/tr' | xq '.html.body.tr|length')
        funded=$(scrape <"$folder"/tmp.html -be '//section[@id="'"$id"'"]//p[contains(.,"funded")]' | xq -r '(.html.body.p."#text")+""+(.html.body.p.a."#text")')
        data=`grep 'Release date:' "$folder"/tmp.html | sed -e 's/[^0-9-]//g'`
        echo '{"data":"'"$data"'","version":"'"$version"'","developer":"'"$developer"'","nroBugsFixes":"'"$nroBugsFixes"'","funded":"'"$funded"'"}' >>"$folder"/toto.jsonl
    done <"$folder"/toto-id.txt

    if [ -f "$folder"/toto-id.txt ]; then
    rm "$folder"/toto-id.txt
    fi
done

# strip tab characters (\t) from the file
sed -i 's/\t//g' "$folder"/toto.jsonl

# convert from JSONL to CSV
mlr --j2c clean-whitespace "$folder"/toto.jsonl  >"$folder"/toto.csv

# remove files that are no longer needed (paths anchored to the script folder)
rm "$folder"/tmp.* "$folder"/*.jsonl

image
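
The version label and the list of URLs can both be derived from the short numeric suffixes in the changelog URLs. A minimal sketch that uses bash's built-in regex matching in place of the chained sed calls; it assumes the major version is always a single digit, and the function name qgis_version_label and the versions array are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: turn ".../visualchangelog326/index.html" into "QGIS 3.26".
# Assumes a one-digit major version followed by the minor version.
qgis_version_label() {
  local url=$1
  if [[ $url =~ visualchangelog([0-9])([0-9]+) ]]; then
    printf 'QGIS %s.%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    return 1   # the URL does not look like a visual changelog page
  fi
}

# Suffixes taken from the list of pages posted earlier in the thread
versions=(36 38 310 312 314 316 318 320 322 324 326 328)

for v in "${versions[@]}"; do
  url="https://www.qgis.org/en/site/forusers/visualchangelog${v}/index.html"
  printf '%s\t%s\n' "$(qgis_version_label "$url")" "$url"
done
```

The loop only prints label and URL pairs; in the script above, the body of the for loop over $LINK could iterate over the same array instead.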

aborruso commented 1 year ago

Why hasn't this one been closed, with a recipe?

pigreco commented 1 year ago

Hi @aborruso

> Why hasn't this one been closed, with a recipe?

I don't remember why it is still open, but I don't think I ever wrote the recipe; I will, but I don't know when :-(

pigreco commented 1 year ago

image

Recipe written and published: https://tansignari.opendatasicilia.it/ricette/bash/tabelle_in_pagine_web_estrarre_autore_e_nro_righe/

Thanks a mille(r), @aborruso