Extracting text from an HTML page

pigreco commented 2 years ago

On a web page like this one there are several sections, marked (1), (2) and (3).

How can I extract the line at point (1) together with its link, and the lines at (2) and (3)?

Example based on the screenshot above:

Feature: Horizontal table scroll with shift+wheel
(https://www.qgis.org/en/site/forusers/visualchangelog326/index.html#id12)

funded by the City of Canning
developed by Nyall Dawson (North Road Consulting)

and then build a table, because the web page contains many features

description	link	funded by	developed by
Feature: Horizontal table scroll with shift+wheel	https://www.qgis.org/en/site/forusers/visualchangelog326/index.html#id12	the City of Canning	Nyall Dawson (North Road Consulting)

PS: this work will (most probably) end up in the new QGIS.org website

gbvitrano commented 2 years ago

have you already seen this recipe? https://tansignari.opendatasicilia.it/ricette/query/estrarre_liste_da_web/

pigreco commented 2 years ago

> have you already seen this recipe? https://tansignari.opendatasicilia.it/ricette/query/estrarre_liste_da_web/

Hi @gbvitrano, no, I hadn't seen it.

Thanks for the suggestion, I'll try it and report back.

pigreco commented 2 years ago

@gbvitrano

the approach seems to be different: on my web page, following the recipe, I can NOT find the divs, so I get an empty txt file.

I downloaded the page:

curl "https://www.qgis.org/en/site/forusers/visualchangelog326/index.html" > ./pagina.html

and then this gives me an empty file

<./pagina.html scrape -e '//div[@class="reference external"]/text()' >./toto.txt

aborruso commented 2 years ago

> and then this gives me an empty file

@pigreco on that page there are no div elements with the class "reference external".

pigreco commented 2 years ago

> > and then this gives me an empty file

> @pigreco on that page there are no div elements with the class "reference external".

but I followed the recipe and these came out:

//div[@class="toc-backref"]/text():
//div[@class="reference external"]/text():
//div[@class="reference external"]/text():

but I'm stopping here.

pigreco commented 2 years ago

@aborruso

my apologies: actually, right-clicking and copying the element gives

<a class="toc-backref" href="#id6">Feature: Coordinate ordering according to CRS</a>
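Since the copied element is an `a` tag rather than a `div`, the title and its anchor can be read straight from that element. A minimal sketch using only sed on the line above; the variable names and regular expressions are illustrative assumptions, and a whole page is better handled with an XPath-aware tool.

```shell
# Element copied from the browser (taken from the comment above).
element='<a class="toc-backref" href="#id6">Feature: Coordinate ordering according to CRS</a>'
# Page the element comes from.
base='https://www.qgis.org/en/site/forusers/visualchangelog326/index.html'

# Text between the opening and closing tag.
title=$(printf '%s\n' "$element" | sed -E 's/<[^>]+>//g')
# Value of the href attribute (an in-page anchor such as #id6).
anchor=$(printf '%s\n' "$element" | sed -E 's/.*href="([^"]*)".*/\1/')

# One tab-separated row: description, then full link.
printf '%s\t%s\n' "$title" "${base}${anchor}"
```

So the XPath has to target `a` elements with class `toc-backref`, not `div` elements.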

gbvitrano commented 2 years ago

I can only extract the h2 and h3 (title and subtitle) with their links, but I can't extract funded and developed, which are paragraphs


for the titles: =IMPORTXML("https://www.qgis.org/en/site/forusers/visualchangelog326/index.html"; "//div[@class='container']//section/h2")

for the links: =IMPORTXML("https://www.qgis.org/en/site/forusers/visualchangelog326/index.html"; "//div[@class='container']//section/h2/a[@class='toc-backref']/@href")

pigreco commented 2 years ago

@gbvitrano to extract the paragraphs I tried:

=IMPORTXML("https://www.qgis.org/en/site/forusers/visualchangelog326/index.html"; "//div[@class='container']//section/p[3]")

it seems to work, but the result is still dirty

the sections have several paragraphs and they don't always line up; for example, p[3] does not always hold the same content.
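One way around the shifting positions is to pick paragraphs by what they say rather than by their index. A rough sketch on a made-up fragment (the two sample sections are hypothetical, not copied from the page), using grep to keep only the paragraphs that mention the funder:

```shell
# Hypothetical sample: the "funded by" paragraph is p[2] in one section and p[3] in the other.
sample='<section id="feature-a"><p>Intro</p><p>This feature was funded by Org A</p></section>
<section id="feature-b"><p>Intro</p><p>More text</p><p>This feature was funded by Org B</p></section>'

# grep -o prints each matching paragraph on its own line, wherever it sits in the section.
funded_lines=$(printf '%s\n' "$sample" | grep -o '<p>[^<]*funded by[^<]*</p>')
printf '%s\n' "$funded_lines"
```

This only works for paragraphs without nested tags; paragraphs that contain links need a real HTML parser.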

aborruso commented 2 years ago

hi @pigreco, this one is similar to #228

You can extract the list of sections that are features

curl -kL https://www.qgis.org/en/site/forusers/visualchangelog326/index.html | scrape -be '//section[contains(@id,"feature-")]' | xq -r '.html.body.section[]."@id"'

And then, for each section id (for example feature-selecting-all-features-by-attribute-value-from-identify-results-panel):

title

curl -kL https://www.qgis.org/en/site/forusers/visualchangelog326/index.html | scrape -e '//section[@id="feature-selecting-all-features-by-attribute-value-from-identify-results-panel"]/h3/a[1]/text()'

the URL: you build it by appending the ID, which is an anchor in the page;
the funded by (it takes a bit of extraction and concatenation, but there is an example in the other issue)

curl -kL https://www.qgis.org/en/site/forusers/visualchangelog326/index.html | scrape -e '//section[@id="feature-selecting-all-features-by-attribute-value-from-identify-results-panel"]//p[contains(.,"funded b")]'

developed by

curl -kL https://www.qgis.org/en/site/forusers/visualchangelog326/index.html | scrape -e '//section[@id="feature-selecting-all-features-by-attribute-value-from-identify-results-panel"]//p[contains(.,"developed b")]'
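The URL step above has no command of its own; it amounts to joining the page address and the section id with a `#`. A minimal sketch, where the variable names `page`, `id` and `link` are only illustrative:

```shell
# Page address and one of the section ids returned by the listing command.
page='https://www.qgis.org/en/site/forusers/visualchangelog326/index.html'
id='feature-selecting-all-features-by-attribute-value-from-identify-results-panel'

# The id is an in-page anchor, so the link is page + "#" + id.
link="${page}#${id}"
printf '%s\n' "$link"
```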

pigreco commented 2 years ago

> You can extract the list of sections that are features

> curl -kL https://www.qgis.org/en/site/forusers/visualchangelog326/index.html | scrape -be '//section[contains(@id,"feature-")]' | xq -r '.html.body.section[]."@id"'

why does it work only for the page: https://www.qgis.org/en/site/forusers/visualchangelog326/index.html?

for example, it doesn't work for https://www.qgis.org/en/site/forusers/visualchangelog324/index.html

aborruso commented 2 years ago

> for example, it doesn't work for https://www.qgis.org/en/site/forusers/visualchangelog324/index.html

It works for me

curl -kL https://www.qgis.org/en/site/forusers/visualchangelog324/index.html | scrape -be '//section[contains(@id,"feature-")]' | xq -r '.html.body.section[]."@id"'

pigreco commented 2 years ago

I need to iterate over these pages
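The changelog pages differ only in the version digits of the URL, so the list to iterate over could be generated instead of typed by hand. A small sketch, assuming the pages cover the even-numbered 3.x releases from 3.0 to 3.26 (an assumption about the site's naming) and using an illustrative helper name `changelog_urls`:

```shell
# Print one changelog URL per even minor release between 3.0 and 3.26.
changelog_urls() {
    minor=0
    while [ "$minor" -le 26 ]; do
        printf 'https://www.qgis.org/en/site/forusers/visualchangelog3%s/index.html\n' "$minor"
        minor=$((minor + 2))
    done
}

# Derive a readable version label from each URL and print it next to the URL.
changelog_urls | while IFS= read -r url; do
    # Keep only the digits of the URL, e.g. 326, then drop the leading major version.
    digits=$(printf '%s\n' "$url" | sed -e 's/[^0-9]//g')
    printf 'QGIS 3.%s\t%s\n' "${digits#3}" "$url"
done
```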

pigreco commented 2 years ago

This returns what I was looking for, but the file is still "dirty"

#!/bin/bash

set -x
set -e
set -u
set -o pipefail

folder="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

LINK="https://www.qgis.org/en/site/forusers/visualchangelog36/index.html
https://www.qgis.org/en/site/forusers/visualchangelog38/index.html"

# loop over the web pages

for lista in $LINK
do

# download the page
    curl -kL "$lista" >"$folder"/tmp.html

# extract the feature section ids
    scrape <"$folder"/tmp.html -be '//section[contains(@id,"feature-")]' | xq -r '.html.body.section[]."@id"' >"$folder"/toto-feature.txt

# for each feature, extract the data
    while IFS= read -r id; do
        version=`echo "$lista" | sed -e 's/[^0-9]//g' | sed -e 's/^/QGIS /' | sed -e 's/QGIS 3/QGIS 3./'`
        feature=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]/h3/a[1]/text()' | sed -r 's/^.+by *//')
    #   argomento=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]/h2/a[1]/text()')
        developed=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]//p[contains(.,"developed b")]' | sed -r 's/.+(">)(.+)(<\/a><\/p>)/\2/g')
        funded=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]//p[contains(.,"funded b")]' | sed -r 's/.+(">)(.+)(<\/a><\/p>)/\2/g')
        data=`grep 'Release date:' "$folder"/tmp.html | sed -e 's/[^0-9-]//g'`
        echo '{"data":"'"$data"'","version":"'"$version"'","feature":"'"$feature"'","developed":"'"$developed"'","funded":"'"$funded"'"}' >>"$folder"/toto.jsonl
    done <"$folder"/toto-feature.txt

    if [ -f "$folder"/toto-feature.txt ]; then
        rm "$folder"/toto-feature.txt
    fi
done

mlr --j2c clean-whitespace "$folder"/toto.jsonl >>"$folder"/totoFeature.csv

the developed and funded fields are sometimes clean (names only), while other times they carry along the HTML <p> tags
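A likely reason (an assumption, not checked against every page) is that the sed expression only rewrites paragraphs ending in `</a></p>`, so names that are not links pass through with their tags. A hedged sketch of a more tolerant cleanup that strips every tag and then drops the text up to "by"; the function name `clean_credit` and the sample paragraphs are illustrative:

```shell
# Strip all HTML tags, remove everything up to "funded by"/"developed by", trim trailing spaces.
clean_credit() {
    sed -E -e 's/<[^>]+>//g' -e 's/^.*(funded|developed) by *//' -e 's/[[:space:]]+$//'
}

# Linked name: the original expression handles this case too.
printf '%s\n' '<p>This feature was funded by <a class="reference external" href="https://example.org">Org A</a></p>' | clean_credit
# Plain-text name: the original expression would leave the <p> tags in place.
printf '%s\n' '<p>This feature was developed by Jane Doe</p>' | clean_credit
```

The sketch works line by line, so a paragraph split over several lines would need to be joined first.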

pigreco commented 2 years ago

this script works wonders

#!/bin/bash

set -x
set -e
set -u
set -o pipefail

folder="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

LINK="https://www.qgis.org/en/site/forusers/visualchangelog30/index.html
https://www.qgis.org/en/site/forusers/visualchangelog32/index.html
https://www.qgis.org/en/site/forusers/visualchangelog34/index.html
https://www.qgis.org/en/site/forusers/visualchangelog36/index.html
https://www.qgis.org/en/site/forusers/visualchangelog38/index.html
https://www.qgis.org/en/site/forusers/visualchangelog310/index.html
https://www.qgis.org/en/site/forusers/visualchangelog312/index.html
https://www.qgis.org/en/site/forusers/visualchangelog314/index.html
https://www.qgis.org/en/site/forusers/visualchangelog318/index.html
https://www.qgis.org/en/site/forusers/visualchangelog320/index.html
https://www.qgis.org/en/site/forusers/visualchangelog322/index.html
https://www.qgis.org/en/site/forusers/visualchangelog324/index.html
https://www.qgis.org/en/site/forusers/visualchangelog326/index.html
https://www.qgis.org/en/site/forusers/visualchangelog316/index.html"

# loop over the web pages

for lista in $LINK
do

# download the page
    curl -kL "$lista" >"$folder"/tmp.html
# extract the version
    version=`echo "$lista" | sed -e 's/[^0-9]//g' | sed -e 's/^/QGIS/' | sed -e 's/QGIS3/QGIS3-/'`

# extract all sections, feature or not (unfortunately, in 3.16 the section ids do not start with feature)
# every path is kept under $folder so the script also works when run from another directory
    scrape <"$folder"/tmp.html -be '//section[contains(@id,"")]' | xq -r '.html.body.section[]."@id"' >"$folder"/toto-feature$version.txt
    if [ -f "$folder"/toto-featureQGIS3-16.txt ]; then
        mv "$folder"/toto-featureQGIS3-16.txt "$folder"/toto-featureQGIS3-16-pulito.txt
    else
        grep -P '^feature-' <"$folder"/toto-feature$version.txt >"$folder"/toto-feature$version-pulito.txt
    fi

# for each feature, extract the data
    while IFS= read -r id; do
        version=`echo "$lista" | sed -e 's/[^0-9]//g' | sed -e 's/^/QGIS /' | sed -e 's/QGIS 3/QGIS 3./'`
        feature=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]/h3/a[1]/text()' | sed -r 's/^.+by *//')
    #   argomento=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]/h2/a[1]/text()')
        developed=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]//p[contains(.,"developed b")]' | sed -r 's/.+(">)(.+)(<\/a><\/p>)/\2/g')
        funded=$(scrape <"$folder"/tmp.html -e '//section[@id="'"$id"'"]//p[contains(.,"funded b")]' | sed -r 's/.+(">)(.+)(<\/a><\/p>)/\2/g')
        data=`grep 'Release date:' "$folder"/tmp.html | sed -e 's/[^0-9-]//g'`
        echo '{"data":"'"$data"'","version":"'"$version"'","feature":"'"$feature"'","developed":"'"$developed"'","funded":"'"$funded"'"}' >>"$folder"/toto.jsonl
    done <"$folder"/toto-feature$version-pulito.txt

done

mlr --j2c clean-whitespace "$folder"/toto.jsonl >>"$folder"/totoFeatureALL.csv

clean output here:

https://data.world/pigrecoinfinito/feature-qgis-3326/workspace/file?filename=totoFeatureALL_test.csv
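The script assembles each JSON line with echo, which would break if a name contained a double quote or a backslash. A minimal sketch of escaping those two characters first; `json_escape` is a hypothetical helper, not part of the script above, and it does not handle control characters:

```shell
# Escape backslashes first, then double quotes, so the value is safe inside a JSON string.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Illustrative value containing double quotes.
feature='Support for "smart" quotes'
printf '{"feature":"%s"}\n' "$(json_escape "$feature")"
```

In the loop, each of `$feature`, `$developed` and `$funded` could be passed through the helper before building the line.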

pigreco commented 1 year ago

recipe written and published

https://tansignari.opendatasicilia.it/ricette/bash/estrarre_testo_e_link_da_pagineweb/

opendatasicilia / tansignari

Extracting text from an HTML page #227