mqAncientHistory / Lat-Epig

The Lat-Epig interface allows you to query the EDCS, save the search results in a TSV file, and plot them on a map of the Roman Empire without any prior knowledge of programming.
https://mybinder.org/v2/gh/mqAncientHistory/Lat-Epig/HEAD?urlpath=notebooks/EpigraphyScraper.ipynb
GNU General Public License v3.0

Fix extra text in the text of inscriptions #14

Closed by petrifiedvoices 3 years ago

petrifiedvoices commented 3 years ago

EDCS-73700333 - extra biblio references in the text of inscription

Instead of scraping the inscription text and the bibliographic references as separate attributes, the scraper captures the inscription text together with the DOI info. It puts the text of the comment into the same attribute as the inscription text, although it should go into a separate comments attribute.

HTML: Iesus s(anc)t<u=O>(s) ego Iesus sum ego fui s(anc)t<u=O>(s) ego s(anc)t<u=O>(s) fui / s(anc)t<u=O>(s) Iesus fui s(anc)t<u=O>(s) Iesus est Iesus s(anc)t<u=O>(s) est / Ego [3] / s(anc)t<u=O>(s) Iesus fuit s(anc)t<u=O>(s) Iesus est Iesus fuit s(anc)t<u=O>(s) Iesus / s(anc)t<u=O>(s) fuit Iesus s(anc)t<u=O>(s) Iesus fuit s(anc)t<u=O>(s) Iesus fuit / s(anc)t<u=O>(s) Iesus est Iesus est alef s(anc)t<u=O>(s) Iesus Iesus / Iesus s(anc)t<u=O>(s) Iesus alef est |(omega) est s(anc)t<u=O>(s) Iesus / Iesus s(anc)t<u=O>(s) fui s(anc)t(us) Iesus fui s(anc)t<u=O>(s) I[esus] / s(anc)t<u=O>(s) Iesus s(anc)t<u=O>(s) Iesus Iesus [fuit] / s(anc)t<u=O>(s) Iesus Iesus est s(anc)t(us) Iesus fuit s(anc)t<u=O>(s) / Iesus s(anc)t<u=O>(s) fuit Iesus ego fui I[esus] / Iesus s(anc)t(us) fuit s(anc)t<u=O>(s) s(anc)t<u=O>(s) fuit s(anc)t<u=O>(s) I[esus] / Iesus s(anc)t(us) fuit s(anc)t<u=O>(s) Iesus fuit Iesus fuit s(anc)t<u=O>(s) / Iesus s(anc)t<u=O>(s) fui s(anc)t<u=O>(s) Iesus fuit fui s(anc)t<u=O>(s) Iesus

<b>comment:</b> DOI: <a href="https://doi.org/10.15581/012.26.004" target="_blank">10.15581/012.26.004</a>

As is now scraped to CSV: Inscription attribute: Iesus s(anc)t<u=O>(s) ego Iesus sum ego fui s(anc)t<u=O>(s) ego s(anc)t<u=O>(s) fui / s(anc)t<u=O>(s) Iesus fui s(anc)t<u=O>(s) Iesus est Iesus s(anc)t<u=O>(s) est / Ego [3] / s(anc)t<u=O>(s) Iesus fuit s(anc)t<u=O>(s) Iesus est Iesus fuit s(anc)t<u=O>(s) Iesus / s(anc)t<u=O>(s) fuit Iesus s(anc)t<u=O>(s) Iesus fuit s(anc)t<u=O>(s) Iesus fuit / s(anc)t<u=O>(s) Iesus est Iesus est alef s(anc)t<u=O>(s) Iesus Iesus / Iesus s(anc)t<u=O>(s) Iesus alef est |(omega) est s(anc)t<u=O>(s) Iesus / Iesus s(anc)t<u=O>(s) fui s(anc)t(us) Iesus fui s(anc)t<u=O>(s) I[esus] / s(anc)t<u=O>(s) Iesus s(anc)t<u=O>(s) Iesus Iesus [fuit] / s(anc)t<u=O>(s) Iesus Iesus est s(anc)t(us) Iesus fuit s(anc)t<u=O>(s) / Iesus s(anc)t<u=O>(s) fuit Iesus ego fui I[esus] / Iesus s(anc)t(us) fuit s(anc)t<u=O>(s) s(anc)t<u=O>(s) fuit s(anc)t<u=O>(s) I[esus] / Iesus s(anc)t(us) fuit s(anc)t<u=O>(s) Iesus fuit Iesus fuit s(anc)t<u=O>(s) / Iesus s(anc)t<u=O>(s) fui s(anc)t<u=O>(s) Iesus fuit fui s(anc)t<u=O>(s) Iesus\n\n10.15581/012.26.004

Desired outcome: Inscription attribute: Iesus s(anc)t<u=O>(s) ego Iesus sum ego fui s(anc)t<u=O>(s) ego s(anc)t<u=O>(s) fui / s(anc)t<u=O>(s) Iesus fui s(anc)t<u=O>(s) Iesus est Iesus s(anc)t<u=O>(s) est / Ego [3] / s(anc)t<u=O>(s) Iesus fuit s(anc)t<u=O>(s) Iesus est Iesus fuit s(anc)t<u=O>(s) Iesus / s(anc)t<u=O>(s) fuit Iesus s(anc)t<u=O>(s) Iesus fuit s(anc)t<u=O>(s) Iesus fuit / s(anc)t<u=O>(s) Iesus est Iesus est alef s(anc)t<u=O>(s) Iesus Iesus / Iesus s(anc)t<u=O>(s) Iesus alef est |(omega) est s(anc)t<u=O>(s) Iesus / Iesus s(anc)t<u=O>(s) fui s(anc)t(us) Iesus fui s(anc)t<u=O>(s) I[esus] / s(anc)t<u=O>(s) Iesus s(anc)t<u=O>(s) Iesus Iesus [fuit] / s(anc)t<u=O>(s) Iesus Iesus est s(anc)t(us) Iesus fuit s(anc)t<u=O>(s) / Iesus s(anc)t<u=O>(s) fuit Iesus ego fui I[esus] / Iesus s(anc)t(us) fuit s(anc)t<u=O>(s) s(anc)t<u=O>(s) fuit s(anc)t<u=O>(s) I[esus] / Iesus s(anc)t(us) fuit s(anc)t<u=O>(s) Iesus fuit Iesus fuit s(anc)t<u=O>(s) / Iesus s(anc)t<u=O>(s) fui s(anc)t<u=O>(s) Iesus fuit fui s(anc)t<u=O>(s) Iesus

Comments attribute: 10.15581/012.26.004

Examples of other inscriptions with a similar problem: EDCS-75000138, EDCS-75000139, EDCS-44500182 (143 inscriptions in total are affected, as a result of the HTML tag error)

Link to the CSVs with minimal examples (GitHub does not allow me to paste them here): https://github.com/sdam-au/EDCS_ETL/tree/master/output
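
For illustration only, the split described above could be sketched roughly as follows. This is not the scraper's actual code (the issue was eventually fixed by a rewrite). The function names `split_inscription_and_comment` and `_clean` are hypothetical, and the sketch assumes that the raw HTML carries the epigraphic angle brackets as entities (`&lt;u=O&gt;`) and that the comment always starts at a `<b>comment:</b>` marker, as in the example above.

```python
import html
import re

# Assumption: the comment section is introduced by this marker, as in EDCS-73700333.
COMMENT_MARKER = re.compile(r"<b>\s*comment:\s*</b>", re.IGNORECASE)
# Matches real HTML tags; entity-escaped brackets such as &lt;u=O&gt; are not touched.
TAG = re.compile(r"<[^>]+>")


def _clean(fragment):
    """Drop HTML tags, decode entities, and collapse runs of whitespace."""
    text = TAG.sub("", fragment)
    text = html.unescape(text)
    return " ".join(text.split())


def split_inscription_and_comment(fragment):
    """Return (inscription, comment); comment is '' when no marker is present."""
    parts = COMMENT_MARKER.split(fragment, maxsplit=1)
    inscription = _clean(parts[0])
    comment = _clean(parts[1]) if len(parts) > 1 else ""
    return inscription, comment


if __name__ == "__main__":
    sample = (
        "Iesus s(anc)t&lt;u=O&gt;(s) ego Iesus sum\n\n"
        '<b>comment:</b> DOI: <a href="https://doi.org/10.15581/012.26.004" '
        'target="_blank">10.15581/012.26.004</a>'
    )
    inscription, comment = split_inscription_and_comment(sample)
    print(inscription)  # Iesus s(anc)t<u=O>(s) ego Iesus sum
    print(comment)      # DOI: 10.15581/012.26.004
```

Note that this keeps the "DOI:" label in the comment; dropping it to match the desired outcome exactly would be a separate step.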

petrifiedvoices commented 3 years ago

If only there were an easy way to extract the text of the inscriptions on its own... However, the HTML pages I have seen do not use ANY specific tag for the text of the inscription; they place the text in double quotes within the large <p>. The text may come after place or <noscript>, but I am not sure how consistent this is...

petrifiedvoices commented 3 years ago

fixed by rewrite