
A framework for extracting RDFa, JSON-LD, Microdata and text content from webpages
GNU Lesser General Public License v3.0

structured-data-extractor

A framework for extracting RDFa, JSON-LD, Microdata and text content from webpages. It relies on the RDF4J framework.

(Note: this documentation was written 6 years after initial development.)

Modules

The project contains several modules; the two described below are extractor-cli, the command-line interface, and extractor-server, which persists extraction results in a triplestore.

How to run the command-line interface

Here is a complete command-line example:

java -Xms512M -Xmx2048M -jar extractor-cli-1.0-SNAPSHOT-onejar.jar list \
  --input random-elis-test.txt \
  --output output \
  --exclude processed-urls.log \
  --namespaces eli,http://data.europa.eu/eli/ontology# \
               xsd,http://www.w3.org/2001/XMLSchema# \
               ev,http://eurovoc.europa.eu/ \
               corp,http://publications.europa.eu/resource/authority/corporate-body/ \
               lang,http://publications.europa.eu/resource/authority/language/ \
               m-app,http://www.iana.org/assignments/media-types/application/ \
               res-oj,http://publications.europa.eu/resource/oj/ \
               res-celex,http://publications.europa.eu/resource/celex/

The file random-elis-test.txt contains the following URIs:

http://data.europa.eu/eli/reg/2003/20/oj
http://data.europa.eu/eli/reg/2002/41/oj
http://data.europa.eu/eli/dec/2002/95/oj
http://data.europa.eu/eli/dir/1998/91/oj
http://data.europa.eu/eli/dec/1983/59/oj
http://data.europa.eu/eli/reg/2002/43/oj
http://data.europa.eu/eli/dir/2001/5/oj
http://data.europa.eu/eli/dec/1982/39/oj
http://data.europa.eu/eli/dir/1997/53/oj

Note: successfully processed URLs are logged to processed-urls.log (I can't remember exactly where in the code this happens), so that if the process fails for any reason, this file can be passed to a second run via --exclude, and URLs already processed are not extracted a second time.
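The resume mechanism described above can be sketched in a few lines of plain Java. This is an illustrative reconstruction, not the project's actual code: the class and method names are hypothetical, and only the behavior (skip every URL already present in processed-urls.log) is taken from the documentation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the --exclude resume mechanism: URLs listed in
// processed-urls.log are skipped on the next run.
public class ProcessedUrlFilter {

    // Returns the URLs from the input list that still need processing.
    public static List<String> remaining(List<String> inputUrls, Set<String> processed) {
        List<String> todo = new ArrayList<>();
        for (String url : inputUrls) {
            String trimmed = url.trim();
            if (!trimmed.isEmpty() && !processed.contains(trimmed)) {
                todo.add(trimmed);
            }
        }
        return todo;
    }

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("random-elis-test.txt");
        if (!Files.exists(input)) {
            return; // nothing to do without an input list
        }
        Set<String> processed = new HashSet<>();
        Path log = Paths.get("processed-urls.log");
        if (Files.exists(log)) {
            processed.addAll(Files.readAllLines(log));
        }
        for (String url : remaining(Files.readAllLines(input), processed)) {
            System.out.println(url);
        }
    }
}
```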

The folder extractor-cli/src/test/resources/URI-lists contains the URI lists that were used to extract datasets from Portugal, Eur-Lex, Ireland, Denmark and Italy for the 2018 Datathon. The resulting datasets were published on the EU ODP, see https://data.europa.eu/data/datasets?locale=en&minScoring=0&query=ELI&page=1

Note on the storage in the triplestore

The extractor-server module persists the result of the extraction in a triplestore. The triples extracted from each page are kept in a separate named graph, identified by the original URL of the page. Each named graph is described with a dcterms:modified triple giving the date of insertion, and a dcterms:isPartOf triple giving the domain name of the page. This makes it possible, for example, to select, query, or delete all named graphs coming from a known domain/website. See the class fr.sparna.rdf.extractor.RepositoryManagementListener, which is responsible for this.
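To make the storage scheme concrete, the snippet below builds the two metadata triples as N-Quads, with the page URL serving both as the named-graph IRI and as the subject. This is a sketch, not the actual RepositoryManagementListener code (which uses RDF4J); in particular, the exact shape of the dcterms:isPartOf value (here, an IRI built from the scheme and host of the page URL) is an assumption.

```java
import java.net.URI;
import java.time.LocalDate;

// Sketch of the per-graph metadata described above: dcterms:modified with the
// insertion date, dcterms:isPartOf with the domain name of the page.
public class GraphMetadata {

    static final String DCT = "http://purl.org/dc/terms/";
    static final String XSD_DATE = "http://www.w3.org/2001/XMLSchema#date";

    // Builds the two metadata statements as N-Quads lines, placed in the
    // named graph identified by the page URL itself.
    public static String metadataQuads(String pageUrl, LocalDate insertionDate) {
        URI uri = URI.create(pageUrl);
        // Assumption: the "domain name of the page" is stored as an IRI.
        String domain = "<" + uri.getScheme() + "://" + uri.getHost() + ">";
        String graph = "<" + pageUrl + ">";
        return graph + " <" + DCT + "modified> \"" + insertionDate
                + "\"^^<" + XSD_DATE + "> " + graph + " .\n"
             + graph + " <" + DCT + "isPartOf> " + domain + " " + graph + " .\n";
    }

    public static void main(String[] args) {
        System.out.print(metadataQuads(
                "http://data.europa.eu/eli/reg/2003/20/oj",
                LocalDate.of(2018, 6, 1)));
    }
}
```

With this layout, selecting all graphs from a known website reduces to a single pattern on dcterms:isPartOf, which is what makes per-domain deletion and querying cheap.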

How could this project be reused for ELI-based search?