spraakbanken / swegov-opendata-rs


swegov-opendata-rs

Tools used for collecting SFS (Svensk Författningssamling) from Riksdagens öppna data.

MIT licensed

Maturity badge - level 1


fetch-sfs

Binary to run for collecting SFS.

Uses webcrawler and opendata-spider.

Takes roughly 1 hour to fetch all SFS data.

opendata-spiders

Lives in opendata-spider.

Uses swegov-opendata.

sfs

Contains concrete spider for collecting SFS.

This spider generates URLs that search for documents of type SFS in 20-year spans, using the data.riksdagen.se/dokumentlista path.

Each list is scraped for dok_id values, which are used to fetch the individual documents, and for nasta_sida, which is used to fetch the next page of the dokumentlista.
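As a rough illustration of the 20-year search spans, the sketch below splits a range of years into spans and builds one dokumentlista URL per span. The function names are hypothetical, and the query parameters (doktyp, from, tom, utformat) are assumptions about the Riksdagen API, not taken from this repository.

```rust
/// Split an inclusive range of years into consecutive spans of at most
/// `span` years. Hypothetical helper, not part of this repository.
fn year_spans(first: u32, last: u32, span: u32) -> Vec<(u32, u32)> {
    assert!(span > 0, "span must be at least one year");
    let mut spans = Vec::new();
    let mut start = first;
    while start <= last {
        // The last span is cut short if the range is not a multiple of `span`.
        let end = (start + span - 1).min(last);
        spans.push((start, end));
        start = end + 1;
    }
    spans
}

/// Build a search URL for SFS documents dated within one span.
/// The query parameter names are assumptions about the API.
fn dokumentlista_url(from_year: u32, to_year: u32) -> String {
    format!(
        "https://data.riksdagen.se/dokumentlista/?doktyp=sfs&from={}-01-01&tom={}-12-31&utformat=json",
        from_year, to_year
    )
}

/// One start URL per span; each response would then be scraped for
/// dok_id and nasta_sida as described above.
fn start_urls(first: u32, last: u32) -> Vec<String> {
    year_spans(first, last, 20)
        .into_iter()
        .map(|(from, to)| dokumentlista_url(from, to))
        .collect()
}
```

Searching in bounded spans keeps each result list manageable, and nasta_sida then only has to page within one span.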

All fetched pages are stored on disk in JSON format, except for pages with HTML fragments, which are stored as-is. The documents are grouped by year.
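The on-disk layout could look roughly like the following sketch, which places each page in a directory named after its year. The `PageKind` enum, the `output_path` function, the file extensions and the example dok_id are all assumptions made for illustration; the actual layout used by the spider may differ.

```rust
use std::path::{Path, PathBuf};

/// Kind of fetched page (hypothetical type, for illustration only).
enum PageKind {
    /// Page stored in JSON format.
    Json,
    /// Page with an HTML fragment, stored as-is.
    HtmlFragment,
}

/// Compute where a fetched page would be written: `<root>/<year>/<dok_id>.<ext>`.
/// The extension choice is an assumption.
fn output_path(root: &Path, year: u32, dok_id: &str, kind: PageKind) -> PathBuf {
    let ext = match kind {
        PageKind::Json => "json",
        PageKind::HtmlFragment => "html",
    };
    root.join(year.to_string())
        .join(format!("{}.{}", dok_id, ext))
}
```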

This spider handles the following inconsistencies in the API.

sfs-corpus

Uses swegov-opendata. Builds corpus files for processing with Sparv.

swegov-opendata

Data model for the documents and document lists from Riksdagens öppna data, with serde serialization and deserialization.

webcrawler

Lives in webcrawler.

Generic web crawler that defines an interface for spiders.
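To give a feel for what a spider interface of this kind can look like, here is a minimal synchronous sketch using only the standard library. The trait, its method names (`start_urls`, `scrape`, `process`) and the `crawl` function are hypothetical and are not the webcrawler crate's actual API; the in-memory spider stands in for real HTTP requests.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical spider interface; names are illustrative only.
trait Spider {
    type Item;
    /// URLs the crawl starts from.
    fn start_urls(&self) -> Vec<String>;
    /// Visit one URL and return the scraped items plus URLs to follow.
    fn scrape(&self, url: &str) -> (Vec<Self::Item>, Vec<String>);
    /// Handle one scraped item, for example by storing it.
    fn process(&mut self, item: Self::Item);
}

/// Breadth-first crawl that visits each URL at most once and stops
/// after `max_pages` visits. Returns the number of visited URLs.
fn crawl<S: Spider>(spider: &mut S, max_pages: usize) -> usize {
    let mut queue: VecDeque<String> = spider.start_urls().into();
    let mut seen: HashSet<String> = queue.iter().cloned().collect();
    let mut visited = 0;
    while visited < max_pages {
        let url = match queue.pop_front() {
            Some(u) => u,
            None => break,
        };
        visited += 1;
        let (items, next) = spider.scrape(&url);
        for item in items {
            spider.process(item);
        }
        for link in next {
            // `insert` returns false for URLs that were already queued.
            if seen.insert(link.clone()) {
                queue.push_back(link);
            }
        }
    }
    visited
}

/// Spider over an in-memory map of pages (URL -> linked URLs),
/// used here instead of real network access.
struct InMemorySpider {
    pages: HashMap<String, Vec<String>>,
    collected: Vec<String>,
}

impl Spider for InMemorySpider {
    type Item = String;

    fn start_urls(&self) -> Vec<String> {
        vec!["page-1".to_string()]
    }

    fn scrape(&self, url: &str) -> (Vec<String>, Vec<String>) {
        let links = self.pages.get(url).cloned().unwrap_or_default();
        (vec![url.to_string()], links)
    }

    fn process(&mut self, item: String) {
        self.collected.push(item);
    }
}
```

Keeping the crawl loop generic over the trait means a concrete spider, such as the SFS one, only has to describe what to fetch and what to do with the result.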

The spiders work in 2 steps,

References

MSRV Policy

The MSRV (Minimum Supported Rust Version) is fixed for a given minor (1.x) version. However, it can be increased when bumping minor versions, i.e. going from 1.0 to 1.1 allows us to increase the MSRV. Users unable to increase their Rust version can use an older minor version instead. Below is a list of swegov-opendata-rs versions and their MSRV:

Note, however, that swegov-opendata-rs also has dependencies, which might have different MSRV policies. We try to stick to the above policy when updating dependencies, but this is not always possible.