A collection of scrapers to obtain documents from German parliaments and related public institutions.
To run a full scrape with the current legislative term:
memorious run <scraper_name>
By default, documents (usually PDFs) and their metadata (JSON files) are stored in ./data/<scraper_name>
All scrapers accept the following options (unless otherwise noted in the detailed description for each scraper) via env vars to filter the scraping:
DOCUMENT_TYPES
- major_interpellation or minor_interpellation (Große Anfrage / Kleine Anfrage)

LEGISLATIVE_TERMS
- an integer, refer to the detailed scraper description for possible values

START_DATE
- a date (ISO format) to scrape only documents published since this date

END_DATE
- a date (ISO format) to scrape only documents published until this date

For example, to scrape all minor interpellations for Bayern from the last (not the current) legislative term, but only since 2018:
DOCUMENT_TYPES=minor_interpellation LEGISLATIVE_TERMS=17 START_DATE=2018-01-01 memorious run by
By default, scrapers only execute requests and download documents they have not seen before. To disable this behaviour, set
MEMORIOUS_INCREMENTAL=false
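For instance, a wrapper script could set these variables before invoking the memorious CLI. A minimal sketch, where the scraper name by and the filter values are just examples:

import os
import subprocess

# example filter configuration using the env vars documented above
env = os.environ.copy()
env.update({
    "DOCUMENT_TYPES": "minor_interpellation",
    "START_DATE": "2018-01-01",
    "MEMORIOUS_INCREMENTAL": "false",  # also re-fetch documents seen before
})

# run the Bayern scraper with these filters applied
subprocess.run(["memorious", "run", "by"], env=env, check=True)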
German state parliaments:
Landtag Brandenburg
memorious run bb
The scraper uses the starweb implementation via this form: https://www.parlamentsdokumentation.brandenburg.de/starweb/LBB/ELVIS/servlet.starweb?path=LBB/ELVIS/LISSH.web&AdvancedSearch=yes
LEGISLATIVE_TERMS
- current: 7
- earliest: 1

DOCUMENT_TYPES
- Unfortunately, Brandenburg gives no results with answers for the types "Kleine Anfrage" or "Große Anfrage", so this option is unusable.
Abgeordnetenhaus Berlin
memorious run be
The backend used to be starweb, but recently (2021-06-24) changed to something completely new, which could still be something starweb-related, but according to the URLs it is called "portala". The scraper still requires some refining to work properly with the date / document_type options; for now, a START_DATE is always required to run.
The scraper sends JSON that looks like an Elasticsearch query via POST to this endpoint: https://pardok.parlament-berlin.de/portala/browse.tt.html, built from a query template.
Although the new frontend looks fancy, that doesn't mean the service is performant. With queries that are too large (a date range longer than a few months) it will give up and return a 502 error.
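To illustrate the point about query size, a client can split the requested date range into roughly month-sized chunks and send one POST per chunk. This is only a sketch: the payload below is a placeholder, not the actual query template the scraper fills in.

from datetime import date, timedelta
import requests

ENDPOINT = "https://pardok.parlament-berlin.de/portala/browse.tt.html"

def month_chunks(start: date, end: date):
    # yield (chunk_start, chunk_end) pairs of roughly one month each
    current = start
    while current < end:
        nxt = min(current + timedelta(days=30), end)
        yield current, nxt
        current = nxt

for chunk_start, chunk_end in month_chunks(date(2021, 1, 1), date(2021, 6, 1)):
    # placeholder payload -- the real scraper renders its Elasticsearch-like
    # query template with these dates instead
    payload = {"dateFrom": chunk_start.isoformat(), "dateTo": chunk_end.isoformat()}
    response = requests.post(ENDPOINT, json=payload, timeout=60)
    response.raise_for_status()  # an oversized range would typically fail here with a 502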
LEGISLATIVE_TERMS
- current: 18
- earliest: 11

DOCUMENT_TYPES
- written_interpellation (both "Große" and "Kleine" Anfragen)
Landtag von Baden-Württemberg
memorious run bw
The backend looks the same as be ("portala") but uses a different query template.
Bayerischer Landtag
memorious run by
The scraper uses this result page: https://www.bayern.landtag.de/parlament/dokumente/drucksachen/?dokumentenart=Drucksache&anzahl_treffer=10
LEGISLATIVE_TERMS
- current: 18
- earliest: 1 (but useful metadata starts at 5 [1962-66])

DOCUMENT_TYPES
- minor_interpellation
- major_interpellation
Bremische Bürgerschaft
memorious run hb
The scraper uses the starweb implementation via this form: https://paris.bremische-buergerschaft.de/starweb/paris/servlet.starweb?path=paris/LISSH.web
LEGISLATIVE_TERMS
- current: 20
- earliest: TODO

DOCUMENT_TYPES
- minor_interpellation
- major_interpellation
Hessischer Landtag
memorious run he
The scraper uses the starweb implementation via this form: http://starweb.hessen.de/starweb/LIS/servlet.starweb?path=LIS/PdPi.web
LEGISLATIVE_TERMS
- current: 20
- earliest: 14 (or: 8?) // TODO

DOCUMENT_TYPES
- minor_interpellation
- major_interpellation
Hamburgische Bürgerschaft
memorious run hh
The scraper uses the parldok [5.4.1] implementation via this form: https://www.buergerschaft-hh.de/parldok/formalkriterien
DOCUMENT_TYPES
- minor_interpellation
- major_interpellation

LEGISLATIVE_TERMS
- current: 22
- earliest: 16
Landtag Mecklenburg-Vorpommern
memorious run mv
The scraper uses the parldok [5.6.0] implementation via this form: https://www.dokumentation.landtag-mv.de/parldok/formalkriterien/
DOCUMENT_TYPES
- minor_interpellation
- major_interpellation

LEGISLATIVE_TERMS
- current: 7
- earliest: 1
Landtag Niedersachsen
memorious run ni
The scraper uses the starweb implementation via this form: https://www.nilas.niedersachsen.de/starweb/NILAS/servlet.starweb?path=NILAS/lissh.web
LEGISLATIVE_TERMS
- current: 18
- earliest: 10
Landtag Nordrhein-Westfalen
The scraper uses the base URL
https://www.landtag.nrw.de/home/dokumente/dokumentensuche/parlamentsdokumente/parlamentsdatenbank-suchergebnis.html?dokart=DRUCKSACHE&doktyp=KLEINE%20ANFRAGE&wp=18
and manipulates the GET parameters.
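A rough sketch of what manipulating those GET parameters looks like; the parameter names are taken from the URL above, while the mapping of document types to doktyp values is an assumption for illustration:

import requests

BASE = (
    "https://www.landtag.nrw.de/home/dokumente/dokumentensuche/"
    "parlamentsdokumente/parlamentsdatenbank-suchergebnis.html"
)

# assumed mapping of the document type options to the site's doktyp values
DOKTYP = {
    "minor_interpellation": "KLEINE ANFRAGE",
    "major_interpellation": "GROSSE ANFRAGE",
}

params = {
    "dokart": "DRUCKSACHE",
    "doktyp": DOKTYP["minor_interpellation"],
    "wp": 18,  # legislative term
}
response = requests.get(BASE, params=params, timeout=60)
response.raise_for_status()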
LEGISLATIVE_TERMS
- current: 18
- earliest: 10

DOCUMENT_TYPES
- minor_interpellation
- major_interpellation
Landtag Rheinland-Pfalz
memorious run rp
The scraper uses the starweb implementation via this form: https://opal.rlp.de/starweb/OPAL_extern/servlet.starweb?path=OPAL_extern/PDOKU.web
LEGISLATIVE_TERMS
- current: 18
- earliest: 11

DOCUMENT_TYPES
- minor_interpellation
- major_interpellation
Landtag Schleswig-Holstein
The scraper uses this base URL
http://lissh.lvn.parlanet.de/cgi-bin/starfinder/0?path=lisshfl.txt&id=FASTLINK&pass=&search=WP%3D19+AND+dtyp%3Dkleine&format=WEBKURZFL
and adjusts the GET parameters.
Landtag des Saarlandes
The scraper posts JSON queries to this URL: https://www.landtag-saar.de/umbraco/aawSearchSurfaceController/SearchSurface/GetSearchResults/
Sächsischer Landtag
This thing is just annoying: https://edas.landtag.sachsen.de/
But the scraper does work, more or less.
For now, only the current LEGISLATIVE_TERM (7) is possible.
Landtag von Sachsen-Anhalt
memorious run st
The scraper uses the starweb implementation via this form: https://padoka.landtag.sachsen-anhalt.de/starweb/PADOKA/servlet.starweb?path=PADOKA/LISSH.web&AdvancedSuche
LEGISLATIVE_TERMS
- current: 7
- earliest: 1

DOCUMENT_TYPES
- minor_interpellation
- major_interpellation
Thüringer Landtag
memorious run th
The scraper uses the parldok [5.6.5] implementation via this form: http://parldok.thueringen.de/ParlDok/formalkriterien/
DOCUMENT_TYPES
- minor_interpellation
- major_interpellation

LEGISLATIVE_TERMS
- current: 7
- earliest: 1
Other scrapers:

Dokumentations- und Informationssystem für Parlamentsmaterialien - API
memorious run dip
There is a really nice API. The scraper uses this base URL (with the public API key): https://search.dip.bundestag.de/api/v1/drucksache?apikey=GmEPb1B.bfqJLIhcGAsH9fTJevTglhFpCoZyAAAdhp&f.zuordnung=BT
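A minimal sketch of querying the API directly with the public key. The cursor-based pagination and the documents field follow the public DIP API documentation, but treat the exact field names as assumptions:

import requests

URL = "https://search.dip.bundestag.de/api/v1/drucksache"
API_KEY = "GmEPb1B.bfqJLIhcGAsH9fTJevTglhFpCoZyAAAdhp"  # public api key

cursor = None
while True:
    params = {"apikey": API_KEY, "f.zuordnung": "BT", "format": "json"}
    if cursor:
        params["cursor"] = cursor
    data = requests.get(URL, params=params, timeout=60).json()
    for doc in data.get("documents", []):
        print(doc.get("titel"))
    # the api signals the last page by returning an unchanged cursor
    if data.get("cursor") == cursor:
        break
    cursor = data.get("cursor")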
DOCUMENT_TYPES
- minor_interpellation
- major_interpellation
Parlamentsspiegel (gemeinsames Informationssystem der Landesparlamente)
memorious run parlamentsspiegel
The "Parlamentsspiegel" is an official aggregator page for the document systems of the german state parliaments.
The scraper uses this index page with configurable GET parameters: https://www.parlamentsspiegel.de/home/suchergebnisseparlamentsspiegel.html?view=kurz&sortierung=dat_desc&vorgangstyp=ANFRAGE&datumVon=15.05.2021
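For illustration, the same index page can be requested with a different start date; note the German DD.MM.YYYY format of datumVon (parameter names as in the URL above):

from datetime import date
import requests

BASE = "https://www.parlamentsspiegel.de/home/suchergebnisseparlamentsspiegel.html"

start = date(2021, 5, 15)
params = {
    "view": "kurz",
    "sortierung": "dat_desc",
    "vorgangstyp": "ANFRAGE",
    "datumVon": start.strftime("%d.%m.%Y"),  # the site expects German date format
}
response = requests.get(BASE, params=params, timeout=60)
response.raise_for_status()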
The "Parlamentsspiegel" doesn't distinguish between minor and major
interpellations for the requests, so the DOCUMENT_TYPES
option is not
available.
Ausarbeitungen der Wissenschaftlichen Dienste des Deutschen Bundestages
memorious run sehrgutachten
Despite what the name suggests, it is not technically based on https://sehrgutachten.de but scrapes the website of the Bundestag directly.
The scraper fetches documents from the Wissenschaftliche Dienste directly by calling and parsing this ajax endpoint: https://www.bundestag.de/ajax/filterlist/de/dokumente/ausarbeitungen/474644-474644/?limit=10
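As a sketch, the endpoint returns an HTML fragment from which PDF links can be extracted. The limit parameter comes from the URL above; the assumption that documents are linked as .pdf files is for illustration only:

import requests
from lxml import html

AJAX_URL = (
    "https://www.bundestag.de/ajax/filterlist/de/dokumente/"
    "ausarbeitungen/474644-474644/"
)

response = requests.get(AJAX_URL, params={"limit": 10}, timeout=60)
response.raise_for_status()

# parse the returned HTML fragment and collect links that point to pdf files
fragment = html.fromstring(response.text)
pdf_links = [href for href in fragment.xpath("//a/@href") if href.lower().endswith(".pdf")]
print(pdf_links)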
The options DOCUMENT_TYPES and LEGISLATIVE_TERMS are not available, but START_DATE and END_DATE are.
Verfassungsschutzberichte des Bundes und der Länder
memorious run vsberichte
Scraped from the API of https://vsberichte.de
This scraper doesn't need to run frequently as there is a new report only once a year.
There are no filter options available.
The scrapers are based upon memorious. Therefore, for each scraper there is a yaml file in ./dokukratie/ that defines how the scraper should run.
Some scrapers work with just a yaml definition, like Bayern: ./dokukratie/by.yml
Some others have their own custom python implementation, like Baden-Württemberg: ./dokukratie/scrapers/bw.py
Some others share the same software for their document database backend/frontend, mainly starweb or parldok:

starweb
- Used by (as described above): bb, hb, he, ni, rp, st
- Code: ./dokukratie/scrapers/starweb.py

parldok
- Used by (as described above): hh, mv, th
- Code: ./dokukratie/scrapers/parldok.py
The scrapers generate a metadata database for mmmeta to consume.
This is useful for client applications to track the state of files without downloading the actual files, e.g. to know which files have already been consumed and to only download newer ones.
How to use mmmeta for dokukratie:
Currently used version: 0.4.0
pip install mmmeta
aws s3 sync s3://<bucket_name>/<scraper_name>/_mmmeta ./data/<scraper_name>
This will download the necessary metadata CSV files (./db/) and the config.yml.
Either set the env var MMMETA=./data/<scraper_name> or change into the base directory ./data/<scraper_name> where the subdirectory _mmmeta exists. Then run:
mmmeta update
or, within python applications:
from mmmeta import mmmeta
# init:
m = mmmeta() # env var MMMETA
# OR
m = mmmeta("./data/<scraper_name>")
# update (or generate) local state
m.update()
If this runs into sqlalchemy migration problems, there is an attempt to fix it (perhaps make a backup of the local state.db first):
mmmeta update --cleanup
or, within python applications:
m.update(cleanup=True)
This will clean up data in the state.db according to config.yml but will leave columns starting with an underscore untouched.
Soft-deleted files (files that for some reason no longer exist in the S3 bucket) are marked with __deleted=1 and have a __deleted_reason property.
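A client loop would typically skip those entries; a small sketch, assuming the __deleted column is present on every row (which may not be the case):

for file in m.files:
    if file["__deleted"]:  # soft-deleted in the bucket
        print(file["__deleted_reason"])
        continue
    # ... otherwise process the file as shown below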
for file in m.files:
# `file` has metadata as dictionary keys, e.g.:
publisher = file["publisher"]
# ...
# s3 location:
file.remote.uri
# alter state data, e.g.:
# as a convention, local state data should start with _
# to not confuse it with the remote metadata
file["_foo"] = bar
file["_downloaded"] = True
file.save()
Install the base dependencies:
make install
additional dependencies for local development:
make install.dev
additional dependencies for production deployment (i.e. psycopg2):
make install.prod
Install test utils:
make install.test
Then,
make test
This will run through all the scrapers (see details in ./tests/test_scrapers.py) with different combinations of input parameters and stop after the first downloaded document.
Or, to test only a specific scraper:
make test.<scraper_name>
Test all scrapers with the starweb implementation:
make test.starweb
Test all scrapers with the parldok implementation:
make test.parldok