This repo collects the votes in the Bundesrat. For a website that presents the data, check out the Bundesrat Scraper Website as well as a live demo on Render.
The scraper and website, including the data and the code for both, are unofficial. The Bundesrat has nothing to do with them. There is no warranty that the scraped data is correct or complete, or that the website displays the correct information.
The plan:
- bundesrat contains a scraper that gets the sessions and their agenda items (TOPs) and puts them in a file called sessions.json.
- Each state folder ($STATE) contains a scraper that stores the state's voting behaviour in the $STATE/sessions_tops.json file, together with the links to the original documents in $STATE/sessions_urls.json.
Everything is tested under Python 3.6.8, so please install Anaconda to use this version of Python. The setup is optimized for Arch Linux, but should work with other Linux distributions as well. The pdftohtml dependency is required for the pdfcutter tool to work.
pip install --user jupyter
yay -S anaconda --noconfirm # Or use your distribution's package manager
yay -S poppler --noconfirm # Or use your distribution's package manager; provides the pdftohtml program
source /opt/anaconda/bin/activate root
conda create -n py368 python=3.6 ipykernel
sudo conda install -y nb_conda_kernels
conda activate py368
sudo conda install -y lxml
pip install --user pdfcutter requests lxml
pip install --user wand # Only needed for the graphical debugger of pdfcutter
sudo ipython3 kernel install # Otherwise Jupyter can't find the kernel
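After installation, a quick sanity check can confirm that the pieces mentioned above are in place. The following is a minimal sketch and not part of the repo; the function name check_environment and the report keys are made up for illustration. It only checks that the Python version is at least 3.6, that the pdfcutter, requests and lxml modules can be found, and that pdftohtml is on the PATH.

```python
import importlib.util
import shutil
import sys


def check_environment(modules=("pdfcutter", "requests", "lxml"), tools=("pdftohtml",)):
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    report = {"python>=3.6": sys.version_info >= (3, 6)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        report["module:" + name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the program is not on the PATH
        report["tool:" + tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(("ok      " if ok else "MISSING ") + requirement)
```

Run it inside the activated py368 environment, since that is where the modules were installed.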
To scrape the sessions and their agenda items, connect to the internet and open the bundesrat-scraper with:
source /opt/anaconda/bin/activate py368
jupyter notebook bundesrat-scraper/bundesrat/bundesrat_scraper.ipynb
Then run all cells of the notebook. If import pdfcutter fails inside Jupyter, delete Jupyter's kernel folder (the kernel path is shown in the output of jupyter notebook scraper.ipynb) and re-open Jupyter.
Do this whenever there has been a new Bundesrat session that you want to scrape. Beforehand, you might have to delete the session.json file and the _cache folder.
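If you prefer to do this cleanup from Python, the following is a minimal sketch. It assumes that session.json and _cache live in the same folder, which you pass in yourself; the function name reset_session_data is hypothetical and not part of the repo.

```python
import shutil
from pathlib import Path


def reset_session_data(folder, json_name="session.json", cache_name="_cache"):
    """Delete the session file and the cache folder inside `folder`, if present.

    Returns the list of paths that were actually removed.
    """
    folder = Path(folder)
    removed = []
    json_path = folder / json_name
    if json_path.is_file():
        json_path.unlink()
        removed.append(json_path)
    cache_path = folder / cache_name
    if cache_path.is_dir():
        shutil.rmtree(cache_path)  # removes the folder and everything in it
        removed.append(cache_path)
    return removed
```

Calling it a second time is harmless: it returns an empty list when nothing is left to delete.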
To scrape the voting behaviour (Abstimmungsverhalten) of a state, run:
source /opt/anaconda/bin/activate py368
jupyter notebook bundesrat-scraper/$STATE/scraper.ipynb
Then run all cells of the notebook. If the bundesrat/session.json file has been extended by a session, the scraper will look for the voting behaviour (Abstimmungsverhalten) of the new sessions. You might want to disable any VPN, because some states will not let you download their documents otherwise.
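The idea of only scraping sessions that are not yet covered can be sketched as follows. This is an illustration, not the repo's code: it assumes that a state's results are a mapping keyed by the session number as a string (the notebook's own str(num) in session_tops check suggests this), and the function name sessions_to_scrape is hypothetical.

```python
def sessions_to_scrape(all_session_numbers, session_tops):
    """Return the session numbers that have no entry in session_tops yet, ascending."""
    done = set(session_tops)  # keys are assumed to be session numbers stored as strings
    return sorted(num for num in all_session_numbers if str(num) not in done)


# Sessions 971 and 972 are already scraped, so only 973 is left to do.
print(sessions_to_scrape([971, 972, 973], {"971": {}, "972": {}}))
```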
- The scraper code of a state is in its scraper_$STATE.py file. This file extends the PDFTextExtractor.py file, which is the code base for the scrapers.
- The Glossary.md file explains the terminology used in the code and comments.
- MainBoilerPlate.py contains the code base for collecting the links to the documents of the states.
- The helper.py file includes some common methods and adaptations of the pdfcutter library.
- The selectionVisualizer.py file contains a method for graphical debugging of pdfcutter selections in your browser of choice (e.g. Firefox), without the need for Jupyter. See the Tips section for more information.
The scraping of the states' voting behaviour consists of four parts:
1. Collecting the links to the documents (MainBoilerPlate.py/MainExtractorMethod)
2. Finding the position of a TOP inside a document (PDFTextExtractor.py/DefaultTOPPositionFinder). Some common implementations are available in the same file.
3. Extracting the text for a TOP (PDFTextExtractor.py/AbstractSenatsAndBRTextExtractor). Some common implementations are available in the same file.
4. Tying the steps together (PDFTextExtractor.py/TextExtractorHolder), where you can also define different scraping rules for steps 2 and 3 according to the current session and TOP.
If you want to execute the scraper without Jupyter, you can run
jupyter nbconvert --to script scraper.ipynb
Then remove the following lines from the beginning of the new Python file:
get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')
Finally, execute the Python file (after activating Anaconda).
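Removing those lines by hand can be automated. The sketch below is an assumption about how one might do it, not something shipped with the repo; the function name strip_ipython_magics is hypothetical. Note that it is broader than the instruction above: it drops every line that starts with get_ipython(), not only the two autoreload lines.

```python
def strip_ipython_magics(source):
    """Return `source` without the lines that start with a get_ipython() call."""
    kept = [
        line
        for line in source.splitlines(keepends=True)
        if not line.lstrip().startswith("get_ipython()")
    ]
    return "".join(kept)


converted = (
    "get_ipython().run_line_magic('load_ext', 'autoreload')\n"
    "get_ipython().run_line_magic('autoreload', '2')\n"
    "import json\n"
)
print(strip_ipython_magics(converted))
```

To apply it to the converted file, read the file into a string, pass it through the function, and write the result back.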
Assuming you have a selection from your pdfcutter instance, you can execute the following code to see the selected text parts in a picture:
#import pdfcutter
#cutter=pdfcutter.PDFCutter(filename='./_cache/_download_136845')
debugger = cutter.get_debugger()
#selection = cutter.all().filter(...)
debugger.debug(selection).draw(17) # Shows picture of the selected text of `selection` on page 17
Assuming you have a selection from your pdfcutter instance, you can execute the following code to see the selected text parts in a picture inside a browser (e.g. Firefox):
#import pdfcutter
#cutter=pdfcutter.PDFCutter(filename='./_cache/_download_136845')
import selectionVisualizer as dVis
#selection = cutter.all().filter(...)
dVis.showCutter(selection, pageNumber=17) # Shows picture of the selected text of `selection` on page 17
If you want to see the raw output that pdfcutter works on (e.g. to see which words form a chunk, or to see the coordinates of lines), use
pdftohtml $DOCUMENT.pdf -xml
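The resulting XML can also be inspected programmatically. The following is a minimal sketch using only the standard library; it assumes the usual pdftohtml layout of page elements containing text elements with integer top, left, width and height attributes. The function name iter_text_boxes is hypothetical, and this is not how pdfcutter itself reads the file.

```python
import xml.etree.ElementTree as ET


def iter_text_boxes(xml_data):
    """Yield (page_number, left, top, width, height, text) for each <text> element."""
    root = ET.fromstring(xml_data)
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for node in page.iter("text"):
            yield (
                page_number,
                int(node.get("left")),
                int(node.get("top")),
                int(node.get("width")),
                int(node.get("height")),
                "".join(node.itertext()),  # also collects text inside <b> or <i> children
            )


sample = (
    '<pdf2xml><page number="1" top="0" left="0" height="1262" width="892">'
    '<text top="100" left="50" width="200" height="15" font="0"><b>TOP 1</b></text>'
    "</page></pdf2xml>"
)
for box in iter_text_boxes(sample):
    print(box)
```

For a real document, read the generated XML file in binary mode and pass the bytes to the function, so that the encoding declared in the file is respected.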
If you want to see the PDF links that the MainBoilerPlate file sends to the TextExtractorHolder (i.e. what ends up in the session_urls.json file), use:
print(list(MainExtractorMethod(None)._get_pdf_urls()))
If you want to debug a scraping error inside one session (say 973), you can add the following two lines to the scraper Jupyter file (cell 8), replacing the existing check that is shown commented out:
# if str(num) in session_tops:
#     continue
if str(num) != "973":
    continue
Alternatively, you can remove session 973 from the appropriate session_tops.json file and rerun the Jupyter file.
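Removing a session from the file can be done by hand or with a few lines of Python. The sketch below assumes that the file holds a single JSON object keyed by session number as a string, as the str(num) in session_tops check suggests; the function name drop_session is hypothetical, and the indentation it writes may differ from what the scraper produces.

```python
import json
from pathlib import Path


def drop_session(json_path, session_number):
    """Remove one session from the JSON file. Returns True if it was present."""
    path = Path(json_path)
    data = json.loads(path.read_text(encoding="utf-8"))
    key = str(session_number)
    if key not in data:
        return False  # nothing to do, leave the file untouched
    del data[key]
    path.write_text(json.dumps(data, ensure_ascii=False, indent=4), encoding="utf-8")
    return True
```

For example, drop_session("session_tops.json", 973) removes the entry for session 973, so the next notebook run scrapes that session again.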