Scraping metadata of documents by document reference

sandervh14 commented 6 months ago

For each document reference we extracted while processing the plenary reports, scrape the metadata, including the URLs of the documents so we can fetch them.

Example metadata: https://www.dekamer.be/kvvcr/showpage.cfm?section=/flwb&language=nl&cfm=/site/wwwcfm/flwb/flwbn.cfm?legislat=55&dossierID=3495

I found this page by entering document reference 3495 in the search bar on the page that gives the full overview of documents: https://www.dekamer.be/kvvcr/showpage.cfm?section=/flwb&language=nl&cfm=ListDocument.cfm.

Looking at the first URL above, though, we should be able to scrape the metadata and documents simply by filling in the legislature and the document reference in that URL. We won't need the second URL for scraping.
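
A minimal sketch of that idea in Python, assuming the URL pattern of the example above holds for other legislatures and document references (only the 55 / 3495 case was checked here). The function name `build_dossier_url` and its `language` parameter are hypothetical, not part of the existing code base:

```python
# Base page taken from the example URL in this issue.
BASE_URL = "https://www.dekamer.be/kvvcr/showpage.cfm"


def build_dossier_url(legislature: int, dossier_id: int, language: str = "nl") -> str:
    """Build the metadata page URL for one document reference.

    Assumption: the site accepts the nested, unencoded query string exactly
    as it appears in the example URL, so we use plain string formatting
    rather than urllib.parse.urlencode (which would escape the slashes).
    """
    return (
        f"{BASE_URL}?section=/flwb&language={language}"
        f"&cfm=/site/wwwcfm/flwb/flwbn.cfm"
        f"?legislat={legislature}&dossierID={dossier_id}"
    )


# Reproduces the example URL for legislature 55, document reference 3495.
print(build_dossier_url(55, 3495))
```

Fetching and parsing the page is left out on purpose, since the HTML structure of the metadata page has not been looked at yet.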

karel1980 commented 6 months ago

I've added a command to download the referenced documents. It requires that you first produce a 'plenaries.json'. So, assuming you've already downloaded the plenary HTML files:

td-plenaries-json
td-download-referenced-documents

karel1980 commented 6 months ago

There are some references (e.g. MOT nr 483 in plenary 298) that we currently don't parse. Let's keep this ticket open until we figure out what they actually refer to.

transparentdemocracy / voting-data

Scraping metadata of documents by document reference #36