robbi5 / kleineanfragen

Collecting kleine Anfragen from Parlamentsdokumentationssystemen for easy search- and linkability
https://kleineanfragen.de
MIT License

Update Saarland Scraper #121

Open enricogenauck opened 5 years ago

enricogenauck commented 5 years ago

The Saarland Landtag updated their website and listing of published papers. This pull request tries to adjust the scraper to the new listing.

It adds webmock for higher-level testing of the scraper's general input. This way the scraping itself can be tested as a black box, and the tests become more robust to internal changes and rewrites of the scraper class.
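To illustrate the black-box idea, here is a minimal sketch that uses only the Ruby standard library. It swaps webmock's HTTP stubbing for an injected fetcher so that it runs without any gems. The class name `ListingScraper`, the URL, and the HTML layout are all hypothetical and are not taken from this pull request.

```ruby
# Hypothetical scraper: the class name and HTML layout are illustrative only.
class ListingScraper
  # fetcher: any callable that takes a URL and returns the response body.
  def initialize(url, fetcher)
    @url = url
    @fetcher = fetcher
  end

  # Returns an array of { reference:, title: } hashes found in the listing.
  def scrape
    html = @fetcher.call(@url)
    pattern = %r{<a href="[^"]*">\s*(\d+/\d+)\s*</a>\s*<span>([^<]*)</span>}
    html.scan(pattern).map do |reference, title|
      { reference: reference, title: title.strip }
    end
  end
end

# Canned response standing in for the live listing page (assumed markup).
CANNED_HTML = <<~HTML
  <ul>
    <li><a href="/doc/1">16/711</a> <span>Example paper A</span></li>
    <li><a href="/doc/2">16/712</a> <span> Example paper B </span></li>
  </ul>
HTML

# The fake fetcher records which URLs were requested and returns the canned body.
requested = []
fake_fetcher = lambda do |url|
  requested << url
  CANNED_HTML
end

papers = ListingScraper.new('https://example.invalid/listing', fake_fetcher).scrape
papers.each { |paper| puts paper.inspect }
```

The test only sees the input (the canned HTML) and the output (the parsed papers), so the scraper's internals can be rewritten without touching it. With webmock, the fake fetcher would presumably be replaced by a stubbed request for the listing URL.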

Side note 1: Unfortunately, I don't have the insight needed to test the scraper in a real-world scenario, since I don't yet understand the surrounding pieces such as the body model. Maybe there is, or could be, an easy-to-run rake command that checks whether the general workflow of scraping and extracting files still works with a live internet connection? Something like an integration test under real-world conditions.

Side note 2: I stumbled upon this project just a few days ago and it's brilliant! Thanks for all your work, @robbi5!

robbi5 commented 5 years ago

Thank you for the PR and sorry for the long delay.

I've added code for the Detail Scraper today. For the Overview Scraper, I think a bit of pagination is needed to get older published papers. The methods for this would be scrape_paginated (see e.g. BayernLandtagScraper) and supports_pagination?.
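As a rough sketch of what such pagination could look like: only the method names scrape_paginated and supports_pagination? come from this thread. The class name, the constructor, the block signature, and the in-memory page source are assumptions for illustration and may not match BayernLandtagScraper or the project's base class.

```ruby
# Hypothetical overview scraper; the real base class and signatures may differ.
class PaginatedOverviewSketch
  PER_PAGE = 2

  # fetch_page: callable taking a 1-based page number and returning an array
  # of items for that page (an empty array once there are no more pages).
  def initialize(fetch_page)
    @fetch_page = fetch_page
  end

  # Signals to the caller that scrape_paginated is available.
  def supports_pagination?
    true
  end

  # Yields one page of results at a time and stops at the first empty page.
  def scrape_paginated
    return enum_for(:scrape_paginated) unless block_given?

    page = 1
    loop do
      items = @fetch_page.call(page)
      break if items.empty?

      yield items
      page += 1
    end
  end
end

# In-memory stand-in for the listing, newest paper first (made-up references).
ALL_PAPERS = %w[16/711 16/710 16/709 16/708 16/707].freeze
fetch_page = lambda do |page|
  ALL_PAPERS.each_slice(PaginatedOverviewSketch::PER_PAGE).to_a[page - 1] || []
end

scraper = PaginatedOverviewSketch.new(fetch_page)
pages = []
scraper.scrape_paginated { |items| pages << items }
pages.each_with_index { |items, i| puts "page #{i + 1}: #{items.join(', ')}" }
```

Yielding page by page lets the caller stop early, for example once it reaches papers it has already imported, instead of walking the whole archive on every run.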

Rake commands for calling the scrapers already exist, but I see that they could be hard to follow because they trigger async scraping. For development and debugging, synchronous scraping is easier to follow.

My current method for testing is to use the Rails console and call SaarlandScraper::Detail.new(16, 711).scrape{ |x| puts x.inspect }. For a full import of a single paper through the scraping pipeline, I use ImportPaperJob.perform_now(Body.find_by_state('SL'), 16, 711); for all (new) papers, I usually call ImportNewPapersJob.perform_now(Body.find_by_state('SL'), 16).