villiamriegler / Pillai-database


Automating Fass data #11

Closed villiamriegler closed 8 months ago

villiamriegler commented 8 months ago

Merging this PR may complete work item DA003. Related document: DA003

Changes

Performance changes

The scraper has been heavily parallelized to run in under an hour. On a ~100 Mbit/s connection and an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, we find that scraping all of Fass takes about 35 minutes.

The overall approach to performance is to batch requests in parallel and then scrape the retrieved pages in parallel, so scraping proceeds while pages are still being requested, without blocking.

From code exploration and research I found that the scraping was mostly IO-bound by the requests. To achieve higher performance we therefore batch N medicine requests at a time in parallel. This number N is balanced within roughly 1-100: N = 1 gives no performance enhancement at all, while N = 100 errors as it overflows memory. Note that 1 medicine request here implies 6 actual requests, since every page of a single medicine is also batched inside it; thus 100 medicine requests are actually 600 page requests. The best results were found when N is somewhere between 25 and 50, as this provides time for almost continuous parallelized scraping. A sketch of this batching scheme follows below.
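A minimal sketch of the batching idea, assuming aiohttp/asyncio; `BASE_URL`, `fetch_medicine`, and `parse_pages` are hypothetical names, and the actual scraper in src/scrapers may be structured differently:

```python
import asyncio
import aiohttp

PAGES_PER_MEDICINE = 6  # one "medicine request" fans out into 6 page requests
BATCH_SIZE = 25         # N; values around 25-50 gave the best throughput

BASE_URL = "https://example.invalid/fass"  # placeholder, not the real Fass URL scheme

def parse_pages(pages: list[str]) -> None:
    """Stub for the CPU-bound scraping of one medicine's pages."""
    ...

async def fetch_page(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()

async def fetch_medicine(session: aiohttp.ClientSession, medicine_id: str) -> list[str]:
    # All 6 pages of a single medicine are requested in parallel.
    urls = [f"{BASE_URL}/{medicine_id}?page={i}" for i in range(PAGES_PER_MEDICINE)]
    return await asyncio.gather(*(fetch_page(session, u) for u in urls))

async def scrape_all(medicine_ids: list[str]) -> None:
    async with aiohttp.ClientSession() as session:
        for start in range(0, len(medicine_ids), BATCH_SIZE):
            batch = medicine_ids[start:start + BATCH_SIZE]
            # Issue one batch of N medicine requests (N * 6 page requests)
            # concurrently; medicines are parsed as they finish, so scraping
            # overlaps with requests still in flight.
            for coro in asyncio.as_completed(
                [fetch_medicine(session, m) for m in batch]
            ):
                parse_pages(await coro)
```

With this structure, the event loop keeps the connection saturated while finished medicines are handed off to parsing, which matches the IO-bound profile described above.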

Automated runs

The new scraper has an associated GitHub workflow set up to run at 00:00 every day. The workflow is currently untested, since workflows can only be run from the main branch. Therefore the workflow in this PR pushes changes to the crawler-perf branch, to avoid nuking the main branch if something is wrong. Once tested, this may be changed to either push to main or to a new branch that can be merged when needed.
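For reference, such a scheduled workflow would look roughly like the following. This is a hypothetical YAML sketch, not the file in this PR; the entry-point script and dependency file names are assumptions:

```yaml
name: Scrape Fass
on:
  schedule:
    - cron: "0 0 * * *"         # every day at 00:00 UTC
  workflow_dispatch:             # allow manual runs for testing

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: crawler-perf      # push results here, not to main
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt     # assumed dependency file
      - run: python src/scrapers/scrape_fass.py  # hypothetical entry point
      - name: Commit and push scraped data
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add src/scrapers/data
          git commit -m "Automated Fass scrape" || echo "No changes"
          git push origin crawler-perf
```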

Side effects

Some medicines no longer contain the same data as before; this can be explored in the committed data files. I argue that the new implementation gives correct results and that the old version had some incorrect scraping. The reasons behind the differences vary, but the two most common ones were revisions of the document and the old scraper sometimes scraping past the next <a>-tag (ex. src/scrapers/data/products/19571115000028.json below).
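To illustrate the boundary issue, here is a hypothetical sketch (using BeautifulSoup, which the real parser may or may not use) of extracting text that stops at the next <a>-tag instead of running past it:

```python
from bs4 import BeautifulSoup, Tag

def text_until_next_anchor(start_tag: Tag) -> str:
    """Collect the text of the siblings after start_tag, stopping at the
    next <a>-tag (the old scraper's failure mode was not stopping here)."""
    parts = []
    for sibling in start_tag.next_siblings:
        if getattr(sibling, "name", None) == "a":
            break  # stop at the next anchor; do not scrape past it
        parts.append(sibling.get_text() if isinstance(sibling, Tag) else str(sibling))
    return "".join(parts).strip()

# Hypothetical usage on Fass-like markup:
html = '<a id="dose">Dosering</a> 1 tablett dagligen <a id="next">Biverkningar</a> ...'
soup = BeautifulSoup(html, "html.parser")
print(text_until_next_anchor(soup.find("a", id="dose")))  # -> "1 tablett dagligen"
```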

The new implementation also does not error out under any circumstance: failed requests are simply retried, ensuring all data is always gathered. A sketch of this retry loop follows below.
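A sketch of that retry behaviour, assuming the same aiohttp setup as above; the real backoff policy (and whether there is one) may differ:

```python
import asyncio
import aiohttp

async def fetch_with_retry(session: aiohttp.ClientSession, url: str,
                           max_backoff: float = 60.0) -> str:
    """Retry a page request until it succeeds instead of raising,
    so no medicine is ever dropped from a run."""
    delay = 1.0
    while True:
        try:
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            await asyncio.sleep(delay)           # back off before retrying
            delay = min(delay * 2, max_backoff)  # exponential backoff, capped
```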