Publikationen: 127 items expected (as of 2024-08-08)
new: (optional) control settings for the headless browser (via playwright):
feat: passing (customized) cookies from the spider-class to the headless browser
feat: enabling the built-in ad blocker of the "browserless/chromium" (headless browser) container
pipeline fixes for (huge) image downloads
in the past, several edge cases were observed where unexpectedly huge image files caused DecompressionBombErrors in Pillow's Image module, which in turn caused items to be dropped during the image-to-thumbnail conversion
by doubling the allowed max image size from the default setting (~89.5 megapixels) to ~179 MP, we achieve two things:
the screenshot pipeline should be able to handle (raw) image downloads of up to ~358 MP before encountering DecompressionBombErrors (Pillow only warns above the configured limit and raises the error above twice that limit)
if the downloaded image exceeds this threshold, the pipeline handles the edge case more gracefully by emitting a warning and falling back to a website screenshot (instead of dropping the item altogether)
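The arithmetic behind these thresholds can be sketched in plain Python. This is only an illustration of how Pillow's size check behaves (a warning above `PIL.Image.MAX_IMAGE_PIXELS`, an error above twice that value); the constant and function names below are hypothetical and are not taken from the pipeline code.

```python
# Pillow's default for PIL.Image.MAX_IMAGE_PIXELS (~89.5 MP)
PILLOW_DEFAULT_MAX_IMAGE_PIXELS = 1024 * 1024 * 1024 // 4 // 3  # 89_478_485

# Doubled limit as described in this PR (~179 MP); hypothetical constant name
DOUBLED_MAX_IMAGE_PIXELS = 2 * PILLOW_DEFAULT_MAX_IMAGE_PIXELS  # 178_956_970


def classify_image_size(width: int, height: int, max_pixels: int = DOUBLED_MAX_IMAGE_PIXELS) -> str:
    """Mirror the outcome of Pillow's decompression bomb check for a given image size.

    Returns "ok", "warning" (Pillow would emit a DecompressionBombWarning)
    or "error" (Pillow would raise a DecompressionBombError).
    """
    pixels = max(1, width) * max(1, height)
    if pixels > 2 * max_pixels:
        # with the doubled limit, this branch is reached above ~358 MP:
        # the case where the pipeline is described as falling back to a website screenshot
        return "error"
    if pixels > max_pixels:
        return "warning"
    return "ok"
```

With the doubled limit, a 20000 x 15000 image (300 MP) would only trigger a warning, whereas the same image would already raise an error under Pillow's default limit.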
Code Example: using a spider's custom_settings attribute to pass cookie data and enable the ad blocker
# example from bne_portal_spider.py
# playwright expects an array of cookies, which can be constructed as a list[dict] with "name" and "value" pairs
playwright_cookies: list[dict] = [
    {
        # this is one (of two) cookies that must be sent with HTTP requests
        # to skip the rendering of an (obtrusive) cookie banner on BNE-Portal.de
        "name": "gsbbanner",
        "value": "closed",
    }
]
custom_settings = {
    # enables uBlock Origin (disabled by default) within the dockerized headless browser
    "PLAYWRIGHT_ADBLOCKER": True,
    # makes the cookie data accessible within pipelines.py (ProcessThumbnailPipeline)
    # for individual requests with the headless browser
    "PLAYWRIGHT_COOKIES": playwright_cookies,
}
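As a rough sketch of how such optional settings could be read on the consuming side, the helper below falls back to the defaults described above (ad blocker off, no cookies) and checks that each cookie carries a "name" and a "value". The function `resolve_playwright_options` is hypothetical and only illustrates the idea; it is not the code used in pipelines.py.

```python
from typing import Any


def resolve_playwright_options(custom_settings: dict[str, Any]) -> tuple[bool, list[dict[str, str]]]:
    """Read the two optional controls from a spider's custom_settings dict (hypothetical helper)."""
    # the ad blocker is described as disabled by default
    adblocker_enabled = bool(custom_settings.get("PLAYWRIGHT_ADBLOCKER", False))
    # cookies are optional; treat a missing or empty setting as "no cookies"
    cookies = list(custom_settings.get("PLAYWRIGHT_COOKIES") or [])
    for cookie in cookies:
        missing = {"name", "value"} - set(cookie)
        if missing:
            raise ValueError(f"cookie {cookie!r} is missing keys: {sorted(missing)}")
    return adblocker_enabled, cookies
```

A spider that defines neither setting would then resolve to `(False, [])`, which keeps both controls strictly opt-in.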
While the pipelines will automatically use the provided custom_settings dict, you can also (manually) use these controls within the getUrlData method of our WebTools class (see: converter/web_tools.py).
This PR includes the following changes:
bne_portal_spider v0.0.3 for https://www.bne-portal.de