openeduhub / oeh-search-etl

The Backend includes all data for the ETL process (Scrapy, Postgres, Elasticsearch)

Crawler for BNE-Portal.de (+ more flexible playwright controls for cookies / ad-blocking) #106

Closed Criamos closed 2 months ago

Criamos commented 2 months ago

This PR includes the following changes:

Code example: using a spider's custom_settings attribute to pass cookie data and enable the ad blocker

# example from bne_portal_spider.py

# Playwright expects an array of cookies, which can be constructed as a list[dict] with "name" and "value" pairs
playwright_cookies: list[dict] = [
    {
        # "gsbbanner" is one of the two cookies that must be sent with HTTP requests
        # to skip the rendering of the (obtrusive) cookie banner on BNE-Portal.de
        "name": "gsbbanner",
        "value": "closed"
    }
]
custom_settings = {
    "PLAYWRIGHT_ADBLOCKER": True,  # enables uBlock Origin (disabled by default) within the dockerized headless browser
    "PLAYWRIGHT_COOKIES": playwright_cookies,  # makes the cookie data accessible within pipelines.py (ProcessThumbnailPipeline) for individual requests with the headless browser
}
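
How ProcessThumbnailPipeline actually reads these two settings is not shown in this description. As a rough sketch only, assuming a hypothetical helper named `resolve_playwright_options` (not part of the PR), the lookup with its defaults could look like this:

```python
from typing import Any, Optional


def resolve_playwright_options(custom_settings: Optional[dict[str, Any]]) -> tuple[bool, list[dict]]:
    """Hypothetical helper: read the two Playwright controls from a spider's custom_settings."""
    settings = custom_settings or {}
    # The ad blocker is described as disabled by default, so fall back to False.
    adblock = bool(settings.get("PLAYWRIGHT_ADBLOCKER", False))
    # Copy each cookie dict so later changes do not mutate the spider's class attribute.
    cookies = [dict(cookie) for cookie in settings.get("PLAYWRIGHT_COOKIES") or []]
    return adblock, cookies
```

A spider without any custom_settings would then resolve to no ad blocker and an empty cookie list.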

While the pipelines automatically use the provided custom_settings dict, you can also use these controls manually via the getUrlData method of our WebTools class (see: converter/web_tools.py):

from converter.web_tools import WebTools, WebEngine

playwright_cookies: list[dict] = [
    {
        "name": "gsbbanner",
        "value": "closed"
    }
]

async def parse():
    playwright_result: dict = await WebTools.getUrlData(
        url="https://example.com",
        engine=WebEngine.Playwright,
        cookies=playwright_cookies,
        adblock=True)
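
Playwright's `BrowserContext.add_cookies` expects every cookie to carry either a `url` or a `domain` and `path` pair in addition to `name` and `value`. Whether WebTools fills this in on its own is not stated here. A minimal sketch, assuming a hypothetical helper `with_cookie_scope` (not part of converter/web_tools.py), of how name/value-only cookies could be scoped to the requested URL:

```python
def with_cookie_scope(cookies: list[dict], url: str) -> list[dict]:
    """Hypothetical helper: attach `url` to cookies that have neither a url nor a domain."""
    scoped: list[dict] = []
    for cookie in cookies:
        cookie = dict(cookie)  # work on a copy, leave the caller's dict untouched
        if "url" not in cookie and "domain" not in cookie:
            cookie["url"] = url
        scoped.append(cookie)
    return scoped


playwright_cookies = [{"name": "gsbbanner", "value": "closed"}]
scoped_cookies = with_cookie_scope(playwright_cookies, "https://example.com")
```

Cookies that already declare a `url` or `domain` pass through unchanged, so explicitly scoped cookies are not overridden.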