milahu / aiohttp_chromium

aiohttp-like interface to chromium. based on selenium_driverless to bypass cloudflare
MIT License

let's collaborate #1

Open milahu opened 10 months ago

milahu commented 10 months ago

@kaliiiiiiiiii this project is largely based on your Selenium-Driverless. would you be interested in collaboration? (spoiler: i will use the MIT license)

i have not yet found an actual "headful web scraper" where i can simply remote-control an actual chromium browser to allow "semi-automatic web scraping" (solving captchas, debugging error states), so i created my own : )

so far, my code is unreleased. i'm using it in my opensubtitles-scraper to bypass cloudflare

so far, my code (fetch-subs.py) is really messy, and it will need some serious refactoring from 8000 lines in one file to modules and classes

my goal is to make chromium usable just like any other http client in python, as a drop-in replacement for aiohttp

i have a working prototype for handling file downloads (and html error pages), but i guess that will be too complex / out of scope for Selenium-Driverless. see also Selenium-Driverless#140

kaliiiiiiiiii commented 10 months ago

@milahu

would you be interested in collaboration? (spoiler: i will use MIT license)

Surely, why not:)

i have a working prototype for handling file downloads (and html error pages), but i guess that will be too complex / out of scope for Selenium-Driverless. see also Selenium-Driverless#140

https://github.com/kaliiiiiiiiii/Selenium-Driverless/issues/140 will be resolved at some point for sure - and file downloading, html error pages etc. wouldn't be an issue to implement into driverless. However, I see the point of wanting to have a lightweight, stable, fast aiohttp-like browser. And ofc being completely open-source hehe. Already at this point, I see an issue about knowing when a page has completely finished loading, as pages in edge cases can for example have forever-loading iframes, iframes loaded after content-load etc. Waiting for elements etc. would then already go towards more complex automation (=> driverless, not aiohttp-like)

so far, my code (fetch-subs.py) is really messy, and it will need some serious refactoring from 8000 lines in one file to modules and classes

I assume you're talking about opensubtitles-scraper/blob/main/fetch-subs.py. So the plan is to use Pyppeteer? tbh I really wouldn't recommend using it. Puppeteer (& Pyppeteer) just isn't made to be undetectable. I'd recommend relying on bare CDP, possibly using CDP-Socket (no worries, I plan to make it GNU//MIT anyways:) )

What I have to note here tho:

  1. Currently I'm quite maxed-out & therefore won't find a lot of time in the near future. Feel free to lmk if you need anything tho.
  2. Let's keep this professional - I don't judge or discuss any political or personal stuff on here.
  3. I haven't worked with auto-generated documentation yet - might need some time getting into that // you might provide some structure on that to start from.
kaliiiiiiiiii commented 10 months ago

Or did you mean that you'd like to use driverless as a base for this project?

milahu commented 10 months ago

so far, my code is unreleased

if you want to see my current mess: milahu@gmail.com

file downloading, html error-pages etc. wouldn't be an issue to implement into driverless.

maybe...

I see an issue about knowing when a page has completely finished loading

more complex automation (=> driverless, not aiohttp-like)

true, this would be more than a stupid http client

it's a challenge to reduce this complexity into a few lines of code

the http client would need some model of the http server to predict possible responses

the goal is to autosolve complex challenges

and also to automate pagination (or infinite scroll)

i'm pretty sure that something like this exists somewhere... for example, apify has a similar goal: to translate html responses to json responses. or "web archive" services will have such challenge-solvers, to "click through" to the content

kaliiiiiiiiii commented 10 months ago

if you want to see my current mess: milahu@gmail.com

I'll send you an E-Mail.

translate html responses to json responses

Huh, how'd you wanna do that? Run some model on it? Maintain it for current frameworks//antibots//patterns?

I suppose that some basic wait-for-content-load with bare CDP is a start at some point. Considerations might be (a rough sketch of 1. follows after the list):

  1. wait for regex html match
  2. wait for regex url match (redirect-url support)
  3. wait for iframes matching any of the above-mentioned conditions (optionally)
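A rough sketch of condition 1, with plain asyncio (get_html here is a placeholder for whatever coroutine returns the current page HTML over CDP):

import asyncio
import re

async def wait_for_html_match(get_html, pattern, timeout=30, interval=0.5):
    # poll the current page HTML until it matches the regex,
    # or give up after `timeout` seconds
    regex = re.compile(pattern)
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        html = await get_html()
        if regex.search(html):
            return html
        if asyncio.get_running_loop().time() > deadline:
            raise TimeoutError(f"no match for {pattern!r} after {timeout}s")
        await asyncio.sleep(interval)

Conditions 2 and 3 would be the same loop, run over the current URL and over the iframe documents.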
milahu commented 10 months ago

the http client would need some model of the http server

the user would have to provide that model of the http server with all the if/then/else/match/retry/... logic
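for example, the "model" could just be a user-written coroutine around the session (only a sketch of the idea, not an existing API):

import re

async def fetch_with_model(session, url, max_retries=5):
    # user-provided "model of the http server":
    # if/then/else/match/retry logic around one logical request
    for attempt in range(max_retries):
        async with session.get(url) as response:
            html = await response.text()
            if re.search(r"Checking your browser|cf-challenge", html):
                # looks like a challenge page: retry (or ask the user)
                continue
            if response.status == 200:
                return html
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")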

milahu commented 9 months ago

im pretty sure that something like this exists somewhere...

yepp, i have reinvented botasaurus

kaliiiiiiiiii commented 9 months ago

im pretty sure that something like this exists somewhere...

yepp, i have reinvented botasaurus

botasaurus uses selenium internally. Also, JavaScript execution such as https://github.com/omkarcloud/botasaurus/blob/dba618c26da74263cc4af33a13faf41cb7a30ae3/botasaurus/anti_detect_driver.py#L216 for sure is detectable.

kaliiiiiiiiii commented 9 months ago

had a brief look into the code you've got so far. Stuff I notice here:

  1. UBlock Extension to my knowledge is pretty invasive (js execution, network blocking, etc.) and I'm pretty sure it's detectable. Therefore I'd not recommend adding it by default (if that is the case?)
  2. I like the arguments & preferences you've added. Might have a closer look at them for my own usage as well.
  3. You might consider support for contexts (incognito). When doing multiple requests in the same context, all cookies will be shared. Also, passing extra headers might be a nice feature.
  4. For supporting streaming, you might consider using Network.takeResponseBodyForInterceptionAsStream
milahu commented 9 months ago

i have reinvented botasaurus

actually no, aiohttp_chromium is more low-level than botasaurus. aiohttp_chromium is really just a drop-in replacement for aiohttp, and selenium features are hidden under response._driver

also, botasaurus fails to run on my nixos machine, see https://github.com/omkarcloud/botasaurus/issues/40. meanwhile, aiohttp_chromium just works, also because selenium_driverless is a pure-python library: no webdriver, and no node process to eval javascript

passing extra headers might be a nice feature

the goal is to autosolve complex challenges

"middlewares" is the term i was looking for, to intercept and modify requests and responses

scrapy also has support for middlewares, but scrapy is too high-level for my taste, similar to botasaurus

the aiohttp client only has support for passive tracing, but since it's just a dumb http client, where one request gives only one response (plus redirects), such request/response interception is not needed there
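for reference, passive tracing in plain aiohttp looks roughly like this:

import asyncio
import aiohttp

async def on_request_start(session, trace_config_ctx, params):
    # called before every request; observe only, no way to modify it
    print("starting request:", params.url)

async def main():
    trace_config = aiohttp.TraceConfig()
    trace_config.on_request_start.append(on_request_start)
    async with aiohttp.ClientSession(trace_configs=[trace_config]) as session:
        async with session.get("http://httpbin.org/get") as response:
            await response.text()

asyncio.run(main())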

the aiohttp server has support for middlewares

A middleware is a coroutine that can modify either the request or response

Every middleware should accept two parameters, a request instance and a handler, and return the response or raise an exception

in aiohttp_chromium, this could look like:

import asyncio

import aiohttp_chromium as aiohttp

async def main():

    async with aiohttp.ClientSession() as session:

        async def middleware_1(request, handler):
            print("middleware_1")
            request.headers["test"] = "hello"
            request.cookies["some_key"] = "some value"
            # send request, get response
            response = await handler(request)
            response.text = response.text + ' wink'
            return response

        args = dict(
            _middlewares=[
                middleware_1,
            ]
        )

        url = "http://httpbin.org/get"

        async with session.get(url, **args) as response:
            print(response.status)
            print(await response.text())

asyncio.run(main())

this is also useful to block requests. example: don't load images / styles / scripts / ads / ...
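continuing the hypothetical middleware API sketched above, a request-blocking middleware could look roughly like this (just a sketch; how a blocked request should be reported back to the caller is still an open question):

BLOCKED_SUFFIXES = (".png", ".jpg", ".gif", ".webp", ".css", ".js")

async def middleware_block_assets(request, handler):
    # short-circuit requests for images / styles / scripts
    # (assuming request.url is a plain string here)
    if request.url.endswith(BLOCKED_SUFFIXES):
        raise Exception(f"blocked request: {request.url}")
    # everything else goes to chromium as usual
    return await handler(request)

and then pass it via the same hypothetical _middlewares=[middleware_block_assets] argument as above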

UBlock Extension to my knowledge is pretty invasive

the ads on many websites are "pretty invasive" too. many normal browsers have ublock, so that's no sign of a bot

I like the arguments & preferences you've added

i'm surprised that i have to add --enable-features=WebContentsForceDark to actually enable dark mode for websites. otherwise, only the chromium UI is dark, and websites are light. i would call this a chromium bug, but probably it's "default off" for better performance

my self._chromium_config has some "reasonable defaults". _chromium_config will be exposed in the session constructor:

    args = dict(
        _chromium_config = {
            "bookmark_bar": {
                # disable bookmarks bar
                "show_on_all_tabs": False,
            },
        },
    )
    async with aiohttp.ClientSession(**args) as session:

generally, all chromium options will be exposed because different users have different needs

You might consider support for context

with aiohttp i would create multiple sessions with different cookie_jar, different request headers, ...
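with plain aiohttp, that would be something like this (real aiohttp API; how aiohttp_chromium would map each session to its own browser state is a separate question):

import aiohttp

async def main():
    # two isolated "identities": separate cookie jars and default headers
    session_a = aiohttp.ClientSession(
        cookie_jar=aiohttp.CookieJar(),
        headers={"User-Agent": "client-a"},
    )
    session_b = aiohttp.ClientSession(
        cookie_jar=aiohttp.CookieJar(),
        headers={"User-Agent": "client-b"},
    )
    try:
        ...  # use session_a and session_b independently
    finally:
        await session_a.close()
        await session_b.close()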

i guess that creating an incognito window is not more efficient than starting a new chromium process

currently the start is slow, because i wait 20 seconds for the ublock update. but the start time can be reduced by using a persistent user-data-dir for chromium: one user-data-dir for every session

Network.takeResponseBodyForInterceptionAsStream

probably i will use this by default instead of Network.getResponseBody, because ahead of time, i don't know whether a response is a document or an infinite stream (long poll)
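a rough sketch of the stream reading over raw CDP (the CDP methods are real, cdp_send is a placeholder for whatever sends a CDP command and awaits the result):

import base64

async def read_intercepted_body(cdp_send, interception_id):
    # get a stream handle instead of the full body
    result = await cdp_send(
        "Network.takeResponseBodyForInterceptionAsStream",
        {"interceptionId": interception_id},
    )
    stream = result["stream"]
    chunks = []
    while True:
        chunk = await cdp_send("IO.read", {"handle": stream, "size": 65536})
        data = chunk["data"]
        data = base64.b64decode(data) if chunk.get("base64Encoded") else data.encode()
        chunks.append(data)
        if chunk["eof"]:
            break
    await cdp_send("IO.close", {"handle": stream})
    return b"".join(chunks)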

kaliiiiiiiiii commented 9 months ago

currently the start is slow, because i wait 20 seconds for the ublock update

Uhh, you mean fetch the extension? Each time? Pretty sure versioning should be possible to implement.

example: don't load images / styles / scripts / ads / ...

good point

Network.takeResponseBodyForInterceptionAsStream

probably i will use this by default instead of Network.getResponseBody

Yep, I'd propose that as well. Additionally, there's a maximum message size for python websockets (technically overridable).

kaliiiiiiiiii commented 9 months ago

@milahu Also, you might consider using threading at https://github.com/milahu/aiohttp_chromium/blob/fc15ea609822fceeb4976cf8eb84967b22d7d4d8/src/aiohttp_chromium/extensions.py#L121-L123, just to be safe for asyncio. This applies as well to shutil and to reading//writing files. aiofiles might be worth considering here. It's a dependency of driverless anyways.
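For example, pushing the blocking unzip into a worker thread could look roughly like this (a sketch, not the actual extensions.py code):

import asyncio
import zipfile

async def unpack_extension(zip_path, target_dir):
    # run the blocking unzip in a worker thread,
    # so it does not stall the asyncio event loop
    def _unpack():
        with zipfile.ZipFile(zip_path) as z:
            z.extractall(target_dir)
    await asyncio.to_thread(_unpack)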

milahu commented 9 months ago

Uhh, you mean fetch the extension? Each time? Pretty sure versioning should be possible to implement.

no. the extension zip is already cached to self._extensions_cache_path, which by default is $HOME/.cache/aiohttp_chromium/extensions/

what takes so long is the update of ublock, visible by the orange ublock icon. when i send requests too early, then ublock is not yet working. on update, ublock is downloading filter lists from uBlock/assets/assets.json

i have added caching of extension state in 686b19fdcf9799527c750a448b72ef30024d794e. now ublock starts in about 5 seconds (versus 30 seconds cold start)

ublock options

Storage used: 23.7 MB

115,558 network filters + 44,343 cosmetic filters

this data is stored in levelDB databases in {user_data_dir}/Default/Local Extension Settings/{ext_id}/

Suspend network activity until all filter lists are loaded

aka suspendUntilListsAreLoaded, with its default setting in js/background.js

you might consider using threading

i don't see how the unzip code could break anything. this runs sequentially, to unpack extensions to the user-data-dir

to be safe for asyncio

this could be more relevant when reading downloaded files in response.content etc. currently, this is just a "quickfix" solution, which is also not compatible with aiohttp's await response.content.read(), because currently response.content.read is a sync method

https://github.com/milahu/aiohttp_chromium/blob/fc15ea609822fceeb4976cf8eb84967b22d7d4d8/src/aiohttp_chromium/client.py#L411-L429
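an async variant could be a thin wrapper around aiofiles (just a sketch, the class name is made up):

import aiofiles

class FileContent:
    # minimal async wrapper around a downloaded file,
    # so that `await response.content.read()` works like in aiohttp
    def __init__(self, path):
        self._path = path

    async def read(self):
        async with aiofiles.open(self._path, "rb") as f:
            return await f.read()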

kaliiiiiiiiii commented 9 months ago

i don't see how the unzip code could break anything. this runs sequentially, to unpack extensions to the user-data-dir

For cases where the disk is slow and multiple Chrome instances are started, I suppose this could lead to long-blocking coroutines for asyncio.

milahu commented 9 months ago

sounds like low-priority stuff. having write-locks for saving extension state would be more important, or having atomic writes when moving downloaded files from tmpfs to disk

meanwhile, scraper goes brrr ; )

but opensubtitles.org is easy to scrape... currently i'm handling 2K requests per day

[screenshot: Screenshot_20240118_004624]

ZakariaMQ commented 6 months ago

@kaliiiiiiiiii @milahu just a little addition from me: the new headless mode of puppeteer has nearly the same fingerprint as a regular Chrome browser. I was able to pass so many cloudflare-protected websites with it, and "Antoine Vastel, PhD, Head of Research at DataDome" admitted it in his last interview

adding some custom patches + some mouse movements to puppeteer can be a game changer

also, there is a new protocol besides CDP, WebDriver BiDi, introduced by Google. more info here: https://developer.chrome.com/blog/webdriver-bidi/

milahu commented 6 months ago

WebDriver BiDi

thanks for sharing the good news

WebDriver BiDi promises bi-directional communication, making it fast by default, and it comes packed with low-level control.

i hope they will finally implement hooking into http streams, so we can use chromium as a full http client. see also https://github.com/kaliiiiiiiiii/Selenium-Driverless/issues/123#issuecomment-1912534536


the new headless mode of puppeteer

i still prefer a headful chromium browser, running on my desktop machine, which allows (in theory) semi-automatic scraping: asking the user to solve captchas

i tried to run chromium in an xvnc server, to only show it when needed, but chromium in xvnc fails to bypass cloudflare somehow. the rendering in xvnc is slower than on the main desktop. i guess cloudflare wants to block exactly this use case