milahu opened this issue 10 months ago
@milahu
would you be interested in collaboration? (spoiler: i will use MIT license)
Surely, why not:)
i have a working prototype for handling file downloads (and html error pages), but i guess that will be too complex / out of scope for Selenium-Driverless. see also Selenium-Driverless#140
https://github.com/kaliiiiiiiiii/Selenium-Driverless/issues/140 will be resolved at some point for sure - and file downloading, html error-pages etc. wouldn't be an issue to implement into driverless. However, I see the point of wanting a lightweight, stable, fast aiohttp-like browser. And ofc being completely open-source hehe. Already at this point, I see an issue about knowing when a page has finished loading completely, as pages in edge cases can for example have forever-loading iframes, iframes loaded after content load, etc. Waiting for elements etc. would then already go towards more complex automation (=> driverless, not aiohttp-like).
so far, my code (fetch-subs.py) is really messy and it will need some serious refactoring, from 8000 lines in one file to modules and classes
I assume you're talking about opensubtitles-scraper/blob/main/fetch-subs.py. So the plan is to use Pyppeteer? Tbh, I really wouldn't recommend that. Puppeteer (& Pyppeteer) just isn't made to be undetectable. I'd recommend relying on bare CDP, possibly using CDP-Socket (no worries, I plan to make it GNU//MIT anyways:) )
What I have to note here tho:
Or did you mean that you'd like to use driverless as a base for this project?
so far, my code is unreleased
if you want to see my current mess: milahu@gmail.com
file downloading, html error-pages etc. wouldn't be an issue to implement into driverless.
maybe...
I see an issue about knowing when a page has finished loading completely
more complex automation (=> driverless, not aiohttp-like)
true, this would be more than a stupid http client
it's a challenge to reduce this complexity into a few lines of code
the http client would need some model of the http server to predict possible responses
the goal is to autosolve complex challenges
also automate pagination (or infinite scroll)
im pretty sure that something like this exists somewhere... for example, apify has a similar goal, to translate html responses to json responses, and "web archive" services will have such challenge-solvers to "click through" to the content
if you want to see my current mess: milahu@gmail.com
I'll send you an E-Mail.
translate html responses to json responses
Huh, how'd you wanna do that? Run some model on it? Maintain it for current frameworks//antibots//patterns?
I suppose some basic wait-for-content-load with bare CDP is a start at some point. Considerations might be:
the http client would need some model of the http server
the user would have to provide that model of the http server with all the if/then/else/match/retry/... logic
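For illustration, such a user-provided server model could be a plain callback that holds the if/then/else/match/retry logic; every name below (`server_model`, the returned action tuples) is hypothetical and not an existing aiohttp_chromium API:

```python
# hypothetical sketch of a user-provided "model of the http server":
# a callback that inspects each response and tells the client what to do next.
# none of these names exist in aiohttp_chromium; they only illustrate the idea.
async def server_model(response):
    if response.status == 429:
        # rate limited: retry later
        return ("retry", {"delay": 60})
    if response.status == 403 and "challenge" in await response.text():
        # challenge page: wait until the challenge is solved, then retry
        return ("wait_for_challenge", {})
    if response.status == 200:
        return ("done", {})
    return ("fail", {})
```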
im pretty sure that something like this exists somewhere...
yepp, i have reinvented botasaurus
im pretty sure that something like this exists somewhere...
yepp, i have reinvented botasaurus
botasaurus uses selenium internally. Also, JavaScript execution such as https://github.com/omkarcloud/botasaurus/blob/dba618c26da74263cc4af33a13faf41cb7a30ae3/botasaurus/anti_detect_driver.py#L216 is for sure detectable.
I took a quick look at the code you've got so far. Stuff I notice here:
Network.takeResponseBodyForInterceptionAsStream
i have reinvented botasaurus
actually no, `aiohttp_chromium` is more low-level than botasaurus
`aiohttp_chromium` is really just a drop-in replacement for `aiohttp`, and selenium features are hidden under `response._driver`
also, botasaurus fails to run on my nixos machine, see https://github.com/omkarcloud/botasaurus/issues/40
meanwhile, `aiohttp_chromium` just works, also because `selenium_driverless` is a pure-python library, so no `webdriver`, and no `node` process to eval javascript
passing extra headers might be a nice feature
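For context, with the aiohttp client extra headers can be passed per-session or per-request; a drop-in replacement would presumably accept the same parameters (sketch, not confirmed for aiohttp_chromium):

```python
import asyncio

import aiohttp_chromium as aiohttp  # assumed to mirror the aiohttp API


async def main():
    # session-wide default headers, as in aiohttp
    async with aiohttp.ClientSession(headers={"Accept-Language": "en"}) as session:
        # per-request extra headers, as in aiohttp
        headers = {"Referer": "https://example.com/"}
        async with session.get("http://httpbin.org/headers", headers=headers) as response:
            print(await response.text())


asyncio.run(main())
```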
the goal is to autosolve complex challenges
`middlewares` is the term i was looking for, to intercept and modify requests and responses
also, scrapy has support for middlewares, but scrapy is too high-level for my taste, similar to botasaurus
the aiohttp client has only support for passive tracing, but since it's just a dumb http client, where one request gives only one response (plus redirects), such request/response interception is not needed
the aiohttp server has support for middlewares:
"A middleware is a coroutine that can modify either the request or response. Every middleware should accept two parameters, a request instance and a handler, and return the response or raise an exception."
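For reference, an aiohttp server middleware looks roughly like this (a minimal sketch following the aiohttp docs):

```python
from aiohttp import web


@web.middleware
async def example_middleware(request, handler):
    # runs before the handler
    response = await handler(request)
    # runs after the handler
    response.headers["X-Example"] = "1"
    return response


app = web.Application(middlewares=[example_middleware])
```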
in aiohttp_chromium this could look like:
```python
import asyncio

import aiohttp_chromium as aiohttp


async def main():
    async with aiohttp.ClientSession() as session:

        async def middleware_1(request, handler):
            print("middleware_1")
            # modify the request before it is sent
            request.headers["test"] = "hello"
            request.cookies["some_key"] = "some value"
            # send request, get response
            response = await handler(request)
            # modify the response before it is returned
            response.text = response.text + ' wink'
            return response

        args = dict(
            _middlewares=[
                middleware_1,
            ],
        )

        url = "http://httpbin.org/get"
        async with session.get(url, **args) as response:
            print(response.status)
            print(await response.text())


asyncio.run(main())
```
this is also useful to block requests, for example: don't load images / styles / scripts / ads / ...
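A request-blocking middleware could then look something like this (hypothetical sketch on top of the proposed middleware API above; `request.url` and the "return None to block" convention are assumptions):

```python
# hypothetical sketch: block unwanted requests with the proposed middleware API.
# request.url and "return None to block" are assumptions, not an existing API.
BLOCKED_SUFFIXES = (".png", ".jpg", ".gif", ".webp", ".css", ".woff2")


async def block_static_files(request, handler):
    if request.url.endswith(BLOCKED_SUFFIXES):
        # don't send the request at all
        return None
    return await handler(request)
```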
The uBlock extension, to my knowledge, is pretty invasive
the ads on many websites are "pretty invasive" too. many normal browsers have ublock, so that's no sign of a bot
I like the arguments & preferences you've added
im surprised that i have to add `--enable-features=WebContentsForceDark` to actually enable dark mode for websites
otherwise only the chromium UI is dark, and websites are light
i would call this a chromium bug, but probably it's "default off" for better performance
my `self._chromium_config` has some "reasonable defaults"
`_chromium_config` will be exposed in the session constructor:
```python
args = dict(
    _chromium_config = {
        "bookmark_bar": {
            # disable bookmarks bar
            "show_on_all_tabs": False,
        },
    },
)

async with aiohttp.ClientSession(**args) as session:
```
generally, all chromium options will be exposed because different users have different needs
You might consider support for context
with aiohttp i would create multiple sessions with different `cookie_jar`, different request headers, ...
i guess that creating an incognito window is not more efficient than starting a new chromium process
currently the start is slow, because i wait 20 seconds for ublock update
but the start time can be reduced by using a persistent user-data-dir for chromium, one user-data-dir for every session
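With plain aiohttp, multiple independent sessions look like this; aiohttp_chromium could mirror that, with a (hypothetical) per-session user-data-dir on top:

```python
import asyncio

import aiohttp


async def main():
    # two independent sessions with separate cookie jars and default headers
    async with aiohttp.ClientSession(
        cookie_jar=aiohttp.CookieJar(), headers={"User-Agent": "client-a"}
    ) as session_a, aiohttp.ClientSession(
        cookie_jar=aiohttp.CookieJar(), headers={"User-Agent": "client-b"}
    ) as session_b:
        async with session_a.get("http://httpbin.org/cookies") as response:
            print(await response.text())
        async with session_b.get("http://httpbin.org/cookies") as response:
            print(await response.text())


asyncio.run(main())
```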
Network.takeResponseBodyForInterceptionAsStream
probably i will use this by default instead of Network.getResponseBody, because ahead of time i dont know whether a response is a document or an infinite stream (long poll)
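A rough sketch of the stream-based flow over raw CDP; the `cdp` helper here is just a placeholder for an async "send CDP command" function (e.g. via cdp-socket), not an aiohttp_chromium API:

```python
# rough sketch of reading an intercepted response body as a stream via CDP.
# `cdp` is a placeholder for an async "send command, await result" helper.
async def read_intercepted_body(cdp, interception_id, chunk_size=1 << 20):
    result = await cdp("Network.takeResponseBodyForInterceptionAsStream",
                       {"interceptionId": interception_id})
    stream = result["stream"]
    chunks = []
    while True:
        part = await cdp("IO.read", {"handle": stream, "size": chunk_size})
        chunks.append(part["data"])  # may be base64-encoded, see part["base64Encoded"]
        if part["eof"]:
            break
    await cdp("IO.close", {"handle": stream})
    return chunks
```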
currently the start is slow, because i wait 20 seconds for ublock update
Uhh, you mean fetch the extension? Each time? Pretty sure versioning should be possible to implement.
example: don't load images / styles / scripts / ads / ...
good point
Network.takeResponseBodyForInterceptionAsStream
probably i will use this by default instead of Network.getResponseBody
Yep, I'd propose that as well. Additionally, there's a maximum message size for python websockets (technically overridable).
@milahu Also, you might consider using threading at:
https://github.com/milahu/aiohttp_chromium/blob/fc15ea609822fceeb4976cf8eb84967b22d7d4d8/src/aiohttp_chromium/extensions.py#L121-L123
just to be safe for asyncio.
This applies as well to `shutil` and reading//writing files. `aiofiles` might be worth considering here; it's a dependency for driverless anyways.
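For example, the blocking unzip could be pushed off the event loop with asyncio.to_thread, and file reads done with aiofiles; a sketch of the general pattern, not the actual aiohttp_chromium code:

```python
import asyncio
import shutil

import aiofiles  # third-party; already a driverless dependency


async def unpack_extension(zip_path, target_dir):
    # run the blocking unzip in a worker thread so the event loop stays responsive
    await asyncio.to_thread(shutil.unpack_archive, zip_path, target_dir, "zip")


async def read_file(path):
    # non-blocking file read via aiofiles
    async with aiofiles.open(path, "rb") as f:
        return await f.read()
```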
Uhh, you mean fetch the extension? Each time? Pretty sure versioning should be possible to implement.
no. the extension zip is already cached to `self._extensions_cache_path`, which by default is `$HOME/.cache/aiohttp_chromium/extensions/`
what takes so long is the update of ublock, visible by the orange ublock icon
when i send requests too early, ublock is not yet working
on update, ublock is downloading filter lists from uBlock/assets/assets.json
i have added caching of the extensions state in 686b19fdcf9799527c750a448b72ef30024d794e. now ublock starts in about 5 seconds (versus 30 seconds cold start)
ublock options
Storage used: 23.7 MB
115,558 network filters + 44,343 cosmetic filters
this data is stored in levelDB databases in `{user_data_dir}/Default/Local Extension Settings/{ext_id}/`
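So caching the extension state presumably comes down to copying that levelDB directory between a cache location and a fresh user-data-dir; a rough sketch (paths and function name are illustrative, not the actual code in 686b19f):

```python
import os
import shutil


def restore_extension_state(cache_dir, user_data_dir, ext_id):
    # illustrative sketch: copy the cached levelDB directory into a fresh
    # user-data-dir, so ublock does not re-download its filter lists on start
    src = os.path.join(cache_dir, ext_id)
    dst = os.path.join(user_data_dir, "Default", "Local Extension Settings", ext_id)
    if os.path.isdir(src):
        shutil.copytree(src, dst, dirs_exist_ok=True)
```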
the ublock option "Suspend network activity until all filter lists are loaded" (aka `suspendUntilListsAreLoaded`) has its default setting in js/background.js
you might consider using threading
i dont see how the unzip code could break anything. this runs sequentially to unpack extensions to the user-data-dir
to be safe for asyncio
this could be more relevant when reading downloaded files in `response.content` etc.
currently, this is just a "quickfix" solution, which is also not compatible with aiohttp's `await response.content.read()`, because currently `response.content.read` is a sync method
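One way to make that aiohttp-compatible could be a thin async wrapper around the downloaded file; a sketch (the `_filepath` attribute is an assumption, not the actual implementation):

```python
import asyncio


class ContentReader:
    # sketch of an aiohttp-compatible response.content with an async read();
    # the _filepath attribute is an assumption, not the actual implementation
    def __init__(self, filepath):
        self._filepath = filepath

    def _read_blocking(self):
        with open(self._filepath, "rb") as f:
            return f.read()

    async def read(self):
        # offload the blocking file read to a worker thread
        return await asyncio.to_thread(self._read_blocking)
```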
i dont see how the unzip code could break anything. this runs sequentially to unpack extensions to the user-data-dir
For cases where the disk is slow and multiple Chrome instances are started, I suppose this could lead to long-blocking coroutines for asyncio.
sounds like low-priority stuff. having write-locks for saving the extensions state would be more important, or having atomic writes when moving downloaded files from tmpfs to disk
meanwhile, scraper goes brrr ; )
but opensubtitles.org is easy to scrape... currently im handling 2K requests per day
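Regarding the atomic writes mentioned above: the usual pattern is to copy to a temporary name on the destination filesystem and then rename, since the rename itself is atomic; a sketch, not the actual implementation:

```python
import os
import shutil
import tempfile


def move_atomic(src, dst):
    # sketch: move a downloaded file from tmpfs to disk so that dst never
    # exists in a half-written state
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dst), prefix=".download-")
    os.close(fd)
    shutil.copyfile(src, tmp_path)  # copy across filesystems (tmpfs -> disk)
    os.replace(tmp_path, dst)       # atomic rename on the destination filesystem
    os.remove(src)                  # clean up the tmpfs copy
```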
@kaliiiiiiiiii @milahu just a little addition from me: the new headless mode of puppeteer has nearly the same fingerprint as a real Chrome browser. I was able to pass so many cloudflare-protected websites with it, and "Antoine Vastel, PhD, Head of Research at DataDome" admits it in his last interview.
adding some custom patches + some mouse movements to puppeteer can be a game changer
also, there is a new protocol besides CDP, WebDriver BiDi, introduced by Google. more info here: https://developer.chrome.com/blog/webdriver-bidi/
WebDriver BiDi
thanks for sharing the good news
WebDriver BiDi promises bi-directional communication, making it fast by default, and it comes packed with low-level control.
i hope they will finally implement hooking into http streams, so we can use chromium as a full http client. see also https://github.com/kaliiiiiiiiii/Selenium-Driverless/issues/123#issuecomment-1912534536
the new headless mode of puppeteer
i still prefer a headful chromium browser, running on my desktop machine, which allows (in theory) semi-automatic scraping, asking the user to solve captchas
i tried to run chromium in an xvnc server, to only show it when needed, but chromium in xvnc fails to bypass cloudflare. somehow the rendering in xvnc is slower than on the main desktop. i guess cloudflare wants to block exactly this use case
@kaliiiiiiiiii this project is largely based on your Selenium-Driverless. would you be interested in collaboration? (spoiler: i will use MIT license)
i have not yet found an actual "headful web scraper" where i can simply remote-control an actual chromium browser to allow "semi-automatic web scraping" (solving captchas, debugging error states), so i created my own : )
so far, my code is unreleased. im using it in my opensubtitles-scraper to bypass cloudflare
so far, my code (fetch-subs.py) is really messy and it will need some serious refactoring, from 8000 lines in one file to modules and classes
my goal is to make chromium usable just like any other http client in python, as a drop-in replacement for aiohttp
i have a working prototype for handling file downloads (and html error pages), but i guess that will be too complex / out of scope for Selenium-Driverless. see also Selenium-Driverless#140
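For reference, the intended drop-in usage would look the same as plain aiohttp (a sketch based on the middleware example earlier in this thread):

```python
import asyncio

import aiohttp_chromium as aiohttp  # drop-in replacement for aiohttp


async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get("http://httpbin.org/get") as response:
            print(response.status)
            print(await response.text())
            # chromium/selenium features stay hidden under response._driver


asyncio.run(main())
```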