tilman151 / scrape-my-bike

Repository to scrape eBay Kleinanzeigen for bikes

Consider using native requests #1

Open BastelPichi opened 2 years ago

BastelPichi commented 2 years ago

Hey, awesome project! Just an idea: using native requests instead of Selenium saves lots of CPU and RAM.

eBay Kleinanzeigen is rather unprotected: there's only a user agent filter, which can be bypassed easily. (If your user agent is on the blacklist, the site shows a message saying your IP range has been banned, so don't let that confuse you.) That way it's easy to access the page, even without JS capabilities.

Small concept below:

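import requests
from bs4 import BeautifulSoup

# Any user agent that is not on the site's blacklist should work here.
headers = {
    "User-Agent": "Opera/8.96 (Windows NT 5.01; sl-SI) Presto/2.9.199 Version/11.00"
}

# Fetch the raw search results page; no JavaScript is executed.
url = "https://www.ebay-kleinanzeigen.de/s-berlin/fahrrad/k0l3331"
r = requests.get(url, headers=headers, timeout=10)

soup = BeautifulSoup(r.text, "html.parser")

# Each search result is an element with the class "lazyload-item".
for item in soup.find_all(attrs={"class": "lazyload-item"}):
    print(item)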

In the HTML you can also find things such as the description, and, via the element's class, detect whether an item is an ad or a top listing.
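A rough sketch of what that extraction could look like, run against made-up markup so it works offline. Only the `lazyload-item` class comes from the snippet above; `is-topad`, `title` and `description` are hypothetical class names, so check the real page source before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only "lazyload-item" is taken from the concept above.
# "is-topad", "title" and "description" are assumed names for illustration.
SAMPLE_HTML = """
<ul>
  <li class="lazyload-item is-topad">
    <a class="title">Trekking bike 28 inch</a>
    <p class="description">Well maintained, new tires.</p>
  </li>
  <li class="lazyload-item">
    <a class="title">Kids bike 20 inch</a>
    <p class="description">Some scratches, rides fine.</p>
  </li>
</ul>
"""


def text_of(item, css_class):
    """Return the stripped text of the first descendant with css_class, or None."""
    node = item.find(class_=css_class)
    return node.get_text(strip=True) if node else None


def parse_items(html):
    """Turn every search result element into a small dict."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.find_all(class_="lazyload-item"):
        # BeautifulSoup exposes the class attribute as a list of class names.
        classes = item.get("class", [])
        results.append({
            "title": text_of(item, "title"),
            "description": text_of(item, "description"),
            "promoted": "is-topad" in classes,  # assumed marker class
        })
    return results


listings = parse_items(SAMPLE_HTML)
for listing in listings:
    print(listing)
```

To use it on the live page, `parse_items(r.text)` should be enough once the class names are adjusted to the real ones.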

I can open a pull request tomorrow (maybe).

tilman151 commented 2 years ago

Sounds nice. I talked with some folks at Kleinanzeigen already and they confirmed that their API is unprotected. Unfortunately, I never found the time to look into it. Selenium was necessary, as some stuff, like cookie banners, was JS and had to be dismissed.

A PR would be awesome, especially as the Selenium solution is incredibly brittle.

BastelPichi commented 2 years ago

> Sounds nice. I talked with some folks at Kleinanzeigen already and they confirmed that their API is unprotected. Unfortunately, I never found the time to look into it. Selenium was necessary, as some stuff, like cookie banners, was JS and had to be dismissed.

Well, it's not using the API, just getting the page without executing JS. Since the cookie banner is opened by JS, scrolling is blocked by JS, and the articles are already loaded in the background, there's actually no need to execute it in the first place.
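To illustrate the point, here is a tiny offline sketch with an invented page. The banner script and the `cookie-banner` id are assumptions for the demo, not the site's real code. Because the parser never runs the script, the banner never exists, while the listings are already there as plain HTML.

```python
from bs4 import BeautifulSoup

# Invented page: the banner exists only as JavaScript source,
# the listings are ordinary HTML elements.
STATIC_HTML = """
<html><body>
  <script>
    // hypothetical banner code; it never runs when the page is only parsed
    var banner = document.createElement("div");
    banner.id = "cookie-banner";
    document.body.appendChild(banner);
    document.body.style.overflow = "hidden";
  </script>
  <article class="lazyload-item">Road bike</article>
  <article class="lazyload-item">City bike</article>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The script is treated as inert text, so no banner element is ever created.
banner = soup.find(id="cookie-banner")

# The listings are part of the static markup and can be read directly.
items = soup.find_all(class_="lazyload-item")

print("banner present:", banner is not None)
print("listings found:", len(items))
```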

However, I'll also sniff the HTTP traffic from the app today and try to reverse engineer the actual API.

BastelPichi commented 2 years ago

Small update: Got the API to work, just gotta check if the token is fixed in the app, or unique for every installation.

BastelPichi commented 2 years ago

Cracked the API, should be able to commit tomorrow. (The API is really great.)

tilman151 commented 2 years ago

Great, looking forward to reading your PR 😁

tilman151 commented 2 years ago

So, how is it going?

BastelPichi commented 2 years ago

I've been rather busy with other stuff, but here's what I have for now. It should be rather easy to integrate into the current system. I've added lots of comments...

https://gist.github.com/BastelPichi/43e441f166fcd6a4c76f875dcbb91d5c