Closed tbrodbeck closed 1 year ago
Yes, I can confirm that. Initially the link format was fixed, with the offer id at the end. Now they also include the name of the offer, which differs from offer to offer. The solution will be to find the "NACHRICHT SENDEN" button in the page source and click on that directly. There are already some examples in the code of how to find a button. If you could follow the example and fix it, that would be great.
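The project itself uses Selenium to click the button, but as a rough, stdlib-only sketch of how one might locate a "NACHRICHT SENDEN" link in the page source (the markup here is an assumption; the real page may use a different tag or class, so check the live source):

```python
from html.parser import HTMLParser

class SendButtonFinder(HTMLParser):
    """Collect hrefs of anchors whose text contains 'NACHRICHT SENDEN'.
    Illustrative only -- the real wg-gesucht markup may differ."""

    def __init__(self):
        super().__init__()
        self._in_a = False
        self._href = None
        self._text = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_a = True
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_a:
            # case-insensitive match on the button label
            if 'NACHRICHT SENDEN' in ''.join(self._text).upper():
                self.matches.append(self._href)
            self._in_a = False

finder = SendButtonFinder()
finder.feed('<a href="/nachricht-senden/wg-zimmer-in-Berlin-Mitte.8098905.html">'
            'Nachricht senden</a>')
print(finder.matches)
```

With Selenium the analogous step would be finding the element by its link text and calling `.click()` on it.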
Alright! I just looked into the code. Can you maybe tell me how to access the URLs of the offers? I think the scraper only returns the IDs; I am not familiar with Scrapy.
replace this line in wg-gesucht-spider.py
for quote in response.css('div.offer_list_item::attr(data-id)').extract():
with
for quote in response.css('h3.truncate_title a::attr(href)').extract():
you'll have all the URLs of the offers. But please pay attention to the following two points:
for quote in response.css('h3.truncate_title a::attr(href)').extract():
    # skip the sponsored airbnb link that appears in the listing
    if 'airbnb' in quote:
        continue
    yield {
        "data-id": quote
    }
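To make the airbnb check easy to test in isolation, the same filter can be written as a standalone function (plain Python, no Scrapy needed; the function name is mine, not from the repo):

```python
def filter_offer_links(hrefs):
    """Keep only genuine wg-gesucht offer links, dropping the sponsored
    airbnb redirect that shows up among the listing results."""
    return [href for href in hrefs if 'airbnb' not in href]

links = [
    'https://airbnb.pvxt.net/c/1216694/264339/4273',
    'wg-zimmer-in-Berlin-Dahlem.4044959.html',
]
print(filter_offer_links(links))
# -> ['wg-zimmer-in-Berlin-Dahlem.4044959.html']
```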
driver.get('https://www.wg-gesucht.de/nachricht-senden.html?message_ad_id='+ref)
with
driver.get('https://www.wg-gesucht.de/nachricht-senden/'+ref)
will take you directly to the message-sending page. The ref variable contains the URL you get from the scraper, and it looks like this: 1-zimmer-wohnungen-in-Stuttgart-Bad-Cannstatt.8106474.html
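Putting the two pieces together, a small helper (my naming, not from the repo) that turns a scraped offer href into the message-sending URL could look like:

```python
BASE = 'https://www.wg-gesucht.de/nachricht-senden/'

def build_message_url(ref):
    """ref is the relative offer href scraped from the listing page,
    e.g. '1-zimmer-wohnungen-in-Stuttgart-Bad-Cannstatt.8106474.html'."""
    return BASE + ref

print(build_message_url('1-zimmer-wohnungen-in-Stuttgart-Bad-Cannstatt.8106474.html'))
```

The resulting URL is what you would pass to driver.get() in the Selenium part of the script.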
Good luck hacking. If you need more help, don't hesitate to ask me. I will be happy to see you make this thing work again.
Thank you! That was already really helpful. After fiddling around with Selenium a bit, I got it working again!
But there is still one odd bug: Scrapy does not scrape the correct webpage, and I am not sure why. It applies the filters correctly (e.g. https://www.wg-gesucht.de/wg-zimmer-in-Berlin.8.0.1.0.html?user_filter_id=3881821&offer_filter=1&city_id=8&noDeact=1&sMin=15&wgSea=2&wgAge=28&img_only=1&ot=85079%2C163&categories%5B0%5D=0&rent_types%5B0%5D=2), but it does not select the correct location, so I get results spread all over Berlin, including regions I had filtered out. I could reproduce this issue (irregularly and seldom) by opening the filtered link in a private window.
Maybe you have an idea why that happens and what I could look into?
Have you compared the actual results on the webpage with the results Scrapy got? Maybe it is not a problem on the Scrapy side but rather with the website itself?
Also, did you update the link inside the spider, which should be under your working directory?
Can you please try to press reload in your browser and tell me if the website changes?
I tried, but all I see are the same offers from Berlin Mitte. What should I expect?
Oh okay, then the filter actually works in this case (I filtered for Mitte and FHain, XBerg).
The issue that sometimes happens (and I think it happens to Scrapy every time) is that the location filter is not loaded correctly. After I simply refreshed the page on my iPad, the "STADTTEILE" filter was loaded:
Interesting, could you make a pull request so that I can merge your code? Then I can look into it by running the script.
I just reproduced it with Beautiful Soup:
import bs4
import requests

baseUrl = 'https://www.wg-gesucht.de/wg-zimmer-in-Berlin.8.0.1.0.html?user_filter_id=3881821&offer_filter=1&city_id=8&noDeact=1&sMin=15&wgSea=2&wgAge=28&img_only=1&ot=85079%2C163&categories%5B0%5D=0&rent_types%5B0%5D=2'
page = requests.get(baseUrl)
soup = bs4.BeautifulSoup(page.content, 'html.parser')
for h3 in soup.find_all('h3', class_='truncate_title'):
    for a in h3.find_all('a'):
        print(a['href'])
https://airbnb.pvxt.net/c/1216694/264339/4273?u=www.airbnb.de/s/Berlin/homes&p.checkin=2020-08-01&p.checkout=2020-08-31&sharedid=notemp_Berlin_1_desk¶m1=de_wg_4
wg-zimmer-in-Berlin-Dahlem.4044959.html
wg-zimmer-in-Berlin-Pankow.7771731.html
wg-zimmer-in-Berlin-Mitte.8098905.html
wg-zimmer-in-Berlin-Neukoelln.5373875.html
wg-zimmer-in-Berlin-Koepenick.4691030.html
wg-zimmer-in-Berlin-Charlottenburg.8123501.html
wg-zimmer-in-Berlin-Charlottenburg.8110089.html
wg-zimmer-in-Berlin-Charlottenburg.8095287.html
wg-zimmer-in-Berlin-Zehlendorf.7384968.html
wg-zimmer-in-Berlin-Friedrichshain-Kreuzberg.8127431.html
wg-zimmer-in-Berlin-Lichtenberg.8107841.html
wg-zimmer-in-Berlin-MITTE.6126245.html
wg-zimmer-in-Berlin-Neukoelln.6365392.html
wg-zimmer-in-Berlin-Neukoelln.8042369.html
wg-zimmer-in-Berlin-Neukoelln.4934132.html
wg-zimmer-in-Berlin-Friedrichshain.8122261.html
wg-zimmer-in-Berlin-Mitte.8130226.html
wg-zimmer-in-Berlin-Adlershof.5626460.html
wg-zimmer-in-Berlin-Friedrichshain.3514132.html
wg-zimmer-in-Berlin-Zehlendorf.8127837.html
I think they have changed the link structure (it now includes the title string of the offer, as seen on the left side of each link). If someone could confirm that, I might implement a fix for this!
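If the numeric offer id is still needed anywhere, it can be recovered from the new link format with a regex. A sketch, assuming the id is always the dotted number right before ".html" (which the output above suggests, but I haven't verified against every offer type):

```python
import re

def extract_offer_id(href):
    """Return the numeric offer id from links like
    'wg-zimmer-in-Berlin-Mitte.8098905.html', or None for
    non-offer links such as the airbnb ad."""
    m = re.search(r'\.(\d+)\.html$', href)
    return m.group(1) if m else None

print(extract_offer_id('wg-zimmer-in-Berlin-Mitte.8098905.html'))  # -> 8098905
```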