Closed tbrodbeck closed 1 year ago
Yes, I can confirm that. Initially the link format was fixed, with the offer id at the end. Now they also include the name of the offer, which differs from offer to offer. The solution will be to find the "NACHRICHT SENDEN" button in the page source and click on that directly. There are already some examples in the code of how to find a button. If you could follow the example and fix it, that would be great.
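The project itself uses Selenium to click the button, but as a rough, stdlib-only sketch of how one might locate a "NACHRICHT SENDEN" link in the page source (the markup here is an assumption; the real page may use a different tag or class, so check the live source):

```python
from html.parser import HTMLParser

class SendButtonFinder(HTMLParser):
    """Collect hrefs of anchors whose text contains 'NACHRICHT SENDEN'.
    Illustrative only -- the real wg-gesucht markup may differ."""

    def __init__(self):
        super().__init__()
        self._in_a = False
        self._href = None
        self._text = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_a = True
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_a:
            # case-insensitive match on the button label
            if 'NACHRICHT SENDEN' in ''.join(self._text).upper():
                self.matches.append(self._href)
            self._in_a = False

finder = SendButtonFinder()
finder.feed('<a href="/nachricht-senden/wg-zimmer-in-Berlin-Mitte.8098905.html">'
            'Nachricht senden</a>')
print(finder.matches)
```

With Selenium the analogous step would be finding the element by its link text and calling `.click()` on it.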
Alright! I just looked into the code. Can you maybe tell me how to access the URLs of the offers? I think the scraper only returns the IDs; I am not familiar with Scrapy.
replace this line in wg-gesucht-spider.py
for quote in response.css('div.offer_list_item::attr(data-id)').extract():
with
for quote in response.css('h3.truncate_title a::attr(href)').extract():
you'll have all the URLs of the offers. But please pay attention to the following two points:
for quote in response.css('h3.truncate_title a::attr(href)').extract():
    # skip the sponsored airbnb link that appears in the listing
    if 'airbnb' in quote:
        continue
    yield {
        "data-id": quote
    }
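To make the airbnb check easy to test in isolation, the same filter can be written as a standalone function (plain Python, no Scrapy needed; the function name is mine, not from the repo):

```python
def filter_offer_links(hrefs):
    """Keep only genuine wg-gesucht offer links, dropping the sponsored
    airbnb redirect that shows up among the listing results."""
    return [href for href in hrefs if 'airbnb' not in href]

links = [
    'https://airbnb.pvxt.net/c/1216694/264339/4273',
    'wg-zimmer-in-Berlin-Dahlem.4044959.html',
]
print(filter_offer_links(links))
# -> ['wg-zimmer-in-Berlin-Dahlem.4044959.html']
```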
driver.get('https://www.wg-gesucht.de/nachricht-senden.html?message_ad_id='+ref)
with
driver.get('https://www.wg-gesucht.de/nachricht-senden/'+ref)
will take you directly to the message-sending page. The ref variable contains the URL you get from the scraper, and it looks like this: 1-zimmer-wohnungen-in-Stuttgart-Bad-Cannstatt.8106474.html
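Putting the two pieces together, a small helper (my naming, not from the repo) that turns a scraped offer href into the message-sending URL could look like:

```python
BASE = 'https://www.wg-gesucht.de/nachricht-senden/'

def build_message_url(ref):
    """ref is the relative offer href scraped from the listing page,
    e.g. '1-zimmer-wohnungen-in-Stuttgart-Bad-Cannstatt.8106474.html'."""
    return BASE + ref

print(build_message_url('1-zimmer-wohnungen-in-Stuttgart-Bad-Cannstatt.8106474.html'))
```

The resulting URL is what you would pass to driver.get() in the Selenium part of the script.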
Good luck hacking. If you need more help, don't hesitate to ask me. I will be happy to see you make this thing work again.
Thank you! That was already really helpful. After fiddling around with Selenium a bit, I got it working again!
But there is still one odd bug: Scrapy does not scrape the correct webpage, and I am not sure why. It applies the filters correctly (e.g. https://www.wg-gesucht.de/wg-zimmer-in-Berlin.8.0.1.0.html?user_filter_id=3881821&offer_filter=1&city_id=8&noDeact=1&sMin=15&wgSea=2&wgAge=28&img_only=1&ot=85079%2C163&categories%5B0%5D=0&rent_types%5B0%5D=2), but it does not select the correct location, so I get results spread all over Berlin, including regions I had filtered out. I could reproduce this issue (irregularly and seldom) by opening the filtered link in a private window.
Maybe you have an idea why that happens and what I could look into?
Have you compared the actual results on the webpage with the results Scrapy got? Maybe it is not a problem on the Scrapy side but rather with the website itself?
Also, did you update the link inside the spider, which should be under your working directory?
Can you please try to press reload in your browser and tell me if the website changes?
I tried, but all I see are the same offers from Berlin Mitte. What should I expect?
Oh okay, then the filter actually works in this case (I filtered for Mitte and FHain, XBerg).
The issue that sometimes happens (and I think it happens to Scrapy every time) is that the location filter is not loaded correctly. After I simply refreshed the page on my iPad, the "STADTTEILE" filter was loaded:
Interesting, could you make a pull request so that I can merge your code? Then I can look into it by running the script.
I just reproduced it with Beautiful Soup:
import bs4
import requests

baseUrl = 'https://www.wg-gesucht.de/wg-zimmer-in-Berlin.8.0.1.0.html?user_filter_id=3881821&offer_filter=1&city_id=8&noDeact=1&sMin=15&wgSea=2&wgAge=28&img_only=1&ot=85079%2C163&categories%5B0%5D=0&rent_types%5B0%5D=2'
page = requests.get(baseUrl)
soup = bs4.BeautifulSoup(page.content, 'html.parser')
for h3 in soup.find_all('h3', class_='truncate_title'):
    for a in h3.find_all('a'):
        print(a['href'])
https://airbnb.pvxt.net/c/1216694/264339/4273?u=www.airbnb.de/s/Berlin/homes&p.checkin=2020-08-01&p.checkout=2020-08-31&sharedid=notemp_Berlin_1_desk¶m1=de_wg_4
wg-zimmer-in-Berlin-Dahlem.4044959.html
wg-zimmer-in-Berlin-Pankow.7771731.html
wg-zimmer-in-Berlin-Mitte.8098905.html
wg-zimmer-in-Berlin-Neukoelln.5373875.html
wg-zimmer-in-Berlin-Koepenick.4691030.html
wg-zimmer-in-Berlin-Charlottenburg.8123501.html
wg-zimmer-in-Berlin-Charlottenburg.8110089.html
wg-zimmer-in-Berlin-Charlottenburg.8095287.html
wg-zimmer-in-Berlin-Zehlendorf.7384968.html
wg-zimmer-in-Berlin-Friedrichshain-Kreuzberg.8127431.html
wg-zimmer-in-Berlin-Lichtenberg.8107841.html
wg-zimmer-in-Berlin-MITTE.6126245.html
wg-zimmer-in-Berlin-Neukoelln.6365392.html
wg-zimmer-in-Berlin-Neukoelln.8042369.html
wg-zimmer-in-Berlin-Neukoelln.4934132.html
wg-zimmer-in-Berlin-Friedrichshain.8122261.html
wg-zimmer-in-Berlin-Mitte.8130226.html
wg-zimmer-in-Berlin-Adlershof.5626460.html
wg-zimmer-in-Berlin-Friedrichshain.3514132.html
wg-zimmer-in-Berlin-Zehlendorf.8127837.html
I think they have changed the link structure (it now includes the title string of the offer, as seen on the left side of each link). If someone could confirm that, I might implement a fix for this!
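If the numeric offer id is still needed anywhere, it can be recovered from the new link format with a regex. A sketch, assuming the id is always the dotted number right before ".html" (which the output above suggests, but I haven't verified against every offer type):

```python
import re

def extract_offer_id(href):
    """Return the numeric offer id from links like
    'wg-zimmer-in-Berlin-Mitte.8098905.html', or None for
    non-offer links such as the airbnb ad."""
    m = re.search(r'\.(\d+)\.html$', href)
    return m.group(1) if m else None

print(extract_offer_id('wg-zimmer-in-Berlin-Mitte.8098905.html'))  # -> 8098905
```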