nickirk / immo

A bot which monitors immoscout24 and wg-gesucht.de for new flat offers and sends requests to offers automatically.
GNU General Public License v3.0

Problem writing to href.json #14

Closed: mathouthouthou closed this issue 10 months ago

mathouthouthou commented 2 years ago

Hello, I have been trying to set up your bot, but I keep running into an issue when writing to the href.json file.

At first I was getting a 405 status code on the results page on every crawl. I fixed this by randomizing the Scrapy user agent, as described here: https://stackoverflow.com/questions/67401114/how-can-i-use-random-useragent-everytitme-when-i-send-resquest
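The linked approach is not reproduced in this thread. A minimal sketch of one way to rotate the user agent in Scrapy, using a downloader middleware, could look like the following. The class name, the `USER_AGENT_LIST` setting, and the module path in the closing comment are assumptions for illustration and are not part of this project.

```python
import random


class RandomUserAgentMiddleware:
    """Sketch of a Scrapy downloader middleware that picks a random
    User-Agent for every outgoing request (hypothetical class name)."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("at least one user agent is required")
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is an assumed custom setting; fall back to the
        # standard USER_AGENT setting when it is not defined.
        agents = crawler.settings.getlist("USER_AGENT_LIST")
        if not agents:
            agents = [crawler.settings.get("USER_AGENT")]
        return cls(agents)

    def process_request(self, request, spider):
        # Overwrite the header on each request, then return None so that
        # Scrapy continues processing the request as usual.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None


# Hypothetical settings.py entries to enable it:
# DOWNLOADER_MIDDLEWARES = {
#     "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
#     "immobot.middlewares.RandomUserAgentMiddleware": 400,
# }
```

Disabling the built-in `UserAgentMiddleware` is optional here, since this sketch sets the header itself, but it makes it clearer which component is responsible for the header.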

Now I get a 200 on that request. However, I don't see anything written to the href.json file. This is the full trace of one iteration of the scraper:

artsytech-C02FX151MD6R:immobot artsyloaner$ python3 immo.py
2022-07-03 16:42:35 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: immobot)
2022-07-03 16:42:35 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.5 (v3.10.5:f377153967, Jun  6 2022, 12:36:10) [Clang 13.0.0 (clang-1300.0.29.30)], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform macOS-11.4-x86_64-i386-64bit
2022-07-03 16:42:35 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'immobot',
 'LOG_ENABLED': 'true',
 'NEWSPIDER_MODULE': 'immobot.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['immobot.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_8 like Mac OS X) '
               'AppleWebKit/532.2 (KHTML, like Gecko) CriOS/60.0.882.0 '
               'Mobile/75M265 Safari/532.2'}
2022-07-03 16:42:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-07-03 16:42:35 [scrapy.extensions.telnet] INFO: Telnet Password: 8c08fc5f87b827be
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-07-03 16:42:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-07-03 16:42:35 [scrapy.core.engine] INFO: Spider opened
2022-07-03 16:42:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-07-03 16:42:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-07-03 16:42:35 [filelock] DEBUG: Attempting to acquire lock 4400266544 on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:35 [filelock] DEBUG: Lock 4400266544 acquired on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:35 [filelock] DEBUG: Attempting to release lock 4400266544 on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:35 [filelock] DEBUG: Lock 4400266544 released on /Users/artsyloaner/.cache/python-tldextract/3.10.5.final__3.10__22a438__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-07-03 16:42:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.immobilienscout24.de/robots.txt> (referer: None)
2022-07-03 16:42:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?numberofrooms=3.0-&price=-2000.0&livingspace=70.0-&exclusioncriteria=swapflat&pricetype=rentpermonth&geocodes=110000000702,110000000801,110000000202,110000000104,110000000401,110000000701&enteredFrom=result_list> (referer: None)
2022-07-03 16:42:36 [scrapy.core.engine] INFO: Closing spider (finished)
2022-07-03 16:42:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1301,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 17384,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 0.344543,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 7, 3, 14, 42, 36, 190857),
 'log_count/DEBUG': 7,
 'log_count/INFO': 10,
 'memusage/max': 65961984,
 'memusage/startup': 65957888,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 7, 3, 14, 42, 35, 846314)}
2022-07-03 16:42:36 [scrapy.core.engine] INFO: Spider closed (finished)
There was a problem with reading a json formatted object
Traceback (most recent call last):
  File "/Users/artsyloaner/Downloads/immo-master/immobot/immo.py", line 17, in <module>
    data = json.load(data_file)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Every time, I see an empty array in the scraped results. Do you know what might be the issue?

Thank you
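The `Expecting value: line 1 column 1 (char 0)` error is what `json.load` raises when the file is empty or does not start with valid JSON. The stats in the trace contain no `item_scraped_count`, so the spider apparently yielded no items, and an empty href.json is a plausible cause. A defensive loader, sketched here under the hypothetical name `load_offers`, would at least keep immo.py from crashing at this point. It does not fix the empty scrape itself.

```python
import json
from pathlib import Path


def load_offers(path):
    """Return the parsed JSON content of `path`, or an empty list if the
    file is missing, empty, or not valid JSON (hypothetical helper)."""
    file_path = Path(path)
    if not file_path.exists() or file_path.stat().st_size == 0:
        return []
    try:
        with file_path.open(encoding="utf-8") as data_file:
            return json.load(data_file)
    except json.JSONDecodeError:
        # The file has content but it is not JSON, for example a
        # half-written export; treat it as "no offers".
        return []
```

Returning an empty list keeps the calling code simple, at the cost of hiding the difference between "no new offers" and "the scrape failed", so logging the fallback case may be worthwhile.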

nickirk commented 2 years ago

As you can see, this project was developed several years ago, so if immobilienscout24 has implemented countermeasures against bots, this could be the result. You could try wg-gesucht.de instead.

If you really want to use it on immobilienscout24, you could try starting Scrapy on its own in interactive mode (the Scrapy shell), fetch the page you want to scrape, and see what it returns. Just follow Scrapy's beginner tutorial on how to use it: https://docs.scrapy.org/en/latest/intro/tutorial.html
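In the Scrapy shell, `response.css("a::attr(href)").getall()` lists the links on the fetched page. The same check can be sketched with only the standard library on a saved copy of the response body. The `/expose/` marker for offer links and the function name are assumptions, not taken from this project. The trace reports only 17384 response bytes for robots.txt and the results page combined, which may mean the 200 response is not the real result list.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def find_offer_links(html, marker="/expose/"):
    """Return all link targets that contain `marker` (assumed pattern)."""
    collector = HrefCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if marker in href]
```

If this returns an empty list for the saved page, the page most likely does not contain the offer list at all, and the spider's selectors are not the first thing to debug.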

fabikrah commented 2 years ago

@mathouthouthou did you manage to run the script correctly? I'm stuck at the same part

mathouthouthou commented 2 years ago

I gave up!

fabikrah commented 2 years ago

That's a pity. I found this project, https://github.com/orangecoding/fredy, which works like a charm with scrapingant and immoscout24. However, it only sends new flats matching the search criteria to a Telegram bot.