scrapfly / scrapfly-scrapers

Web scrapers for popular targets powered by Scrapfly.io
https://scrapfly.io

IndexError: list index out of range on bookingcom-scraper #4

Closed esaumell closed 9 months ago

esaumell commented 9 months ago

Scraper
Which scraper is affected? bookingcom-scraper

Environment
Python 3.10.12
Scrapfly SDK version: 0.8.8
Operating System: Ubuntu 22.04.3 LTS

Describe the bug
In a previously working environment, we suddenly get an IndexError: list index out of range error.

To reproduce:

$ git clone https://github.com/scrapfly/scrapfly-scrapers.git
$ cd scrapfly-scrapers/bookingcom-scraper
$ python3 run.py

Received Output

Traceback (most recent call last):
  File "/home/user/scrapfly-scrapers/bookingcom-scraper/run.py", line 46, in <module>
    asyncio.run(run())
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/user/scrapfly-scrapers/bookingcom-scraper/run.py", line 28, in run
    result_search = await bookingcom.scrape_search(
  File "/home/user/scrapfly-scrapers/bookingcom-scraper/bookingcom.py", line 90, in scrape_search
    _total_results = int(first_page.selector.css("h1").re(r"([\d,]+) properties found")[0].replace(",", ""))
IndexError: list index out of range

Expected Output
No errors

Screenshots
Not needed

Additional context
On September 21 it was working. It hasn't worked since then.

robbinonline commented 9 months ago

Same issue here

esaumell commented 9 months ago

From what I can see, the problem is not with the offending line itself but with the URL. The search returns no results, so the regex finds no match, and indexing the empty list with [0] raises the IndexError because the code has no conditional to cover this case.

This output:

2023-10-05 00:42:29.108 | INFO     | bookingcom:scrape_search:72 - scraping search for Malta 2023-10-05-2023-10-12
2023-10-05 00:42:32.488 | DEBUG    | bookingcom:parse_search_page:41 - parsing search page: https://www.booking.com/index.html?label=gen173nr-1FCAQoggJCDHNlYXJjaF9tYWx0YUgzWARosAKIAQGYATG4ARjIAQzYAQHoAQH4AQOIAgGoAgS4AtbU96gGwAIB0gIkYmU3Yzk4NmMtMWIyNC00YWY2LTg1NWMtYzVhMmVkNjk5OTE42AIF4AIB&sid=ffea2ce6339ec3c5418636ac90057a05&srpvid=a3329fab25ff00f3&&errorc_searchstring_not_found=ss

Should be similar to this one:

2023-09-13 22:53:35.131 | INFO     | bookingcom:scrape_search:72 - scraping search for Malta 2023-09-13-2023-09-20
2023-09-13 22:53:38.864 | DEBUG    | bookingcom:parse_search_page:41 - parsing search page: https://www.booking.com/searchresults.html?ss=Malta&checkin_year=2023&checkin_month=09&checkin_monthday=13&checkout_year=2023&checkout_month=09&checkout_monthday=20&no_rooms=1&offset=0

Tell me if I'm wrong, but this is stock code from GitHub, and to me this looks like it is caused by some change in Scrapfly's API
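The missing conditional described above could look roughly like the sketch below. It is not the repository's fix, only an illustration of guarding the regex match before using it. The function name parse_total_results is hypothetical, and it takes the heading text as a plain string instead of the selector used in bookingcom.py.

```python
import re
from typing import Optional

# Same pattern as in the traceback: a number with optional commas
# followed by " properties found".
TOTAL_RE = re.compile(r"([\d,]+) properties found")


def parse_total_results(heading_text: str) -> Optional[int]:
    """Return the result count from the page heading.

    Returns None when the heading does not contain the expected
    pattern, instead of raising IndexError on an empty match list.
    (Hypothetical helper, not part of the scraper.)
    """
    match = TOTAL_RE.search(heading_text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

A caller can then treat None as "no results page was returned" and log the URL or raise a descriptive error, which makes this kind of failure easier to diagnose than a bare IndexError.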

Granitosaurus commented 9 months ago

Thanks for the detailed report, @esaumell. This seems to be caused by a change in how Booking generates search URLs. They now require a location ID together with the search string, and some URL parameters for check-in/check-out have changed, so the scraper couldn't find any results.

I've updated the search URL generation. For the details, see this commit: https://github.com/scrapfly/scrapfly-scrapers/commit/72def4300d21b1eb8128aa76899d3e1e2b822b9a

Cheers!
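In the failing log earlier in this thread, the rejected search ended up on index.html with an errorc_searchstring_not_found query parameter. A scraper could check the final URL for that parameter before parsing. The sketch below assumes the parameter is a reliable signal, which is based only on that one log line, and the function name search_was_rejected is hypothetical. It is not part of the linked commit.

```python
from urllib.parse import parse_qs, urlsplit


def search_was_rejected(final_url: str) -> bool:
    """Return True if the final URL carries the error parameter seen
    in the failing log (errorc_searchstring_not_found).

    Assumption: Booking appends this parameter when it does not
    recognise the search string; only one observed log supports this.
    """
    # keep_blank_values so the key is detected even with an empty value
    query = parse_qs(urlsplit(final_url).query, keep_blank_values=True)
    return "errorc_searchstring_not_found" in query
```

Checking the URL before parsing lets the scraper fail with a clear message about the search request rather than failing later inside the result parser.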

esaumell commented 9 months ago

The last sentence of my previous post should have ended with "...or booking.com's code". Thank you so much, @Granitosaurus.

Best!