toby-p / rightmove_webscraper.py

Python class to scrape data from rightmove.co.uk and return listings in a pandas DataFrame object
MIT License

More rows returned than exist via the website. #41

Closed · lkdmid closed this issue 2 years ago

lkdmid commented 2 years ago

Steps to reproduce:

  1. Use the Rightmove website to build your desired search URL. At the time of posting, this example returns 51 results:

https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=USERDEFINEDAREA%5E%7B%22polylines%22%3A%22ewwgIn%60~VwehBn%7CFiv%7D%40oyr%40zej%40popEcbbAlq%7DCi__Cuy%7CCuccCmtvKx%7DoEw%7BsQnjmMiljJlk%7DJbkeDjbzGzren%40kxyUcyzCm%60yCup~H%22%7D&minBedrooms=3&maxPrice=399000&propertyTypes=detached&secondaryDisplayPropertyType=detachedshouses&maxDaysSinceAdded=1&mustHave=garden&dontShow=newHome%2Cretirement%2CsharedOwnership&furnishTypes=&keywords=

  2. Use the above URL in Python, e.g.: results = RightmoveData(url).get_results

  3. Count the rows with len(results.index). In this case the result is 55, while the website still shows 51 (no caching involved), i.e. there are 4 "extra" rows.

I've reproduced this using multiple random URLs, even in the dead of night, and I always end up with a handful of "extra" rows compared to the website (a minimal reproduction is sketched below). Any ideas please?
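For reference, a minimal sketch of the reproduction (assuming the import shown in the project README; the search URL is the one built above, truncated here for brevity):

from rightmove_webscraper import RightmoveData

url = "https://www.rightmove.co.uk/property-for-sale/find.html?..."  # search URL built on the Rightmove website
results = RightmoveData(url).get_results  # pandas DataFrame of scraped listings
print(len(results.index))                 # 55 here vs 51 shown on the website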

rorti33 commented 2 years ago

It's likely because each page of Rightmove's results starts with a featured property. Try dropping duplicates before counting the rows.
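A quick way to sanity-check that explanation (assuming the listing URL lives in a "url" column, as used in the solution below) would be something like:

unique = results.drop_duplicates(subset="url")
print(len(results.index) - len(unique.index))  # number of repeated featured/premium rows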

lkdmid commented 2 years ago

Derp! That was it, thanks! There were actually more duplicates than I realised; I guess premium listings are also repeated across multiple pages.

Solution:

# Featured/premium listings are repeated across results pages, so deduplicate by listing URL.
results = RightmoveData(url).get_results
results.drop_duplicates(subset="url", inplace=True)
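Equivalently, without mutating in place (a stylistic choice, nothing the package requires):

results = RightmoveData(url).get_results.drop_duplicates(subset="url")

The deduplicated row count should then line up with the number of results the website reports.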