toby-p / rightmove_webscraper.py

Python class to scrape data from rightmove.co.uk and return listings in a pandas DataFrame object
MIT License

Alternative dataset - sold prices #25

Open · osmya opened this issue 4 years ago

osmya commented 4 years ago

Toby

It would be interesting to create an additional class to scrape data on previously sold properties, e.g. https://www.rightmove.co.uk/house-prices/London-87490.html?soldIn=1&page=1

Thoughts?

alandinbedia commented 4 years ago

This would be so helpful!!!

toby-p commented 4 years ago

This is a nice idea - I didn't realize this data was available on the website. Will take a look when I get a chance, or feel free to submit a pull request if you want to have a go. I agree it should probably be a separate class or at least not have any impact on the current API.

alandinbedia commented 4 years ago

Thanks to both for the responses. The easiest way to find the information is actually on the 'Market info' tab. For example: I do my normal search for 'Sale' properties, and the property details then have a few tabs, such as description, floorplan, map, etc. The last one is 'Market info', which comes from the Land Registry files, so it is very useful for seeing what a house selling now previously sold for.

Please keep me updated if any enhancements along these lines are added. I am also asking a friend of mine (as I am not a developer) to see if he can figure it out, and will share if we manage.

Many thanks

oddkiva commented 4 years ago

Hello all, I am @alandinbedia 's friend.

I started from @osmya 's link to retrieve the list of sold properties. Note that I haven't formatted the history of transactions yet.

import json
import requests

from bs4 import BeautifulSoup

import pandas as pd

class SoldProperties:

    def __init__(self, url: str, get_floorplans: bool = False):
        """Initialize the scraper with a URL from the results of a property
        search performed on www.rightmove.co.uk.

        Args:
            url (str): full HTML link to a page of rightmove search results.
            get_floorplans (bool): optionally scrape links to the individual
                floor plan images for each listing (be warned this drastically
                increases runtime so is False by default; not implemented in
                this class yet).
        """
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    @staticmethod
    def _request(url: str):
        r = requests.get(url)
        return r.status_code, r.content

    def refresh_data(self, url: str = None, get_floorplans: bool = False):
        """Make a fresh GET request for the rightmove data.

        Args:
            url (str): optionally pass a new HTML link to a page of rightmove
                search results (else defaults to the current `url` attribute).
            get_floorplans (bool): optionally scrape links to the individual
                floorplan images for each listing (this drastically increases
                runtime so is False by default).
        """
        url = self.url if not url else url
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    def _validate_url(self):
        """Basic validation that the URL at least starts in the right format and
        returns status code 200."""
        real_url = "{}://www.rightmove.co.uk/{}/find.html?"
        protocols = ["http", "https"]
        types = ["property-to-rent", "property-for-sale", "new-homes-for-sale"]
        urls = [real_url.format(p, t) for p in protocols for t in types]
        conditions = [self.url.startswith(u) for u in urls]
        conditions.append(self._status_code == 200)
        if not any(conditions):
            raise ValueError(f"Invalid rightmove search URL:\n\n\t{self.url}")

    @property
    def url(self):
        return self._url

    @property
    def table(self):
        return self._results

    def _parse_page_data_of_interest(self, request_content: str):
        """Method to scrape data from a single page of search results. Used
        iteratively by the `get_results` method to scrape data from every page
        returned by the search."""
        soup = BeautifulSoup(request_content, features='lxml')

        start = 'window.__PRELOADED_STATE__ = '
        tags = soup.find_all(
            lambda tag: tag.name == 'script' and start in tag.get_text())
        if not tags:
            raise ValueError('Could not extract data from current page!')
        if len(tags) > 1:
            raise ValueError('Inconsistent data in current page!')

        json_str = tags[0].get_text()[len(start):]
        json_obj = json.loads(json_str)

        return json_obj

    def _get_properties_list(self, json_obj):
        return json_obj['results']['properties']

    def _get_results(self):
        """Build a Pandas DataFrame with all results returned by the search."""
        print('Scraping page 1')
        print('- Parsing data from page 1')
        try:
            page_data = self._parse_page_data_of_interest(self._first_page)
            properties = self._get_properties_list(page_data)
        except ValueError:
            print('Failed to get property data from page 1')
            return pd.DataFrame()

        final_results = properties

        current = page_data['pagination']['current']
        last = page_data['pagination']['last']
        if current == last:
            return pd.DataFrame.from_records(final_results)

        # Scrape each page
        for page in range(current + 1, last + 1):  # include the last page
            print('Scraping page {}'.format(page))

            # Create the URL of the specific results page:
            p_url = f"{str(self.url)}&page={page}"

            # Make the request:
            print('- Downloading data from page {}'.format(page))
            status_code, page_content = self._request(p_url)

            # Requests to scrape lots of pages eventually get status 400, so:
            if status_code != 200:
                print('Failed to access page {}'.format(page))
                continue

            # Create a temporary DataFrame of page results:
            print('- Parsing data from page {}'.format(page))
            try:
                page_data = self._parse_page_data_of_interest(page_content)
                properties = self._get_properties_list(page_data)
            except ValueError:
                print('Failed to get property data from page {}'.format(page))
                continue  # skip appending stale results from the previous page

            # Append the list of properties.
            final_results += properties

        # Transform the final results into a table.
        property_data_frame = pd.DataFrame.from_records(final_results)

        return property_data_frame

# 1. Adapt the URL here.
#    Go to: https://www.rightmove.co.uk/house-prices.html
#    Type in the region of interest.
#    Click on 'List view' so that rightmove shows the results in the browser.
#    Copy the resulting link here.
url = "https://www.rightmove.co.uk/house-prices/detail.html?country=england&locationIdentifier=REGION%5E70417&searchLocation=London+Fields&radius=0.25"

# 2. Launch the data scraping here.
sold_properties = SoldProperties(url)

# 3. Save the results somewhere.
sold_properties.table.to_csv('sold_properties.csv')
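
Since the transaction history is left unformatted above, here is a minimal sketch of how it could be flattened into one row per sale. It assumes each property record carries an 'address' field and a 'transactions' list of dicts with 'displayPrice' and 'dateSold' keys, as in the __PRELOADED_STATE__ payload parsed above; the exact field names are not guaranteed if rightmove changes the page.

import pandas as pd

# Hypothetical helper, separate from the class above: flatten each property's
# transaction history into a long-format DataFrame with one row per sale.
# The 'address', 'transactions', 'displayPrice' and 'dateSold' field names
# are assumptions based on the payload parsed above.
def flatten_transactions(table: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for _, prop in table.iterrows():
        for sale in (prop.get('transactions') or []):
            rows.append({
                'address': prop.get('address'),
                'date_sold': sale.get('dateSold'),
                'display_price': sale.get('displayPrice'),
            })
    return pd.DataFrame(rows)

# e.g. history = flatten_transactions(sold_properties.table)
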
oddkiva commented 4 years ago

Since that alone was not really what was needed, I had a look at the other page and easily re-adapted the class to retrieve the list of properties for sale.

import json
import requests

from bs4 import BeautifulSoup

import pandas as pd

class PropertiesForSale:

    def __init__(self, url: str, get_floorplans: bool = False):
        """Initialize the scraper with a URL from the results of a property
        search performed on www.rightmove.co.uk.

        Args:
            url (str): full HTML link to a page of rightmove search results.
            get_floorplans (bool): optionally scrape links to the individual
                floor plan images for each listing (be warned this drastically
                increases runtime so is False by default; not implemented in
                this class yet).
        """
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    @staticmethod
    def _request(url: str):
        r = requests.get(url)
        return r.status_code, r.content

    def refresh_data(self, url: str = None, get_floorplans: bool = False):
        """Make a fresh GET request for the rightmove data.

        Args:
            url (str): optionally pass a new HTML link to a page of rightmove
                search results (else defaults to the current `url` attribute).
            get_floorplans (bool): optionally scrape links to the individual
                floorplan images for each listing (this drastically increases
                runtime so is False by default).
        """
        url = self.url if not url else url
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    def _validate_url(self):
        """Basic validation that the URL at least starts in the right format and
        returns status code 200."""
        real_url = "{}://www.rightmove.co.uk/{}/find.html?"
        protocols = ["http", "https"]
        types = ["property-to-rent", "property-for-sale", "new-homes-for-sale"]
        urls = [real_url.format(p, t) for p in protocols for t in types]
        conditions = [self.url.startswith(u) for u in urls]
        conditions.append(self._status_code == 200)
        if not any(conditions):
            raise ValueError(f"Invalid rightmove search URL:\n\n\t{self.url}")

    @property
    def url(self):
        return self._url

    @property
    def table(self):
        return self._results

    def _parse_page_data_of_interest(self, request_content: str):
        """Method to scrape data from a single page of search results. Used
        iteratively by the `get_results` method to scrape data from every page
        returned by the search."""
        soup = BeautifulSoup(request_content, features='lxml')

        start = 'window.jsonModel = '
        tags = soup.find_all(
            lambda tag: tag.name == 'script' and start in tag.get_text())
        if not tags:
            raise ValueError('Could not extract data from current page!')
        if len(tags) > 1:
            raise ValueError('Inconsistent data in current page!')

        json_str = tags[0].get_text()[len(start):]
        json_obj = json.loads(json_str)

        return json_obj

    def _get_properties_list(self, json_obj):
        return json_obj['properties']

    def _get_results(self):
        """Build a Pandas DataFrame with all results returned by the search."""
        print('Scraping page 1')
        print('- Parsing data from page 1')
        try:
            page_data = self._parse_page_data_of_interest(self._first_page)
            properties = self._get_properties_list(page_data)
        except ValueError:
            print('Failed to get property data from page 1')
            return pd.DataFrame()

        final_results = properties

        page = 2
        # 'last' is the result offset of the final page; on the first page the
        # 'next' value is the offset of page 2, i.e. the number of results per page.
        last = int(page_data['pagination']['last'])
        chunk_size = int(page_data['pagination']['next'])

        # Scrape each page
        while True:
            next_index = (page - 1) * chunk_size
            if next_index > last:
                print('Finished!')
                break

            print('Scraping page {}'.format(page))

            # Create the URL of the specific results page (offset-based):
            p_url = f"{self.url}&index={next_index}"

            # Make the request:
            print('- Downloading data from page {}'.format(page))
            status_code, page_content = self._request(p_url)

            # Requests to scrape lots of pages eventually get status 400, so:
            if status_code != 200:
                print('Failed to access page {}'.format(page))
                page += 1  # advance anyway, or this would loop forever
                continue

            # Create a temporary DataFrame of page results:
            print('- Parsing data from page {}'.format(page))
            try:
                page_data = self._parse_page_data_of_interest(page_content)
                properties = self._get_properties_list(page_data)
            except ValueError:
                print('Failed to get property data from page {}'.format(page))
                page += 1
                continue  # skip appending stale results from the previous page

            # Append the list of properties.
            final_results += properties

            # Go to the next page.
            page += 1

        # Transform the final results into a table.
        property_data_frame = pd.DataFrame.from_records(final_results)

        return property_data_frame

# 1. Adapt the URL here.
#    Go to https://www.rightmove.co.uk and run a normal 'For sale' search for
#    the region of interest, then copy the resulting link here.
url = "https://www.rightmove.co.uk/property-for-sale/find.html?searchType=SALE&locationIdentifier=REGION%5E70417&insId=1&radius=0.0&minPrice=&maxPrice=&minBedrooms=&maxBedrooms=&displayPropertyType=&maxDaysSinceAdded=&_includeSSTC=on&sortByPriceDescending=&primaryDisplayPropertyType=&secondaryDisplayPropertyType=&oldDisplayPropertyType=&oldPrimaryDisplayPropertyType=&newHome=&auction=false"

# 2. Launch the data scraping here.
properties_for_sale = PropertiesForSale(url)

# 3. Save the results somewhere.
properties_for_sale.table.to_csv('properties_for_sale.csv')
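
One detail worth noting about the adaptation: the sold-prices search above paginates with a 'page' query parameter, while this for-sale search paginates by result offset via the 'index' parameter, which is why this version computes next_index from the page size in a while-loop rather than iterating over page numbers.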

HTH

toby-p commented 4 years ago

Thanks for doing this, will take a proper look at it when I get the time to add it to the package.

p2327 commented 4 years ago

@davidok8 hey, I think your first class is incredibly useful, especially as it gives the exact postcode and price history.

I think the output could be more streamlined, so I'll work on that and open a PR @toby-p

I am not sure what the PropertiesForSale class does, though?

edit: grammar

oddkiva commented 4 years ago

Glad to know the first one is useful.

The second class merely returns the list of properties not yet sold. True, it does not contain any market information (probably the market history of the area).

On the other hand, you can find complementary information (GPS location, size in sq ft, addedOrReduced, whether the area is in development, etc.). You still have to reformat the data, though.
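
As a minimal sketch of that reformatting (assuming the for-sale records expose a 'location' dict; the 'latitude'/'longitude' key names inside it are an unverified assumption about the payload layout):

import pandas as pd

# Hypothetical post-processing of PropertiesForSale results: pull the GPS
# coordinates out of the nested 'location' dict into flat columns. The
# 'latitude'/'longitude' key names are assumptions, not verified field names.
def add_coordinates(table: pd.DataFrame) -> pd.DataFrame:
    df = table.copy()
    df['lat'] = df['location'].apply(lambda loc: (loc or {}).get('latitude'))
    df['lng'] = df['location'].apply(lambda loc: (loc or {}).get('longitude'))
    return df.drop(columns=['location'])

# e.g. for_sale = add_coordinates(properties_for_sale.table)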

p2327 commented 4 years ago

@davidok8

check latest commit in this PR

You can access a processed df by invoking .processed_data on a SoldProperties object

Note that some of the code is redundant - I will trim it later.

Changes to your class:

# imports (trimmed to what the code below actually uses)
import ast  # only needed if the commented-out literal_eval lines are restored
import json
from datetime import datetime

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Env: regex to split a display address into street address + full UK
# postcode, and to capture just the outward code (first half of the postcode).
address_pattern = r'([\s\S]+?)([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?[0-9]?\s[0-9]?[A-Za-z][A-Za-z])'
outwardcode_pattern = r'([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?[0-9]?)'

# Helpers: each extractor reads only entry[0] of a property's transaction
# list, i.e. the most recent sale.
def extract_price(series):
    prices = []
    for entry in series:
        prices.append(int(entry[0]['displayPrice'].strip('£').replace(',', '')))
    return prices

def extract_date(series):
    dates = []
    for entry in series:
        dates.append(datetime.strptime(entry[0]['dateSold'], '%d %b %Y'))
    return dates

def extract_tenure(series):
    tenures = []
    for entry in series:
        tenures.append(entry[0]['tenure'])
    return tenures

def extract_coords(series, lat=False):
    coords = []
    if lat:
        for entry in series:
            coords.append(entry['lat'])
    else:
        for entry in series:
            coords.append(entry['lng'])
    return coords

class SoldProperties:

    def __init__(self, url: str, get_floorplans: bool = False):
        """Initialize the scraper with a URL from the results of a property
        search performed on www.rightmove.co.uk.

        Args:
            url (str): full HTML link to a page of rightmove search results.
            get_floorplans (bool): optionally scrape links to the individual
                floor plan images for each listing (be warned this drastically
                increases runtime so is False by default; not implemented in
                this class yet).
        """
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    @staticmethod
    def _request(url: str):
        r = requests.get(url)
        return r.status_code, r.content

    def refresh_data(self, url: str = None, get_floorplans: bool = False):
        """Make a fresh GET request for the rightmove data.

        Args:
            url (str): optionally pass a new HTML link to a page of rightmove
                search results (else defaults to the current `url` attribute).
            get_floorplans (bool): optionally scrape links to the individual
                floorplan images for each listing (this drastically increases
                runtime so is False by default).
        """
        url = self.url if not url else url
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    def _validate_url(self):
        """Basic validation that the URL at least starts in the right format and
        returns status code 200."""
        real_url = "{}://www.rightmove.co.uk/{}/find.html?"
        protocols = ["http", "https"]
        types = ["property-to-rent", "property-for-sale", "new-homes-for-sale"]
        urls = [real_url.format(p, t) for p in protocols for t in types]
        conditions = [self.url.startswith(u) for u in urls]
        conditions.append(self._status_code == 200)
        if not any(conditions):
            raise ValueError(f"Invalid rightmove search URL:\n\n\t{self.url}")

    @property
    def url(self):
        return self._url

    @property
    def table(self):
        return self._results

    def _parse_page_data_of_interest(self, request_content: str):
        """Method to scrape data from a single page of search results. Used
        iteratively by the `get_results` method to scrape data from every page
        returned by the search."""
        soup = BeautifulSoup(request_content, features='lxml')

        start = 'window.__PRELOADED_STATE__ = '
        tags = soup.find_all(
            lambda tag: tag.name == 'script' and start in tag.get_text())
        if not tags:
            raise ValueError('Could not extract data from current page!')
        if len(tags) > 1:
            raise ValueError('Inconsistent data in current page!')

        json_str = tags[0].get_text()[len(start):]
        json_obj = json.loads(json_str)

        return json_obj

    def _get_properties_list(self, json_obj):
        return json_obj['results']['properties']

    def _get_results(self):
        """Build a Pandas DataFrame with all results returned by the search."""
        print('Scraping page 1')
        print('- Parsing data from page 1')
        try:
            page_data = self._parse_page_data_of_interest(self._first_page)
            properties = self._get_properties_list(page_data)
        except ValueError:
            print('Failed to get property data from page 1')
            return pd.DataFrame()

        final_results = properties

        current = page_data['pagination']['current']
        last = page_data['pagination']['last']
        if current == last:
            return pd.DataFrame.from_records(final_results)

        # Scrape each page
        for page in range(current + 1, last + 1):  # include the last page
            print('Scraping page {}'.format(page))

            # Create the URL of the specific results page:
            p_url = f"{str(self.url)}&page={page}"

            # Make the request:
            print('- Downloading data from page {}'.format(page))
            status_code, page_content = self._request(p_url)

            # Requests to scrape lots of pages eventually get status 400, so:
            if status_code != 200:
                print('Failed to access page {}'.format(page))
                continue

            # Create a temporary DataFrame of page results:
            print('- Parsing data from page {}'.format(page))
            try:
                page_data = self._parse_page_data_of_interest(page_content)
                properties = self._get_properties_list(page_data)
            except ValueError:
                print('Failed to get property data from page {}'.format(page))
                continue  # skip appending stale results from the previous page

            # Append the list of properties.
            final_results += properties

        # Transform the final results into a table.
        property_data_frame = pd.DataFrame.from_records(final_results)

        return property_data_frame

    @property
    def processed_data(self):
        df = self._results

        address = df['address'].str.extract(address_pattern, expand=True).to_numpy()
        outwardcodes = df['address'].str.extract(outwardcode_pattern, expand=True).to_numpy()

        df = (df.drop(['address', 'images', 'hasFloorPlan', 'detailUrl'], axis=1)
                .assign(address=address[:, 0])
                .assign(postcode=address[:, 1])
                .assign(outwardcode=outwardcodes[:, 0])
                #.assign(transactions=df.transactions.apply(ast.literal_eval))
                #.assign(location=df.location.apply(ast.literal_eval))
                .assign(last_price=lambda x: extract_price(x.transactions))
                .assign(sale_date=lambda x: extract_date(x.transactions))
                .assign(tenure=lambda x: extract_tenure(x.transactions))
                .assign(lat=lambda x: extract_coords(x.location, lat=True))
                .assign(lng=lambda x: extract_coords(x.location))
                .drop(['transactions', 'location'], axis=1)
                .reindex(columns=['last_price', 
                                'sale_date', 
                                'propertyType',
                                'bedrooms',
                                'bathrooms', 
                                'tenure', 
                                'address', 
                                'postcode', 
                                'outwardcode', 
                                'lat', 
                                'lng'])
        )
        return df
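
For reference, a minimal usage sketch of the class above, reusing the sold-prices search URL from earlier in the thread:

url = "https://www.rightmove.co.uk/house-prices/detail.html?country=england&locationIdentifier=REGION%5E70417&searchLocation=London+Fields&radius=0.25"
sold_properties = SoldProperties(url)
processed = sold_properties.processed_data
processed.to_csv('sold_properties_processed.csv', index=False)
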
andrewwilso commented 4 years ago

This is extremely useful. Is it possible to include the get_floorplans option as in the main class?