HarvsG closed this issue 4 years ago
This scrapes the floorplan URL for each property. I then run OCR on the floorplan image to extract information.
# Get the floorplan URL from each property page
floorplan_urls = []
for weblink in weblinks:
    rc = self.make_request(weblink)
    tree = html.fromstring(rc[0])
    xp_floorplan_url = """//*[@id="floorplanTabs"]/div[2]/div[2]/img/@src"""
    floorplan_url = tree.xpath(xp_floorplan_url)
    if not floorplan_url:
        # No floorplan on this page: keep a placeholder so the list stays aligned with weblinks
        floorplan_urls.append(np.nan)
    else:
        floorplan_urls.append(floorplan_url[0])
# Store the data in a Pandas DataFrame:
data = [price_pcm, titles, addresses, weblinks, agent_urls, floorplan_urls]
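The reason for appending a NaN placeholder is that every list in `data` needs one entry per property, otherwise the columns no longer line up. A minimal, dependency-free sketch of that idea follows. The function name `collect_floorplan_urls`, the `find_floorplan_srcs` callable, and the example.com URLs are hypothetical stand-ins for "request the page and run the XPath", not part of the scraper itself.

```python
import math


def collect_floorplan_urls(weblinks, find_floorplan_srcs):
    """Return one entry per weblink: the first floorplan src, or NaN if none."""
    floorplan_urls = []
    for weblink in weblinks:
        matches = find_floorplan_srcs(weblink)
        # An empty result means the page has no floorplan image
        floorplan_urls.append(matches[0] if matches else math.nan)
    return floorplan_urls


# Hypothetical lookup table standing in for the request + XPath step
fake_pages = {
    "https://example.com/property/1": ["https://example.com/floorplan1.png"],
    "https://example.com/property/2": [],
}
weblinks = list(fake_pages)
floorplan_urls = collect_floorplan_urls(weblinks, fake_pages.get)
```

With this input, `floorplan_urls` has the same length as `weblinks`, and the second entry is NaN because that page returned no matches.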
On very large scrapes I was getting errors about values not being of string type, so I added .astype(str) to make the code more robust.
# Extract postcodes to a separate column:
pat = r"\b([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?)\b"
results["postcode"] = results["address"].astype(str).str.extract(pat, expand=True)
# Extract number of bedrooms from "type" to a separate column:
pat = r"\b([\d][\d]?)\b"
results["number_bedrooms"] = results.type.astype(str).str.extract(pat, expand=True)
results.loc[results["type"].astype(str).str.contains("studio", case=False), "number_bedrooms"] = 0
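To see what the two patterns match, and why casting to `str` first helps, here is a small sketch using only the standard `re` module instead of pandas. The helper names `extract_postcode` and `extract_bedrooms` and the sample inputs are made up for illustration. Unlike the pandas code above, this version converts the bedroom count to `int`.

```python
import re

# Same patterns as in the pandas code above
POSTCODE_PAT = r"\b([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?)\b"
BEDROOMS_PAT = r"\b([\d][\d]?)\b"


def extract_postcode(address):
    # str() guards against non-string values such as NaN
    match = re.search(POSTCODE_PAT, str(address))
    return match.group(1) if match else None


def extract_bedrooms(property_type):
    text = str(property_type)
    if "studio" in text.lower():
        return 0  # studios are counted as zero bedrooms
    match = re.search(BEDROOMS_PAT, text)
    return int(match.group(1)) if match else None
```

For example, `extract_postcode("Camden Road, London, NW1")` gives `"NW1"`, and a missing value such as `float("nan")` gives `None` rather than raising. Calling `re.search` on the float directly would raise a `TypeError`, which is the kind of failure the cast avoids.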
I was getting an error that traced back to this line for reasons I couldn't explain, so I butchered in a try/except block:
try:
    return int(tree.xpath(xpath)[0].replace(",", ""))
except Exception:
    # Fall back to a fixed count so the scrape can continue
    print('error extracting the result count header')
    return 1050
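The cause of the original error is not known, but the two obvious ways this line can fail are an empty XPath result and text that is not a number. A hedged sketch that separates the parsing from the page fetch and catches only those cases is below. The function name `parse_result_count` and its `default` parameter are hypothetical, and narrowing the exception types is an assumption about what was going wrong.

```python
def parse_result_count(matches, default=1050):
    """Parse a result count such as "1,234" from a list of XPath text matches."""
    try:
        return int(matches[0].replace(",", ""))
    except (IndexError, ValueError, AttributeError) as exc:
        # IndexError: no match; ValueError: not a number; AttributeError: not a string
        print(f"error extracting the result count header: {exc!r}")
        return default
```

Printing the exception makes it easier to find out which of these cases is actually occurring on large scrapes.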
This looks cool, thanks. I'll try to get this working in my version.
@HarvsG I've included the floorplan scraping functionality in the new version just published; you just have to pass get_floorplans=True when instantiating the object. I've found that this drastically increases the runtime, I guess because it makes so many extra requests, so it is off by default.
I have made some changes on a fork. Unfortunately as I did a lot of renaming it wouldn't be possible for me to put them in a pull request.
Mostly I butchered in functionality to scrape floor plan URLs and changed some of the pandas processing to cast .astype(str) to prevent some errors I was getting. I wanted to share these changes with you in case they were of use.