juliomalegria / python-craigslist

Simple Craigslist wrapper
MIT No Attribution

Include mapaddress text #75

Closed BigFav closed 4 years ago

BigFav commented 4 years ago

Geocodes appear to be approximate in a lot of cases. In certain instances, the listing will have the address in the mapaddress div tag, and it doesn't seem like this library reads that. This is a feature request to include that information.

Example page - https://york.craigslist.org/apa/d/york-room-for-rent/7104679926.html

Example return:

{
  'id': '7104679926',
  'repost_of': None,
  'name': 'Room for Rent',
  'url': 'https://york.craigslist.org/apa/d/york-room-for-rent/7104679926.html',
  'datetime': '2020-05-07 08:47',
  'last_updated': '2020-05-07 08:47',
  'price': '$669',
  'where': None,
  'has_image': True,
  'geotag': (39.955682, -76.718121)
}

Page with inspection open: Image with inspection

irahorecka commented 4 years ago

Unfortunately, I believe this is a craigslist.org problem.


BigFav commented 4 years ago

This is just a scraper, which is why I identified the tag that displays this. This tag is in the output of the following code (which is found in this package):

import requests
from bs4 import BeautifulSoup

response = requests.get('http://york.craigslist.org/apa/d/york-room-for-rent/7104679926.html')
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

Unclear how this is a craigslist problem given it is output from the craigslist request call.

irahorecka commented 4 years ago

This is not this library's problem; it is Craigslist's. Check out the HTML source for the link you provided (btw the link will expire eventually).

html source

EDIT: Ok I think I misinterpreted your question. Do you want a library extension that gets the geotag of the address listed in the post?

BigFav commented 4 years ago

As stated above, this issue is to include the mapaddress div tag in the result, which in this case is "239 E Boundary Ave". Please look at what is highlighted in the original image on the inspection frame for reference.
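For readers following along, the request above amounts to something like the following. This is a minimal sketch, not the library's code: `SAMPLE_HTML` is a made-up, heavily trimmed stand-in for the listing page (only the `mapaddress` div and its text come from this thread), and `extract_mapaddress` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a listing page; the surrounding tags are assumptions,
# only <div class="mapaddress"> and its text are taken from the thread.
SAMPLE_HTML = """
<section class="userbody">
  <div class="mapbox">
    <div class="mapaddress">239 E Boundary Ave</div>
  </div>
</section>
"""


def extract_mapaddress(html):
    """Return the text of the first mapaddress div, or None if there is none."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('div', {'class': 'mapaddress'})
    if tag is None:
        # Many listings carry no street address at all.
        return None
    return tag.get_text(strip=True)


address = extract_mapaddress(SAMPLE_HTML)
missing = extract_mapaddress('<html><body></body></html>')
```

Returning `None` for listings without the tag keeps the caller free to decide whether to emit the field at all.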

juliomalegria commented 4 years ago

Hey @BigFav, I think this would be a great addition. Thanks for the feature request! Seems pretty simple to do, I'll work on it today or tomorrow. However, just FYI, this won't get the lat,lng from that address. The geotag will stay the same, taken from the approximate info provided by Craigslist, but there'll be a new field "address" if the mapaddress is present. Your example return was a bit confusing, but it would be something like this:

{
  'id': '7104679926',
  'repost_of': None,
  'name': 'Room for Rent',
  'url': 'https://york.craigslist.org/apa/d/york-room-for-rent/7104679926.html',
  'datetime': '2020-05-07 08:47',
  'last_updated': '2020-05-07 08:47',
  'price': '$669',
  'where': None,
  'has_image': True,
  'geotag': (39.955682, -76.718121),
  'address': '239 E Boundary Ave'  # <<<< NEW
}

Thanks @irahorecka for stepping in to help 👍

BigFav commented 4 years ago

It definitely doesn't make sense to use it as the geotag; I was just noting that it is more accurate than the geotag in some cases. Where does where come from? From my naive perspective, it seems like it could go there.

juliomalegria commented 4 years ago

I had forgotten about where. This comes from result-hood in the list results. In the US, this always seems to be only the neighborhood, e.g.

Screen Shot 2020-06-03 at 9 47 40 PM

However in other countries, result-hood seems to contain the full address:

Screen Shot 2020-06-03 at 9 49 32 PM

And in others (for example here in Berlin) it is a mix of both: only the city/neighborhood, or the full address:

Screen Shot 2020-06-03 at 9 51 32 PM

I'm hesitating whether to just override where with the value of mapaddress if it exists, or to have 2 separate fields. I guess option C would be to concatenate them. For example in:

https://sfbay.craigslist.org/pen/apa/d/san-bruno-updated-1br-1ba-with-parking/7134532065.html

mapaddress is "443 San Anselmo Ave N near Mastick and Martin" and where is "san bruno". The resulting where could be: "443 San Anselmo Ave N near Mastick and Martin, San Bruno".

Any thoughts?
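To make the three options concrete, here is a small sketch. The function name `apply_mapaddress`, the strategy labels, and the `address` key are illustrative assumptions rather than anything in the library; the input values are the San Bruno example above.

```python
def apply_mapaddress(result, mapaddress, strategy='separate'):
    """Return a copy of `result` with `mapaddress` merged in by the given strategy."""
    updated = dict(result)
    if not mapaddress:
        # Nothing to merge: leave the result untouched.
        return updated
    if strategy == 'override':
        # Option A: replace where with the more precise address.
        updated['where'] = mapaddress
    elif strategy == 'separate':
        # Option B: keep where as-is and add a second field.
        updated['address'] = mapaddress
    elif strategy == 'concat':
        # Option C: append the (title-cased) neighborhood to the address.
        where = updated.get('where')
        updated['where'] = f"{mapaddress}, {where.title()}" if where else mapaddress
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return updated


base = {'where': 'san bruno'}
addr = '443 San Anselmo Ave N near Mastick and Martin'

overridden = apply_mapaddress(base, addr, 'override')
separate = apply_mapaddress(base, addr, 'separate')
concatenated = apply_mapaddress(base, addr, 'concat')
```

Only the separate-fields variant is lossless: the other two overwrite the original where value, so a user who wanted just the neighborhood could no longer get it.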

BigFav commented 4 years ago

Two separate fields is the simplest to me as it puts the power in the hands of the user; this tool is simply a scraper. My question was more for information purposes, apologies for the confusion.

KeeonTabrizi commented 4 years ago

@juliomalegria wondering if there's any progress on implementing this new extracted object. I'm considering doing something similar, but if the work is already in flight, I think it makes sense to check in.

This is some very hacky code that works for me. I tried to more elegantly extract the text from the BS4 object but it was causing some strange issues for me. This brute force extract / string replacement works:

        map_address = soup.find_all('div', {'class': 'mapaddress'})
        if map_address:  # find_all returns an empty list when the div is absent
            map_address = str(map_address).replace('[<div class="mapaddress">', '').replace('</div>]', '')
            result['map_address'] = map_address

That block can be piggybacked onto the include_details call. So the code could go after: https://github.com/juliomalegria/python-craigslist/blob/c4649ab5026affd044280be77380cf9e063b1b83/craigslist/base.py#L315

Alternatively, you could use xpaths + lxml against the target //section/div[1]/div/div[2] which would give the same thing.

Likely some try/except blocks would be needed in case something changed from the html rendering on CL.
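The lxml route with a guard against parsing failures might look roughly like this. It is a sketch under assumptions: it selects by class (`//div[@class="mapaddress"]`) rather than the positional path quoted above, on the assumption that a class match is less sensitive to layout changes; `SAMPLE_HTML` is a made-up stand-in page, and `mapaddress_via_xpath` is a hypothetical name.

```python
from lxml import etree
from lxml import html as lxml_html

# Hypothetical stand-in for a listing page; only the mapaddress div is from the thread.
SAMPLE_HTML = """
<html><body>
<section class="userbody">
  <div class="mapbox">
    <div class="mapaddress">239 E Boundary Ave</div>
  </div>
</section>
</body></html>
"""


def mapaddress_via_xpath(page_source):
    """Return the mapaddress text via XPath, or None if it cannot be found."""
    try:
        tree = lxml_html.fromstring(page_source)
        matches = tree.xpath('//div[@class="mapaddress"]/text()')
    except etree.LxmlError:
        # Treat an unparseable page the same as a page without the tag.
        return None
    # xpath() returns a list of text nodes; an empty list means no such div.
    return matches[0].strip() if matches else None


found = mapaddress_via_xpath(SAMPLE_HTML)
absent = mapaddress_via_xpath('<html><body><p>no address here</p></body></html>')
```

Whether a parse failure should be swallowed or surfaced is a judgment call; returning `None` matches how an optional field would behave for listings that simply have no address.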

juliomalegria commented 4 years ago

Hey @KeeonTabrizi, I'm sorry. This change is very much in flight. As a matter of fact, it was ready, I had just forgotten to push it. I'll do it now and release a new version.

juliomalegria commented 4 years ago

This was resolved in commit 1f8c44c. I've just released version 1.0.9 on PyPI.