juliomalegria / python-craigslist

Simple Craigslist wrapper
MIT No Attribution

pages may have less than RESULTS_PER_REQUEST #26

Closed AlJohri closed 7 years ago

AlJohri commented 7 years ago
if (total_so_far - start) < RESULTS_PER_REQUEST:
    break

Depending on the filters used, I'm seeing many pages with fewer than 100 results, even when the search spans multiple pages. Here is an example URL:

https://washingtondc.craigslist.org/search/apa?search_distance=1&postal=20071&availabilityMode=0

Running document.querySelectorAll("#sortable-results .row").length in the browser console returns 51, even though the header at the top shows "1 to 100 of 1177".
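
To make the failure mode concrete, here is a minimal sketch of a crawl loop that uses the check quoted above. It is a simulation only: the function name crawl_with_short_page_check and the per-page row counts are hypothetical, and no request is made to Craigslist.

```python
RESULTS_PER_REQUEST = 100  # page size the loop assumes


def crawl_with_short_page_check(rows_per_page):
    """Return (pages_fetched, rows_seen) when a short page ends the crawl.

    rows_per_page is a list of simulated row counts, one entry per page.
    """
    total_so_far = 0
    pages_fetched = 0
    for rows in rows_per_page:
        start = total_so_far
        total_so_far += rows
        pages_fetched += 1
        # The check from the issue: fewer rows than expected is treated
        # as "this was the last page".
        if (total_so_far - start) < RESULTS_PER_REQUEST:
            break
    return pages_fetched, total_so_far


# Hypothetical listing: 12 pages that each render only 51 rows.
pages_fetched, rows_seen = crawl_with_short_page_check([51] * 12)
print(pages_fetched, rows_seen)
```

With these simulated counts the loop stops after the first page, so the remaining pages are never requested.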

juliomalegria commented 7 years ago

You're right, I hadn't realized this. I'm still thinking about what would be a good way to fix it, though. Any ideas?

gregv21v commented 7 years ago

You could scrape the page for that number. The selector would be span.rangeTo.

AlJohri commented 7 years ago

yeah what I do right now is:

import lxml.html
import requests

def get_current_offset_from_response(doc):
    return int(doc.cssselect("#searchform span.pagenum span.rangeFrom")[0].text)

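def get_number_of_posts_on_current_page_from_response(doc):
    # rangeTo is the upper bound shown in the page header ("1 to 100");
    # on the first page it is the number of posts the header reports
    return int(doc.cssselect("#searchform span.pagenum span.rangeTo")[0].text)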

def get_num_total_posts_from_response(doc):
    return int(doc.cssselect("#searchform span.pagenum span.totalcount")[0].text)

# get_query_url, area, sort and kwargs are defined elsewhere (not shown here)
doc = lxml.html.fromstring(requests.get(get_query_url(
        area, "search", offset=0, sort=sort, **kwargs)).content)
num_total_posts = get_num_total_posts_from_response(doc)
num_posts_on_page = get_number_of_posts_on_current_page_from_response(doc)
# yield posts on first page

# step by the nominal page size of 100, even if a page renders fewer rows
for offset in range(100, num_total_posts, 100):
    doc = lxml.html.fromstring(requests.get(get_query_url(
            area, "search", offset=offset, sort=sort, **kwargs)).content)
    # yield posts on this page
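
For anyone who wants to try the idea without network access, below is a rough, self-contained sketch of the same approach: read the counters from the page header, then derive the offsets from the total instead of from the row count. It swaps lxml and cssselect for the standard library's html.parser so that it runs without third-party packages. The names PaginationParser, page_offsets and SAMPLE_HTML are hypothetical, and the sample markup is only an assumption modeled on the selectors mentioned in this thread, not a copy of a real Craigslist page.

```python
from html.parser import HTMLParser


class PaginationParser(HTMLParser):
    """Collect the integer text of span.rangeFrom, span.rangeTo, span.totalcount."""

    WANTED = ("rangeFrom", "rangeTo", "totalcount")

    def __init__(self):
        super().__init__()
        self.counters = {}
        self._current = None  # class name of the span being read, if any

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in self.WANTED:
            # Keep only the first occurrence of each counter.
            if name in classes and name not in self.counters:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.counters[self._current] = int(data.strip())
            self._current = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None


def page_offsets(total, step=100):
    """Offsets to request, derived from the total rather than from row counts."""
    return list(range(0, total, step))


# Assumed markup, for illustration only.
SAMPLE_HTML = (
    '<form id="searchform"><span class="pagenum"><span class="range">'
    '<span class="rangeFrom">1</span> - <span class="rangeTo">100</span></span>'
    ' / <span class="totalcount">1177</span></span></form>'
)

parser = PaginationParser()
parser.feed(SAMPLE_HTML)
offsets = page_offsets(parser.counters["totalcount"])
print(parser.counters, len(offsets), offsets[-1])
```

Because the offsets come from totalcount, a page that renders fewer than 100 rows no longer ends the crawl early.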