Closed AlJohri closed 7 years ago
You're right. I didn't realize this. I'm still thinking about what would be a good way to fix it, though. Any ideas?
You could scrape the page for that number. The selector would be span.rangeTo.
yeah what I do right now is:
import lxml.html
import requests

def get_current_offset_from_response(doc):
    return int(doc.cssselect("#searchform span.pagenum span.rangeFrom")[0].text)

def get_number_of_posts_on_current_page_from_response(doc):
    return int(doc.cssselect("#searchform span.pagenum span.rangeTo")[0].text)

def get_num_total_posts_from_response(doc):
    return int(doc.cssselect("#searchform span.pagenum span.totalcount")[0].text)

# fromstring() needs the response body (.content), not the Response object
doc = lxml.html.fromstring(requests.get(get_query_url(
    area, "search", offset=0, sort=sort, **kwargs)).content)
num_total_posts = get_num_total_posts_from_response(doc)
num_posts_on_page = get_number_of_posts_on_current_page_from_response(doc)
# yield posts on first page
for offset in range(100, num_total_posts, 100):
    doc = lxml.html.fromstring(requests.get(get_query_url(
        area, "search", offset=offset, sort=sort, **kwargs)).content)
    # yield posts on this page
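One possible variation on the loop above, as a rough sketch rather than a tested fix: instead of hard-coding a step of 100, advance the offset to whatever span.rangeTo reports on each page. The `Page` tuple, `iter_posts`, and the `fetch_page` callable are hypothetical names made up for this illustration; `fetch_page(offset)` stands in for the request-and-parse code above. It assumes the offset is 0-based and rangeTo is the 1-based index of the last result on the page, which matches offset=0 showing "1 to 100".

```python
from typing import Callable, Iterator, List, NamedTuple


class Page(NamedTuple):
    """Hypothetical container for what one search page reports."""
    range_to: int      # value of span.rangeTo on this page
    total: int         # value of span.totalcount
    posts: List[int]   # whatever was scraped from this page


def iter_posts(fetch_page: Callable[[int], Page]) -> Iterator[int]:
    """Yield posts page by page, stepping by the page-reported rangeTo."""
    offset = 0
    while True:
        page = fetch_page(offset)
        yield from page.posts
        # Stop at the last page, or if the page reports no forward progress
        # (guards against looping forever on an unexpected response).
        if page.range_to >= page.total or page.range_to <= offset:
            break
        # Assumption: the next 0-based offset equals the 1-based rangeTo.
        offset = page.range_to
```

With this shape the loop keeps working if the site ever changes its page size, because the step comes from the page itself.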
Depending on the filters used, I'm seeing many pages with fewer than 100 results even when there are multiple pages. Here is an example URL:
https://washingtondc.craigslist.org/search/apa?search_distance=1&postal=20071&availabilityMode=0
Run
document.querySelectorAll("#sortable-results .row").length == 51
in the browser console and it evaluates to true, despite the top of the page showing "1 to 100 of 1177".
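The same check can be sketched in Python: count the rows actually present and compare that with what the header claims. This is only an illustration. It uses the standard library's html.parser instead of lxml so it runs without extra dependencies, it assumes reasonably well-formed markup (every non-void tag is closed), and `RowCounter`, `count_result_rows`, and `rows_missing` are hypothetical names, not part of any existing code.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RowCounter(HTMLParser):
    """Count elements with class "row" nested inside id="sortable-results"."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # nesting depth inside #sortable-results (0 = outside)
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if "row" in (attrs.get("class") or "").split():
                self.rows += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif attrs.get("id") == "sortable-results" and tag not in VOID_TAGS:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def count_result_rows(html: str) -> int:
    """Return how many result rows are really in the page."""
    counter = RowCounter()
    counter.feed(html)
    return counter.rows


def rows_missing(range_from: int, range_to: int, rows_found: int) -> int:
    """Rows the header claims (rangeFrom..rangeTo inclusive) minus rows found."""
    return (range_to - range_from + 1) - rows_found
```

For the example URL above, the header says 1 to 100 and the page has 51 rows, so `rows_missing(1, 100, 51)` gives 49.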