html scraping with grequests

Johnny-Courage020 commented 7 years ago

Hi, I'm very green when it comes to Python (or programming in general), but I've written a script that goes through about 16,000 URLs and extracts some values from each page's HTML.

I'm trying to do this asynchronously with grequests, but I'm having a hard time understanding how to get the actual HTML code for all of the pages simultaneously.

To do this synchronously, I'm using the following code:

import requests
from lxml import html

for url in list_of_urls:
    # fetch one page at a time, then parse it
    response = requests.get(url)
    html_tree = html.fromstring(response.content)
    name = html_tree.xpath('//span[@class = "header_name"]/text()', smart_strings=False)

Here `name` is one of the things I'm extracting; there are more, but I don't think you'd appreciate me blasting this issue with useless code. Needless to say, the above code takes a fuckton of time to process 16,000 URLs :p

As far as my (very very limited) understanding goes, grequests does:

`unsent_request = (grequests.get(url) for url in urls)` creates a list of unsent requests, and `results = grequests.map(unsent_request)` issues all of the requests at the same time and waits for all of them to complete.

How do I get an html_tree into Python, for parsing purposes, using grequests?

Thanks a million^million

platypus-supply commented 7 years ago

This isn't an issue?

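Either use a callback:

import grequests
from lxml import html

def loaded(response, *args, **kwargs):
    html_tree = html.fromstring(response.content)
    # add your processing

unsent_request = (grequests.get(url, hooks={'response': loaded}) for url in urls)
grequests.map(unsent_request)  # sends the requests; the hook runs for each response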

or just iterate the responses:

for response in results:
    if response is None:  # grequests.map returns None for a failed request
        continue
    html_tree = html.fromstring(response.content)

I've not tested either of those but either should work.
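For what it's worth, here is a slightly fuller sketch of the "iterate the responses" approach, in case it helps to see the pieces together. The names `extract_names`, `fetch_names`, `NAME_XPATH` and `pool_size`, and the 10-second timeout, are made up for illustration. The XPath is simply the one from the question. It assumes `urls` is a list (not a generator), because it is walked a second time to pair each URL with its response. In a real script you may prefer a top-level `import grequests`, placed before any `import requests`.

```python
from lxml import html

# XPath copied from the question; adjust it to the real page layout.
NAME_XPATH = '//span[@class = "header_name"]/text()'


def extract_names(content):
    """Parse raw HTML bytes and return the matching text nodes as a list."""
    html_tree = html.fromstring(content)
    return html_tree.xpath(NAME_XPATH, smart_strings=False)


def fetch_names(urls, pool_size=20):
    """Fetch a list of URLs concurrently and map each URL to its extracted names.

    pool_size is an illustrative value, not a recommendation.
    """
    # Imported here because importing grequests applies gevent's
    # monkey-patching; keeping it local leaves extract_names usable alone.
    import grequests

    unsent_requests = (grequests.get(url, timeout=10) for url in urls)
    # size caps how many requests are in flight at once;
    # map() returns the responses in the same order as the requests.
    responses = grequests.map(unsent_requests, size=pool_size)

    names_by_url = {}
    for url, response in zip(urls, responses):
        if response is None or not response.ok:
            continue  # skip failed requests and error status codes
        names_by_url[url] = extract_names(response.content)
    return names_by_url
```

You would call it as `names_by_url = fetch_names(list_of_urls)`. With 16,000 URLs, capping the pool with `size` is probably kinder to both your machine and the server than firing everything at once.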

Johnny-Courage020 commented 7 years ago

Thanks for the reply! And I'm sorry, no, I guess it's not an issue. Is there a different GitHub feature I should use when asking a question?

orsonadams commented 7 years ago

Hey @Johnny-Courage020, use Stack Overflow for questions like this. Also consider closing this issue.

spyoungtech / grequests

html scraping with grequests #101