tducret / amazon-scraper-python

Non-official client to get some info about products sold on Amazon
MIT License
871 stars 159 forks

Update extraction of title, price, average rating and number of ratings; other tweaks #27

Closed (jpeacock29 closed 5 years ago)

jpeacock29 commented 5 years ago

Currently, the CSS selectors don't work for title, price, average rating, or number of ratings. I now use regex to extract the price and average rating, which might be more consistent than the CSS selectors. To handle multiple prices, I return the minimum non-zero price, but this could be readily modified.
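
For readers following along, here is a minimal sketch of the regex approach described above. It is not the code from this PR: the patterns, the function names `extract_min_price` and `extract_average_rating`, and the assumption of US-style prices ("$1,299.00") and ratings phrased as "4.5 out of 5" are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical patterns, not taken from the PR: they assume US-style
# prices ("$1,299.00") and ratings phrased as "4.5 out of 5".
_PRICE_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{1,2})?)")
_RATING_RE = re.compile(r"(\d(?:\.\d)?)\s+out of\s+5")


def extract_min_price(text: str) -> Optional[float]:
    """Return the smallest non-zero price found in text, or None."""
    prices = [float(m.replace(",", "")) for m in _PRICE_RE.findall(text)]
    non_zero = [p for p in prices if p > 0]
    return min(non_zero) if non_zero else None


def extract_average_rating(text: str) -> Optional[float]:
    """Return the first 'X out of 5' rating found in text, or None."""
    match = _RATING_RE.search(text)
    return float(match.group(1)) if match else None
```

Returning `None` when nothing matches lets the caller distinguish "no price found" from a genuine price, which matters because zero prices are deliberately skipped.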

I've also taken the liberty of breaking out the extraction of each element into its own function. I simplified the logic of Client._get_products by adding an early return after checking valid_page, hence the big diff there. I added some comments as well.
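
The shape of that refactor can be illustrated roughly as follows. This is only a sketch under stated assumptions: the real `Client._get_products` works on parsed HTML, whereas this example uses plain dictionaries, and the helpers `_extract_title` and `_extract_rating_count`, along with the `get_products` signature, are hypothetical.

```python
from typing import Dict, List, Optional


def _extract_title(raw: Dict[str, str]) -> Optional[str]:
    """Hypothetical helper: return the stripped title, or None if empty."""
    title = raw.get("title", "").strip()
    return title or None


def _extract_rating_count(raw: Dict[str, str]) -> Optional[int]:
    """Hypothetical helper: parse a count such as '1,234' into an int."""
    digits = raw.get("ratings", "").replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def get_products(valid_page: bool, raw_products: List[Dict[str, str]]) -> List[dict]:
    """Return one dict per product, or an empty list for an invalid page."""
    # Early return: bail out first so the main logic is not nested in an if.
    if not valid_page:
        return []
    return [
        {
            "title": _extract_title(raw),
            "rating_count": _extract_rating_count(raw),
        }
        for raw in raw_products
    ]
```

With one function per element, a selector or pattern that stops working can be fixed in isolation without touching the loop.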

tducret commented 5 years ago

Thank you very much for this nice work, Jacob (@jpeacock29). It seems that it is getting harder and harder to avoid the captcha :( Would you have a clever idea for a workaround to include in this PR?

jpeacock29 commented 5 years ago

Perhaps a user agent string could be automatically generated and prepended to _USER_AGENT_LIST? I'd be happy to open another PR for this, as I have a few more commits to push to handle price scraping for more complex scenarios.

jpeacock29 commented 5 years ago

I don't have a fix for the user agent issue besides just expanding the provided list and hoping for the best, or adding an option for the user to pass their own. I think the pull request fulfills its original purpose now though.

tducret commented 5 years ago

Thanks, @jpeacock29. Sorry for the delay ;) I hope we'll find a way to counter the captcha protection.