philipperemy / amazon-reviews-scraper

Yet another multi language scraper for Amazon targeting reviews.
Apache License 2.0
120 stars 42 forks

Set Base URL from command line #1

Closed sughodke closed 6 years ago

sughodke commented 6 years ago

It would be handy to set the Amazon base URL from the command line (or from an environment variable).

Right now the scraper only looks up amazon.co.jp, which is hard-coded in the following files and would need to be refactored out of them:

core_extract_comments.py
9:# https://www.amazon.co.jp/product-reviews/B00Z16VF3E/ref=cm_cr_arp_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=helpful&pageNumber=1
12: return 'https://www.amazon.co.jp/product-reviews/{}/ref=' \
20: url = 'http://www.amazon.co.jp/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=' + \

core_generate_product_ids.py
25: main_category_page = get_soup('https://www.amazon.co.jp/gp/site-directory/ref=nav_shopall_btn')

core_utils.py
64: if 'amazon.co.jp' not in url:
65: url = 'https://www.amazon.co.jp' + url
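One way the requested refactoring could look is sketched below. This is not the project's actual fix. The `--base-url` flag, the `AMAZON_BASE_URL` environment variable, and the helper names `resolve_base_url`, `reviews_url` and `absolute_url` are all hypothetical, and the review URL is a simplified form of the one quoted above.

```python
import argparse
import os
from urllib.parse import urlparse

# The value currently hard-coded in the files listed above.
DEFAULT_BASE_URL = 'https://www.amazon.co.jp'


def resolve_base_url(argv=None, env=None):
    """Pick the base URL: command-line flag first, then ENV, then the default.

    The flag name and the environment variable name are assumptions
    made for this sketch.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('--base-url',
                        default=env.get('AMAZON_BASE_URL', DEFAULT_BASE_URL))
    # parse_known_args ignores any other options the scraper may define.
    args, _unknown = parser.parse_known_args(argv)
    return args.base_url.rstrip('/')


def reviews_url(base_url, product_id, page_number=1):
    """Build a review-page URL (simplified, illustrative query string)."""
    return '{}/product-reviews/{}/?pageNumber={}'.format(
        base_url, product_id, page_number)


def absolute_url(base_url, url):
    """Prefix relative links with the base URL, like the check in core_utils.py."""
    host = urlparse(base_url).netloc
    if host not in url:
        url = base_url + url
    return url


if __name__ == '__main__':
    base = resolve_base_url()
    print(reviews_url(base, 'B00Z16VF3E'))
```

With helpers like these, the scraper could be pointed at another store with something like `AMAZON_BASE_URL=https://www.amazon.com python <script>.py` or by passing `--base-url https://www.amazon.com`, without editing the source files.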


Workaround:

Running the following command in the project directory recursively replaces amazon.co.jp with amazon.com.

find . -type f -exec sed -i 's/amazon.co.jp/amazon.com/g' {} +

philipperemy commented 6 years ago

@sughodke again happy to review any pull request :)

philipperemy commented 6 years ago

Fixed in https://github.com/philipperemy/amazon-reviews-scraper/commit/1b79bf92cd847cef86c978e2063a311bb6f02bd8