Set Base URL from command line

sughodke commented 6 years ago

It would be handy to set the Amazon base URL from the command line (or an environment variable).

Right now the scraper only looks up amazon.co.jp; the hardcoded URLs in these files would need to be refactored:

core_extract_comments.py
9:# https://www.amazon.co.jp/product-reviews/B00Z16VF3E/ref=cm_cr_arp_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=helpful&pageNumber=1
12: return 'https://www.amazon.co.jp/product-reviews/{}/ref=' \
20: url = 'http://www.amazon.co.jp/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=' + \

core_generate_product_ids.py
25: main_category_page = get_soup('https://www.amazon.co.jp/gp/site-directory/ref=nav_shopall_btn')

core_utils.py
64: if 'amazon.co.jp' not in url:
65: url = 'https://www.amazon.co.jp' + url
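
One possible shape for the refactor, as a rough sketch rather than the project's actual code: resolve the base URL once (command-line flag first, then an environment variable, then the current amazon.co.jp default) and pass it to whatever builds URLs. The flag name `--base-url`, the variable `AMAZON_BASE_URL`, and the helpers `resolve_base_url` and `absolutize` are all hypothetical names picked for illustration; `absolutize` only mirrors the check quoted from core_utils.py lines 64-65.

```python
import argparse
import os
from urllib.parse import urlparse

# The value that is currently hardcoded across the three files.
DEFAULT_BASE_URL = 'https://www.amazon.co.jp'


def resolve_base_url(argv=None, env=None):
    """Pick the base URL: --base-url flag, then AMAZON_BASE_URL, then the default.

    Both the flag and the environment variable names are assumptions.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument('--base-url',
                        default=env.get('AMAZON_BASE_URL', DEFAULT_BASE_URL))
    # parse_known_args ignores any other options the scripts may define.
    args, _unknown = parser.parse_known_args(argv)
    return args.base_url.rstrip('/')


def absolutize(url, base_url):
    """Prefix relative links with the base URL (same idea as core_utils.py 64-65)."""
    host = urlparse(base_url).netloc
    if host not in url:
        url = base_url + url
    return url
```

With something like this in place, the hardcoded strings listed above could be built from the resolved value, for example `get_soup(base_url + '/gp/site-directory/ref=nav_shopall_btn')`.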

Workaround:

Running the following command in the project directory recursively replaces amazon.co.jp with amazon.com in every file.

find . -type f -not -path './.git/*' -exec sed -i 's/amazon\.co\.jp/amazon.com/g' {} +

philipperemy commented 6 years ago

@sughodke again happy to review any pull request :)

philipperemy commented 6 years ago

philipperemy / amazon-reviews-scraper