jgalar / CanadianTracker

Canadian Tracker scrapes the Canadian Tire website to provide a price history.

Improve resiliency to connection errors #58

Open jgalar opened 1 year ago

jgalar commented 1 year ago

Lately, price scraping jobs have failed to complete because of various connection errors, notably 502 errors.

Feb 26 09:21:35 vps-c0ce24d7 start.sh[14631]: ERROR:canadiantracker.triangle:Got status code 502 on try 3
Feb 26 09:21:40 vps-c0ce24d7 start.sh[14631]: DEBUG:canadiantracker.triangle:requested 50 product infos
Feb 26 09:21:40 vps-c0ce24d7 start.sh[14631]: DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): apim.canadiantire.ca:443
Feb 26 09:21:41 vps-c0ce24d7 start.sh[14631]: DEBUG:urllib3.connectionpool:https://apim.canadiantire.ca:443 "POST /v1/product/api/v1/product/sku/PriceAvailability/?lang=en_CA&storeId=64 HTTP/1.1" 502 375
Feb 26 09:21:41 vps-c0ce24d7 start.sh[14631]: ERROR:canadiantracker.triangle:Got status code 502 on try 4
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]: Traceback with variables (most recent call last):
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:   File "<string>", line 1, in <module>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       ...skipped... 9 vars
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:     return self.main(*args, **kwargs)
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       self = <Group cli>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       args = ()
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       kwargs = {}
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/click/core.py", line 1055, in main
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:     rv = self.invoke(ctx)
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       self = <Group cli>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       args = ['scrape-prices', '--db-path', '/home/scraper/db.tmp.gh6iBe/inventory.db']
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       prog_name = 'ctscraper'
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       complete_var = None
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       standalone_mode = True
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       windows_expand_args = True
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       extra = {}
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       ctx = <click.core.Context object at 0x7fb4cd39b820>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       _process_result = <function MultiCommand.invoke.<locals>._process_result at 0x7fb4cd4fbf40>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       args = []
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       cmd_name = 'scrape-prices'
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       cmd = <Command scrape-prices>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       sub_ctx = <click.core.Context object at 0x7fb4cba35ba0>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       ctx = <click.core.Context object at 0x7fb4cd39b820>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       self = <Group cli>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       __class__ = <class 'click.core.MultiCommand'>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:     return ctx.invoke(self.callback, **ctx.params)
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       self = <Command scrape-prices>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       ctx = <click.core.Context object at 0x7fb4cba35ba0>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/click/core.py", line 760, in invoke
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:     return __callback(*args, **kwargs)
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       _Context__self = <click.core.Context object at 0x7fb4cba35ba0>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       _Context__callback = <function scrape_prices at 0x7fb4cb3fe3b0>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       args = ()
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       kwargs = {'db_path': '/home/scraper/db.tmp.gh6iBe/inventory.db', 'older_than': 1, 'discard_equal': True}
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       ctx = <click.core.Context object at 0x7fb4cba35ba0>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/CanadianTracker/src/canadiantracker/scraper.py", line 246, in scrape_prices
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:     repository.add_product_price_samples(ledger, discard_equal)
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       db_path = '/home/scraper/db.tmp.gh6iBe/inventory.db'
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       older_than = 1
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       discard_equal = True
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       repository = <canadiantracker.storage.ProductRepository object at 0x7fb4cb5d5c00>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       progress_bar_settings = {'label': 'Scraping prices', 'show_pos': True, 'item_show_func': <function scrape_prices.<locals>.<lambda> at 0x7fb4cb4c4790>, 'bar_template': ''}
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       skus = <click._termui_impl.ProgressBar object at 0x7fb4cb433370>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       ledger = <canadiantracker.triangle.ProductLedger object at 0x7fb4cb4324d0>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/CanadianTracker/src/canadiantracker/storage.py", line 308, in add_product_price_samples
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:     for info in product_infos:
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       self = <canadiantracker.storage.ProductRepository object at 0x7fb4cb5d5c00>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       product_infos = <canadiantracker.triangle.ProductLedger object at 0x7fb4cb4324d0>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       discard_equal = True
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       info = {'_raw_payload': {'code': '4084018', 'active': True, 'sellable': True, 'orderable': False, 'originalPrice': None, 'currentPrice': {'value': Decimal('292.99')}, 'displayWasLabel': False, 'badges': [], 'storeShelfLocation': None, 'fulfillment': {'availability': {'Corporate': {'MinOrderQty': 1, 'bopisETA': {'MinETA': '2023-02-27T00:00:00.000Z', 'MaxETA': '2023-03-03T00:00:00.000Z'}, 'sthETA': {'MinETA': '2023-03-01T00:00:00.000Z', 'MaxETA': '2023-03-06T00:00:00.000Z'}}, 'quantity': 0}, 'storePickUp': {'etaEarliest': None, 'enabled': True}, 'shipToHome': {'etaEarliest': None, 'etaLatest': None, 'enabled': False}, 'expressDelivery': {'enabled': False, 'orderIn': None, 'etaEarliest': None}}, 'partNumber': '171149010', 'feeValue': 3, 'priceMessage': [{'label': None, 'tooltip': None}], 'rebate': None, 'priceValidUntil': None, 'warrantyMessage': 'Passenger and light truck tires purchased, installed and balanced at a Canadian Tire Associate Store are covered by a pro-rated Road Hazard Damage and...
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       price = Decimal('292.99')
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       sku = _StorageSku(code=4084018, formatted_code=408-4018-8, product_index=2)
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       last_sample = _StorageProductSample(index=44312685, sample_time=2023-02-24 08:17:15.719178, sku_index=44, price_cents=29299)
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       equal = True
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       new_sample = _StorageProductSample(index=None, sample_time=2023-02-26 09:21:16.883392, sku_index=None, price_cents=29299)
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/CanadianTracker/src/canadiantracker/triangle.py", line 338, in __iter__
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:     for product_info in self._get_product_infos(batch):
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       self = <canadiantracker.triangle.ProductLedger object at 0x7fb4cb4324d0>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       batch = [_StorageSku(code=4084010, formatted_code=408-4010-4, product_index=2), _StorageSku(code=0081170, formatted_code=008-1170-2, product_index=2), _StorageSku(code=0072349, formatted_code=007-2349-6, product_index=2), _StorageSku(code=0062186, formatted_code=006-2186-4, product_index=2), _StorageSku(code=4084008, formatted_code=408-4008-2, product_index=2), _StorageSku(code=4084009, formatted_code=408-4009-0, product_index=2), _StorageSku(code=4084024, formatted_code=408-4024-2, product_index=2), _StorageSku(code=4084025, formatted_code=408-4025-0, product_index=2), _StorageSku(code=4084022, formatted_code=408-4022-6, product_index=2), _StorageSku(code=4084023, formatted_code=408-4023-4, product_index=2), _StorageSku(code=4089357, formatted_code=408-9357-0, product_index=2), _StorageSku(code=4084028, formatted_code=408-4028-4, product_index=2), _StorageSku(code=4084029, formatted_code=408-4029-2, product_index=2), _StorageSku(code=0081167, formatted_code=008-1167-2, product_index=2), _Stor...
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       product_info = {'_raw_payload': {'code': '4084018', 'active': True, 'sellable': True, 'orderable': False, 'originalPrice': None, 'currentPrice': {'value': Decimal('292.99')}, 'displayWasLabel': False, 'badges': [], 'storeShelfLocation': None, 'fulfillment': {'availability': {'Corporate': {'MinOrderQty': 1, 'bopisETA': {'MinETA': '2023-02-27T00:00:00.000Z', 'MaxETA': '2023-03-03T00:00:00.000Z'}, 'sthETA': {'MinETA': '2023-03-01T00:00:00.000Z', 'MaxETA': '2023-03-06T00:00:00.000Z'}}, 'quantity': 0}, 'storePickUp': {'etaEarliest': None, 'enabled': True}, 'shipToHome': {'etaEarliest': None, 'etaLatest': None, 'enabled': False}, 'expressDelivery': {'enabled': False, 'orderIn': None, 'etaEarliest': None}}, 'partNumber': '171149010', 'feeValue': 3, 'priceMessage': [{'label': None, 'tooltip': None}], 'rebate': None, 'priceValidUntil': None, 'warrantyMessage': 'Passenger and light truck tires purchased, installed and balanced at a Canadian Tire Associate Store are covered by a pro-rated Road Hazard Damage and...
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/CanadianTracker/src/canadiantracker/triangle.py", line 333, in _get_product_infos
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:     raise RuntimeError("Failed to get product info")
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       skus = [_StorageSku(code=4084010, formatted_code=408-4010-4, product_index=2), _StorageSku(code=0081170, formatted_code=008-1170-2, product_index=2), _StorageSku(code=0072349, formatted_code=007-2349-6, product_index=2), _StorageSku(code=0062186, formatted_code=006-2186-4, product_index=2), _StorageSku(code=4084008, formatted_code=408-4008-2, product_index=2), _StorageSku(code=4084009, formatted_code=408-4009-0, product_index=2), _StorageSku(code=4084024, formatted_code=408-4024-2, product_index=2), _StorageSku(code=4084025, formatted_code=408-4025-0, product_index=2), _StorageSku(code=4084022, formatted_code=408-4022-6, product_index=2), _StorageSku(code=4084023, formatted_code=408-4023-4, product_index=2), _StorageSku(code=4089357, formatted_code=408-9357-0, product_index=2), _StorageSku(code=4084028, formatted_code=408-4028-4, product_index=2), _StorageSku(code=4084029, formatted_code=408-4029-2, product_index=2), _StorageSku(code=0081167, formatted_code=008-1167-2, product_index=2), _Stor...
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       ntry = 4
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       url = 'https://apim.canadiantire.ca/v1/product/api/v1/product/sku/PriceAvailability/?lang=en_CA&storeId=64'
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       headers = {'authority': 'apim.canadiantire.ca', 'accept': 'application/json, text/plain, */*', 'accept-language': 'en-US,en;q=0.9', 'bannerid': 'CTR', 'basesiteid': 'CTR', 'ocp-apim-subscription-key': 'c01ef3612328420c9f5cd9277e815a0e', 'origin': 'https://www.canadiantire.ca', 'referer': 'https://www.canadiantire.ca/', 'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"', 'sec-ch-ua-mobile': '?0', 'sec-ch-ua-platform': '"Windows"', 'sec-fetch-dest': 'empty', 'sec-fetch-mode': 'cors', 'sec-fetch-site': 'same-site', 'service-client': 'ctr/web', 'service-version': 'ctc-dev2', 'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:110.0) Gecko/20100101 Firefox/110.0', 'x-web-host': 'www.canadiantire.ca', 'cache-control': 'no-cache', 'pragma': 'no-cache', 'content-type': 'application/json'}
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       body = {'skus': [{'code': '4084010', 'lowStockThreshold': 0}, {'code': '0081170', 'lowStockThreshold': 0}, {'code': '0072349', 'lowStockThreshold': 0}, {'code': '0062186', 'lowStockThreshold': 0}, {'code': '4084008', 'lowStockThreshold': 0}, {'code': '4084009', 'lowStockThreshold': 0}, {'code': '4084024', 'lowStockThreshold': 0}, {'code': '4084025', 'lowStockThreshold': 0}, {'code': '4084022', 'lowStockThreshold': 0}, {'code': '4084023', 'lowStockThreshold': 0}, {'code': '4089357', 'lowStockThreshold': 0}, {'code': '4084028', 'lowStockThreshold': 0}, {'code': '4084029', 'lowStockThreshold': 0}, {'code': '0081167', 'lowStockThreshold': 0}, {'code': '0062806', 'lowStockThreshold': 0}, {'code': '4084026', 'lowStockThreshold': 0}, {'code': '4084027', 'lowStockThreshold': 0}, {'code': '0073667', 'lowStockThreshold': 0}, {'code': '4089356', 'lowStockThreshold': 0}, {'code': '4084020', 'lowStockThreshold': 0}, {'code': '4086683', 'lowStockThreshold': 0}, {'code': '4084021', 'lowStockThreshold': 0}, ...
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]:       response = <Response [502]>
Feb 26 09:21:46 vps-c0ce24d7 start.sh[14631]: builtins.RuntimeError: Failed to get product info
Feb 26 09:21:47 vps-c0ce24d7 start.sh[14631]: Failed to run scrape-prices, aborting job.

SKU scraping jobs also fail for similar reasons.

Feb 26 09:20:56 vps-c0ce24d7 start.sh[14631]: DEBUG:canadiantracker.storage:  SKU 3999158 is already present
Feb 26 09:20:56 vps-c0ce24d7 start.sh[14631]: DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): apim.canadiantire.ca:443
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]: Traceback with variables (most recent call last):
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "<string>", line 1, in <module>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       ...skipped... 9 vars
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     return self.main(*args, **kwargs)
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       self = <Group cli>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       args = ()
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       kwargs = {}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/click/core.py", line 1055, in main
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     rv = self.invoke(ctx)
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       self = <Group cli>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       args = ['scrape-skus', '--db-path', '/home/scraper/db.tmp.gh6iBe/inventory.db']
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       prog_name = 'ctscraper'
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       complete_var = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       standalone_mode = True
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       windows_expand_args = True
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       extra = {}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       ctx = <click.core.Context object at 0x7ff572ef7820>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       _process_result = <function MultiCommand.invoke.<locals>._process_result at 0x7ff573057f40>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       args = []
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       cmd_name = 'scrape-skus'
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       cmd = <Command scrape-skus>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       sub_ctx = <click.core.Context object at 0x7ff5715a1ba0>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       ctx = <click.core.Context object at 0x7ff572ef7820>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       self = <Group cli>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       __class__ = <class 'click.core.MultiCommand'>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     return ctx.invoke(self.callback, **ctx.params)
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       self = <Command scrape-skus>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       ctx = <click.core.Context object at 0x7ff5715a1ba0>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/click/core.py", line 760, in invoke
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     return __callback(*args, **kwargs)
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       _Context__self = <click.core.Context object at 0x7ff5715a1ba0>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       _Context__callback = <function scrape_skus at 0x7ff570f62200>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       args = ()
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       kwargs = {'db_path': '/home/scraper/db.tmp.gh6iBe/inventory.db', 'products': None}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       ctx = <click.core.Context object at 0x7ff5715a1ba0>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/CanadianTracker/src/canadiantracker/scraper.py", line 188, in scrape_skus
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     for sku in triangle.SkusInventory(product):
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       db_path = '/home/scraper/db.tmp.gh6iBe/inventory.db'
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       products = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       repository = <canadiantracker.storage.ProductRepository object at 0x7ff57113dab0>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       products_list = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       progress_bar_settings = {'label': 'Scraping SKUs', 'show_pos': True, 'item_show_func': <function scrape_skus.<locals>.<lambda> at 0x7ff571028790>, 'length': 134849, 'bar_template': ''}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       products_wrapper = <click._termui_impl.ProgressBar object at 0x7ff570f97190>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       i = 124788
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       product = <canadiantracker.storage._StorageProduct object at 0x7ff561006c20>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       sku = <canadiantracker.model.Sku object at 0x7ff5604539d0>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/CanadianTracker/src/canadiantracker/triangle.py", line 256, in __iter__
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     resp = SkusInventory._request_page(self._product.code)
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       self = <canadiantracker.triangle.SkusInventory object at 0x7ff5604537f0>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       ntry = 0
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/CanadianTracker/src/canadiantracker/triangle.py", line 249, in _request_page
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     return requests.get(
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       product_code = '7745620P'
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       headers = {'authority': 'apim.canadiantire.ca', 'accept': 'application/json, text/plain, */*', 'accept-language': 'en-US,en;q=0.9', 'bannerid': 'CTR', 'basesiteid': 'CTR', 'ocp-apim-subscription-key': 'c01ef3612328420c9f5cd9277e815a0e', 'origin': 'https://www.canadiantire.ca', 'referer': 'https://www.canadiantire.ca/', 'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"', 'sec-ch-ua-mobile': '?0', 'sec-ch-ua-platform': '"Windows"', 'sec-fetch-dest': 'empty', 'sec-fetch-mode': 'cors', 'sec-fetch-site': 'same-site', 'service-client': 'ctr/web', 'service-version': 'ctc-dev2', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.56', 'x-web-host': 'www.canadiantire.ca', 'cache-control': 'no-cache', 'pragma': 'no-cache'}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/requests/api.py", line 73, in get
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     return request("get", url, params=params, **kwargs)
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       url = 'https://apim.canadiantire.ca/v1/product/api/v1/product/productFamily/7745620P?baseStoreId=CTR&lang=en_CA&storeId=64'
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       params = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       kwargs = {'headers': {'authority': 'apim.canadiantire.ca', 'accept': 'application/json, text/plain, */*', 'accept-language': 'en-US,en;q=0.9', 'bannerid': 'CTR', 'basesiteid': 'CTR', 'ocp-apim-subscription-key': 'c01ef3612328420c9f5cd9277e815a0e', 'origin': 'https://www.canadiantire.ca', 'referer': 'https://www.canadiantire.ca/', 'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"', 'sec-ch-ua-mobile': '?0', 'sec-ch-ua-platform': '"Windows"', 'sec-fetch-dest': 'empty', 'sec-fetch-mode': 'cors', 'sec-fetch-site': 'same-site', 'service-client': 'ctr/web', 'service-version': 'ctc-dev2', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.56', 'x-web-host': 'www.canadiantire.ca', 'cache-control': 'no-cache', 'pragma': 'no-cache'}}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/requests/api.py", line 59, in request
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     return session.request(method=method, url=url, **kwargs)
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       method = 'get'
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       url = 'https://apim.canadiantire.ca/v1/product/api/v1/product/productFamily/7745620P?baseStoreId=CTR&lang=en_CA&storeId=64'
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       kwargs = {'params': None, 'headers': {'authority': 'apim.canadiantire.ca', 'accept': 'application/json, text/plain, */*', 'accept-language': 'en-US,en;q=0.9', 'bannerid': 'CTR', 'basesiteid': 'CTR', 'ocp-apim-subscription-key': 'c01ef3612328420c9f5cd9277e815a0e', 'origin': 'https://www.canadiantire.ca', 'referer': 'https://www.canadiantire.ca/', 'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"', 'sec-ch-ua-mobile': '?0', 'sec-ch-ua-platform': '"Windows"', 'sec-fetch-dest': 'empty', 'sec-fetch-mode': 'cors', 'sec-fetch-site': 'same-site', 'service-client': 'ctr/web', 'service-version': 'ctc-dev2', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.56', 'x-web-host': 'www.canadiantire.ca', 'cache-control': 'no-cache', 'pragma': 'no-cache'}}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       session = <requests.sessions.Session object at 0x7ff560287700>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     resp = self.send(prep, **send_kwargs)
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       self = <requests.sessions.Session object at 0x7ff560287700>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       method = 'get'
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       url = 'https://apim.canadiantire.ca/v1/product/api/v1/product/productFamily/7745620P?baseStoreId=CTR&lang=en_CA&storeId=64'
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       params = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       data = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       headers = {'authority': 'apim.canadiantire.ca', 'accept': 'application/json, text/plain, */*', 'accept-language': 'en-US,en;q=0.9', 'bannerid': 'CTR', 'basesiteid': 'CTR', 'ocp-apim-subscription-key': 'c01ef3612328420c9f5cd9277e815a0e', 'origin': 'https://www.canadiantire.ca', 'referer': 'https://www.canadiantire.ca/', 'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"', 'sec-ch-ua-mobile': '?0', 'sec-ch-ua-platform': '"Windows"', 'sec-fetch-dest': 'empty', 'sec-fetch-mode': 'cors', 'sec-fetch-site': 'same-site', 'service-client': 'ctr/web', 'service-version': 'ctc-dev2', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.56', 'x-web-host': 'www.canadiantire.ca', 'cache-control': 'no-cache', 'pragma': 'no-cache'}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       cookies = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       files = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       auth = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       timeout = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       allow_redirects = True
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       proxies = {}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       hooks = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       stream = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       verify = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       cert = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       json = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       req = <Request [GET]>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       prep = <PreparedRequest [GET]>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       settings = {'proxies': OrderedDict(), 'stream': False, 'verify': True, 'cert': None}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       send_kwargs = {'timeout': None, 'allow_redirects': True, 'proxies': OrderedDict(), 'stream': False, 'verify': True, 'cert': None}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     r = adapter.send(request, **kwargs)
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       self = <requests.sessions.Session object at 0x7ff560287700>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       request = <PreparedRequest [GET]>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       kwargs = {'timeout': None, 'proxies': OrderedDict(), 'stream': False, 'verify': True, 'cert': None}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       allow_redirects = True
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       stream = False
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       hooks = {'response': []}
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       adapter = <requests.adapters.HTTPAdapter object at 0x7ff560286f20>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       start = 1677403256.396566
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:   File "/home/scraper/.cache/pypoetry/virtualenvs/canadiantracker-lR2ht7gH-py3.10/lib/python3.10/site-packages/requests/adapters.py", line 565, in send
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:     raise ConnectionError(e, request=request)
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       self = <requests.adapters.HTTPAdapter object at 0x7ff560286f20>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       request = <PreparedRequest [GET]>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       stream = False
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       timeout = Timeout(connect=None, read=None, total=None)
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       verify = True
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       cert = None
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       proxies = OrderedDict()
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       conn = <urllib3.connectionpool.HTTPSConnectionPool object at 0x7ff5602874c0>
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       url = '/v1/product/api/v1/product/productFamily/7745620P?baseStoreId=CTR&lang=en_CA&storeId=64'
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]:       chunked = False
Feb 26 09:21:06 vps-c0ce24d7 start.sh[14631]: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='apim.canadiantire.ca', port=443): Max retries exceeded with url: /v1/product/api/v1/product/productFamily/7745620P?baseStoreId=CTR&lang=en_CA&storeId=64 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff5602874f0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
Feb 26 09:21:07 vps-c0ce24d7 start.sh[14631]: Failed to run scrape-skus, continuing...

I'm wondering if we want to make scrapings "resumable". Either we save enough context to take up where we left off or we attempt to only "refresh" the SKUs/categories that weren't scraped for a long while.

Otherwise we can do something simpler and just retry with some kind of exponential back-off until we see the service is back online.
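
For illustration, a minimal sketch of what a retry with exponential back-off could look like. This is not the project's code: `retry_with_backoff`, its parameters and its default values are hypothetical, and `operation` stands in for whatever call performs the request. The cap on the number of tries is an assumption so that the loop cannot run forever.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_tries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation() until it succeeds or max_tries attempts have failed.

    Hypothetical helper: the delay doubles after each failure
    (base_delay, 2 * base_delay, 4 * base_delay, ...), capped at max_delay.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")

    for ntry in range(max_tries):
        try:
            return operation()
        except Exception:
            # A real implementation would probably catch only the errors
            # that are worth retrying (connection errors, 5xx responses).
            if ntry == max_tries - 1:
                raise  # out of tries: let the caller decide what to do
            delay = min(max_delay, base_delay * 2**ntry)
            # Add up to 10% of jitter so retries do not line up exactly.
            sleep(delay + random.uniform(0, delay / 10))

    raise AssertionError("unreachable")
```

The `sleep` parameter is only there so the waiting can be replaced in tests; in normal use the default `time.sleep` would apply.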

simark commented 1 year ago

> I'm wondering if we want to make scrapings "resumable". Either we save enough context to take up where we left off or we attempt to only "refresh" the SKUs/categories that weren't scraped for a long while.

I think the second option is easier to implement. Keep the "last refresh date / time" information, and start with those that haven't been successfully refreshed for the longest time. It would be nice to have regardless.
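
A rough sketch of the ordering described above, assuming each SKU carries a "last successful refresh" timestamp. `SkuRefreshState` and `stalest_first` are hypothetical names for illustration and do not exist in the repository; in practice the ordering would more likely be done by the database query.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class SkuRefreshState:
    """Hypothetical record: a SKU code and when it was last refreshed."""

    code: str
    last_refreshed: Optional[datetime]  # None means "never refreshed"


def stalest_first(states: Iterable[SkuRefreshState]) -> List[SkuRefreshState]:
    """Order SKUs so never-refreshed ones come first, then oldest first."""
    return sorted(
        states,
        key=lambda s: (
            s.last_refreshed is not None,  # False sorts before True
            s.last_refreshed or datetime.min,
        ),
    )
```

With this ordering, a run that gets interrupted has at least refreshed the entries that needed it most, and the next run naturally continues with whatever is still stale.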

> Otherwise we can do something simpler and just retry with some kind of exponential back-off until we see the service is back online.

We have to give up at some point though. If for instance our requests need to be updated, it may never start working again. In that case, we would want to stop the whole run.

I'm also wondering if some specific product or SKU is causing a server-side error. If so, it may never start working, even if we wait, and we would want to skip over it and continue the run.

Whatever solution I can think of, I can also think of a pathological case that makes it less than ideal. I think we'll need to experiment and see what works best. Maybe something like "if we see that 5 calls (e.g. to SkusInventory.__iter__) in a row have failed, then abort the run, because something appears to be seriously off", combined with a way to find out whether some specific products or SKUs appear to always cause a failure.
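
To make that idea concrete, here is a sketch under stated assumptions: `run_with_failure_budget`, `TooManyConsecutiveFailures` and the `failure_counts` counter are hypothetical names, and `scrape_one` stands in for whatever scrapes a single product or SKU. The threshold of 5 is just the number floated above.

```python
from collections import Counter
from typing import Callable, Iterable, List


class TooManyConsecutiveFailures(RuntimeError):
    """Raised when several items in a row fail (hypothetical)."""


def run_with_failure_budget(
    items: Iterable[str],
    scrape_one: Callable[[str], None],
    failure_counts: "Counter[str]",
    max_consecutive_failures: int = 5,
) -> List[str]:
    """Scrape each item, skipping isolated failures.

    Aborts the run when max_consecutive_failures items fail in a row.
    failure_counts is meant to be kept across runs so that items which
    fail every time can be spotted. Returns the items skipped in this run.
    """
    consecutive = 0
    skipped: List[str] = []

    for item in items:
        try:
            scrape_one(item)
        except Exception as exc:
            consecutive += 1
            failure_counts[item] += 1
            skipped.append(item)
            if consecutive >= max_consecutive_failures:
                raise TooManyConsecutiveFailures(
                    f"{consecutive} items failed in a row, aborting"
                ) from exc
        else:
            consecutive = 0
            failure_counts.pop(item, None)  # item works again: forget it

    return skipped
```

An item whose count keeps growing from one run to the next would be a candidate for the "always causes a failure" case, while a burst of consecutive failures points to the service or our requests being broken.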

simark commented 1 year ago

Regarding the second error you pasted:

> Failed to establish a new connection: [Errno -2] Name or service not known'))

That does sound more like a client problem than a server problem. Not sure why it would happen, but it looks different than the first one.
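
If we wanted to treat the two cases differently, one possible way to tell them apart is to look for a `socket.gaierror` (the standard library's name-resolution error) in the exception chain. This is only a sketch: `is_name_resolution_failure` is a hypothetical helper, and whether the exception raised by requests actually keeps the `gaierror` reachable through `__cause__` / `__context__` is an assumption that would need to be checked.

```python
import socket
from typing import Optional, Set


def is_name_resolution_failure(exc: BaseException) -> bool:
    """Return True if a socket.gaierror appears in exc's exception chain.

    Hypothetical helper: follows __cause__ (explicit "raise ... from")
    and __context__ (implicit chaining) links.
    """
    seen: Set[int] = set()
    current: Optional[BaseException] = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against cyclic chains
        if isinstance(current, socket.gaierror):
            return True
        current = current.__cause__ or current.__context__
    return False
```

A failure classified this way is on our side (DNS on the scraping host), so it probably deserves a retry after a pause rather than being counted against a particular product or SKU.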

simark commented 1 year ago

I tried a scrape-prices run and I also see a 502, so it's probably a systematic issue.

Could you consider reviewing and merging PR #55 before we look at this? Getting type checking to work will make every subsequent change easier.