scrapinghub / python-scrapinghub

A client interface for Scrapinghub's API
https://python-scrapinghub.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

implement new iter_by_chunks() in items #133

Closed BurnzZ closed 4 years ago

BurnzZ commented 4 years ago

Provides a convenient wrapper for #121
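
For context, here is a hedged sketch of how the new method might be called. The `chunksize` parameter name and the values below are assumptions for illustration, not taken from the PR diff:

```python
from scrapinghub import ScrapinghubClient

# Hedged usage sketch; the chunksize parameter name and its default are
# assumptions, not confirmed by this PR.
client = ScrapinghubClient("APIKEY")
job = client.get_job("123/1/56")

for chunk in job.items.iter_by_chunks(chunksize=10000):
    # each chunk is expected to be a list of item dicts, rather than a
    # single item at a time as with items.iter()
    print(len(chunk))
```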

BurnzZ commented 4 years ago

Hi @vshlapakov, I was wondering if there is a guide on adding new VCR cassettes for this repo? The missing cassettes for this new method are what's causing the test failures. Thanks!

BurnzZ commented 4 years ago

Hi @vshlapakov, thanks for linking to #112. I've tried it out and it works well on my local machine. However, when it's run on CI it's going to fail, mostly because when we empty out the env vars HS_ENDPOINT and DASH_ENDPOINT, it's going to use the live endpoints: https://github.com/scrapinghub/python-scrapinghub/blob/master/tests/conftest.py#L68

When running in CI, these aren't emptied out and thus the default endpoints defined here are used: https://github.com/scrapinghub/python-scrapinghub/blob/master/tests/conftest.py#L15.
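
For reference, the endpoint resolution being discussed follows this pattern (a sketch; the fallback URLs below are placeholders, not the actual defaults in tests/conftest.py):

```python
import os

# Sketch of the pattern at the linked conftest.py lines: the test endpoints
# are resolved from environment variables with hard-coded fallbacks, so a
# local run (where the variables are emptied out) and a CI run (where they
# are not) end up pointing at different endpoints.
TEST_ENDPOINT = os.getenv("HS_ENDPOINT", "http://hubstorage.example.test/")
TEST_DASH_ENDPOINT = os.getenv("DASH_ENDPOINT", "http://dash.example.test/api/")
```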

I'm not sure how to proceed with this, as it may entail updating how conftest.py works.

vshlapakov commented 4 years ago

@BurnzZ Thanks for checking! I see what you mean; it seems it doesn't work reliably :confused: I'm short on time right now, but I'll check it in detail next week and update the thread. And Valeriy is right, we still need to support the count/offset logic, so please take a look when you have time.

vshlapakov commented 4 years ago

@BurnzZ Sorry that the issue took so long; I finally found some time to go through the problem carefully and have something to share.

The case is pretty tricky. You created the cassettes correctly, and the normalizing process works fine. The real problem is that the normalization covers only request data, not responses. So the first run-spider request is formed correctly, VCR.py finds the related snapshot and provides the saved response, but that response carries a real job key with a real project ID as part of it. As a result, the subsequent requests made via the related job instance have different paths and can't be matched with the saved snapshots, which leads to the test failures.
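
To make the failure mode concrete, here is a small illustration with made-up keys and paths (not taken from the actual cassettes):

```python
# Hypothetical values illustrating the mismatch described above: the cassette
# was recorded against a normalized project ID, but the replayed response
# carries a real job key, so the follow-up request paths no longer match any
# recorded snapshot.
recorded_items_path = "/items/2222222/1/1"    # path saved in the cassette
real_job_key = "123456/1/1"                   # job key taken from the response
actual_items_path = "/items/" + real_job_key  # path built for the next request

assert recorded_items_path != actual_items_path  # VCR.py finds no match
```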

I made an attempt on top of your changes to fix this specific case by normalizing the job instances in tests; it works fine, so feel free to use it to finish your changes. I have some doubts whether this is the best approach to solve the puzzle, but that's clearly a separate topic which will be addressed later.
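
A rough sketch of what "normalizing the job instances in tests" could look like; the helper name and the fixed TEST_PROJECT_ID value are assumptions for illustration, not the actual commit referenced above:

```python
# Assumed fixed project ID used when the cassettes were recorded.
TEST_PROJECT_ID = "2222222"

def normalized_job(client, job):
    """Return the same job re-addressed under the fixed test project ID,
    so that follow-up request paths match those recorded in the cassettes."""
    _project_id, spider_id, job_id = job.key.split("/")
    return client.get_job("/".join([TEST_PROJECT_ID, spider_id, job_id]))
```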

codecov[bot] commented 4 years ago

Codecov Report

Merging #133 into master will increase coverage by 0.36%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #133      +/-   ##
==========================================
+ Coverage   93.59%   93.96%   +0.36%     
==========================================
  Files          28       28              
  Lines        1921     1938      +17     
==========================================
+ Hits         1798     1821      +23     
+ Misses        123      117       -6
Impacted Files                            Coverage Δ
scrapinghub/client/items.py               100% <100%> (ø) :arrow_up:
scrapinghub/hubstorage/batchuploader.py   94.15% <0%> (+3.5%) :arrow_up:

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 92ff096...9a67373.

BurnzZ commented 4 years ago

Thanks @vshlapakov! I cherry-picked your commit and updated my tests. Everything passes now! 🎉

vshlapakov commented 4 years ago

@BurnzZ Awesome, thanks, it's a pleasure to work with you! :rocket: @manycoding Thanks a lot for help in CR and your suggestions! :raised_hands:

BurnzZ commented 4 years ago

Thanks for all the help @vshlapakov @manycoding! ❤️ Can't wait to use this in most of our projects! 🎉