We found a second case here:
INFO: OK, glassdoor login complete!
Scraper task started by url: https://www.glassdoor.com/Overview/Working-at-PayPal-EI_IE9848.11,17.htm
INFO: Basic data parsing completed, elasped time:
00 hours : 00 minutes : 04 seconds . 59 milliseconds
INFO: Local review count is 3278, we will scrape within these reviews.
INFO: Review already existed in our archive, will not proceed with the rest of reviews since they should already ben archived based on the most-recent ordering.
INFO: ======= Success! =======
Processed reviews count: 10
Duration: 00 hours : 00 minutes : 18 seconds . 487 milliseconds
The case is PayPal at https://www.glassdoor.com/Overview/Working-at-PayPal-EI_IE9848.11,17.htm.
The scraper reports finding an existing (duplicated) review and stops after scraping only 10 reviews, while the local review count should be 32xx (3278 according to the log above).
We'll re-run to verify this. Yes, we can confirm it: even if we delete all data in S3 and re-run, the result is the same.
We will move the existing review data out first, into another directory named like archived.reviews.2020-02-10-2148. The aws command is below; just activate a venv, install aws-cli, and set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION.
aws s3 mv "s3://shaungc-qualitative-org-review/Walmart eCommerce-29449/reviews/" "s3://shaungc-qualitative-org-review/Walmart eCommerce-29449/archived.reviews.2020-02-10_2148/" --recursive
aws s3 ls s3://shaungc-qualitative-org-review/C3.ai-312703/meta/\* will do the same as aws s3 ls s3://shaungc-qualitative-org-review/C3.ai-312703/meta --recursive.
Run the scraper again; this time we have more robust logs, so hopefully they will point to the cause. Let's look at Walmart first.
Both orgs' data is populating now; each lost < 3% of its reviews.
Proposed Idea
To help investigate, we may stop using the shortcut when a duplicated review id is found. We can still proceed without stopping, but alter the S3 filename we write to, to something like ...<review-id>-1.json, adding a trailing number to mark it as a collision. Of course, this does not solve the issue; it is only for investigation purposes. We still need to find out the behavior, then think about how to deal with it.
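Below is a minimal sketch of that idea, assuming reviews are archived in S3 as <org>/reviews/<review-id>.json (the bucket name is taken from the aws commands above; the key layout and helper names are assumptions, not the scraper's actual code):

# Sketch: on a duplicate review id, keep scraping and write the new copy
# under a suffixed key (...<review-id>-1.json) instead of stopping early.
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "shaungc-qualitative-org-review"

def key_exists(key):
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False

def write_review_marking_collisions(org_prefix, review_id, review_data):
    base = f"{org_prefix}/reviews/{review_id}"
    key, suffix = f"{base}.json", 0
    # If the id already exists, append -1, -2, ... so the collision stays visible
    while key_exists(key):
        suffix += 1
        key = f"{base}-{suffix}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(review_data))
    return key

This way the run keeps going and every collision is preserved side by side for later comparison.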
Info on logs
At review #400 (most recent), the scraper encountered an existing review id, but the local review count is > 800. Which review id was found duplicated? Are they really the same review?
Below is the log:
Relevant S3 directory at Walmart eCommerce.
Investigation
At review #41 we got a collision, id 30322331. As you can see, the last review on page 4 and the first review on page 5 are identical, so it's probably safe to just neglect or overwrite it. We can choose among these strategies (a rough sketch follows the list):
Always overwrite
Always write the collided review
Skip upon collision
Abort upon collision
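Here is a rough sketch of how these options could be dispatched in the write path; Python is used only for illustration, and the strategy names, key layout, and bucket are assumptions rather than the scraper's actual code:

# Sketch of the strategy options above, assuming reviews are stored as
# <org>/reviews/<review-id>.json in S3 (layout and names are illustrative).
import json
from enum import Enum
import boto3

s3 = boto3.client("s3")
BUCKET = "shaungc-qualitative-org-review"

class CollisionStrategy(Enum):
    OVERWRITE = "overwrite"        # replace the archived copy
    WRITE_COLLIDED = "collided"    # keep both, e.g. under <review-id>-1.json
    SKIP = "skip"                  # neglect the duplicate and keep scraping
    ABORT = "abort"                # current behavior: stop the whole scrape

def handle_collision(strategy, org_prefix, review_id, review_data):
    key = f"{org_prefix}/reviews/{review_id}.json"
    if strategy is CollisionStrategy.OVERWRITE:
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(review_data))
    elif strategy is CollisionStrategy.WRITE_COLLIDED:
        collided_key = key.replace(".json", "-1.json")
        s3.put_object(Bucket=BUCKET, Key=collided_key, Body=json.dumps(review_data))
    elif strategy is CollisionStrategy.SKIP:
        return  # nothing to write; continue with the next review
    else:
        raise RuntimeError(f"Duplicate review id {review_id}, aborting scrape")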
[ ] Issue: cannot do an MD5 check because of a scraper timestamp field in the review data JSON. Suggestion: move the timestamp into the key name instead, so the review data is a pure snapshot of the actual review.
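A hedged sketch of one workaround, assuming the scraper stamps each review JSON with a field such as scrapedTimestamp (the field name is a placeholder for whatever the real field is): strip the timestamp before hashing, so two scrapes of an unchanged review produce the same MD5.

# Sketch: compare review payloads by MD5 while ignoring the scraper's own
# timestamp field. "scrapedTimestamp" stands in for the real field name.
import hashlib
import json

def review_md5(review_data, ignore_fields=("scrapedTimestamp",)):
    stripped = {k: v for k, v in review_data.items() if k not in ignore_fields}
    # Sort keys so the hash does not depend on dict ordering
    canonical = json.dumps(stripped, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def is_same_review(new_review, archived_review):
    return review_md5(new_review) == review_md5(archived_review)

Alternatively, per the suggestion above, the timestamp can be moved into the S3 key name so the stored JSON itself stays a pure snapshot and can be hashed directly.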