rivernews / review-scraper-java-development-environment

An environment to develop review scraper

Data Pipeline: data integrity: potential duplicated review #15

Closed rivernews closed 4 years ago

rivernews commented 4 years ago

Proposed Idea

To help investigate, we may stop taking the shortcut when a duplicated review id is found. We can still proceed without stopping, but alter the written-to S3 filename to something like `...<review-id>-1.json`, adding a trailing number to mark it as a distinct object, i.e. recording the collision. Of course, this does not solve the issue; it is only for investigation purposes. We still need to find out the actual behavior, then decide how to deal with it.
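The suffixing idea above can be sketched as a small helper. This is a minimal sketch, not the scraper's actual code: the method name `reviewKeyFor` and the `existingKeys` set (standing in for a lookup against the S3 archive) are assumptions for illustration.

```java
import java.util.Set;

public class CollisionKey {
    // Hypothetical helper: derive an S3 object key for a review, appending a
    // trailing "-1", "-2", ... when the plain key already exists in the
    // archive, so the collision is recorded instead of overwritten.
    public static String reviewKeyFor(String reviewId, Set<String> existingKeys) {
        String key = reviewId + ".json";
        if (!existingKeys.contains(key)) {
            return key;
        }
        // Mark the collision instead of overwriting or halting the scrape,
        // so duplicates can be inspected later in S3.
        int suffix = 1;
        while (existingKeys.contains(reviewId + "-" + suffix + ".json")) {
            suffix++;
        }
        return reviewId + "-" + suffix + ".json";
    }
}
```

For example, with `30322331.json` already archived, the next write would go to `30322331-1.json`.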

Info on logs

At review #400 (most-recent ordering), the scraper encountered an existing review id, even though the local review count is > 800.

Which review id was found duplicated? Are the two really the same review?

Below is the log:

```
INFO: OK, glassdoor login complete!
12:12
Scraper task started by url: https://www.glassdoor.com/Overview/Working-at-Walmart-eCommerce-EI_IE29449.11,28.htm
12:12
INFO: Basic data parsing completed, elasped time:
00 hours : 00 minutes : 06 seconds . 92 milliseconds
12:12
INFO: Local review count is 864, we will scrape within these reviews.
12:15
INFO: https://www.glassdoor.com/Reviews/Walmart-eCommerce-Reviews-E29449_P22.htm
On this page presents 10 elements
So far processed 216/864 reviews, keep processing for the next 216 reviews ... (processed page count 21)
12:18
INFO: Review already existed in our archive, will not proceed with the rest of reviews since they should already ben archived based on the most-recent ordering.
12:18
INFO: ======= Success! =======
Processed reviews count: 400
Duration: 00 hours : 05 minutes : 42 seconds . 7 milliseconds
```

Relevant S3 directory at Walmart eCommerce.

Investigation

Review #41 got a collision, id 30322331. As you can see, the last review on page 4 and the first review on page 5 are identical. So perhaps it's safe to just ignore the duplicate or overwrite it.
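If the duplicate is just a pagination overlap like this (the same review straddling a page boundary), a session-level seen-set would absorb it without halting the scrape. A minimal sketch, assuming pages arrive as lists of review ids; `dedupeAcrossPages` is a hypothetical name, not the scraper's API:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PageDedup {
    // Collect review ids across pages in order, dropping any id already
    // seen earlier in this scrape session (e.g. the last review of page 4
    // reappearing as the first review of page 5).
    public static List<String> dedupeAcrossPages(List<List<String>> pages) {
        Set<String> seen = new LinkedHashSet<>(); // preserves first-seen order
        for (List<String> page : pages) {
            seen.addAll(page);
        }
        return List.copyOf(seen);
    }
}
```

With this, the overlapping id 30322331 would be processed exactly once rather than triggering the early-stop shortcut.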

We can choose 3 strategies:

rivernews commented 4 years ago

We found a second case here:

```
INFO: OK, glassdoor login complete!
Scraper task started by url: https://www.glassdoor.com/Overview/Working-at-PayPal-EI_IE9848.11,17.htm
INFO: Basic data parsing completed, elasped time:
00 hours : 00 minutes : 04 seconds . 59 milliseconds
INFO: Local review count is 3278, we will scrape within these reviews.
INFO: Review already existed in our archive, will not proceed with the rest of reviews since they should already ben archived based on the most-recent ordering.

INFO: ======= Success! =======
Processed reviews count: 10
Duration: 00 hours : 00 minutes : 18 seconds . 487 milliseconds
```

The case is PayPal at https://www.glassdoor.com/Overview/Working-at-PayPal-EI_IE9848.11,17.htm.

The scraper reports finding an existing (duplicated) review and stops with only 10 reviews scraped, while the total local review count should be around 3278.

We re-ran to verify this. We can confirm that even if we delete all data in S3 and re-run, the result is the same.
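Since the duplicate here appears within the first 10 reviews even against an empty S3 archive, stopping on the very first duplicate is clearly too aggressive. One mitigation is to stop only after several consecutive already-seen reviews. This is a sketch of that policy, not the scraper's implementation; the method name, parameters, and threshold are assumptions:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopPolicy {
    // Process incoming review ids in most-recent order, skipping ones already
    // archived, and stop only after a run of consecutive duplicates. A single
    // repeated review (e.g. one shown twice near the top of the listing) then
    // no longer aborts the whole run. Returns the count of newly processed ids.
    public static int countNewlyProcessed(List<String> incomingIds,
                                          Set<String> archived,
                                          int stopAfterConsecutiveDupes) {
        Set<String> seen = new HashSet<>(archived);
        int processed = 0;
        int consecutiveDupes = 0;
        for (String id : incomingIds) {
            if (!seen.add(id)) {
                consecutiveDupes++;
                if (consecutiveDupes >= stopAfterConsecutiveDupes) {
                    break; // likely genuinely caught up with the archive
                }
            } else {
                consecutiveDupes = 0;
                processed++;
            }
        }
        return processed;
    }
}
```

The threshold trades off robustness against re-scraping a few already-archived reviews near the stop point.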

rivernews commented 4 years ago

Approach

rivernews commented 4 years ago

Both orgs' data is populating now; each org lost < 3% of its reviews.