rivernews / review-scraper-java-development-environment

An environment to develop review scraper

Data Pipeline: data integrity: potential duplicated review #15

Closed rivernews closed 4 years ago

rivernews commented 4 years ago

Proposed Idea

To help investigate, we may stop taking the shortcut when a duplicated review id is found. We can still proceed without stopping, but alter the written-to S3 filename to something like `...<review-id>-1.json`, adding a trailing number to mark it as a distinct object, i.e. recording the collision. Of course, this does not solve the issue; it is only for investigation purposes. We still need to find out the actual behavior, then decide how to deal with it.
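The suffixing idea above can be sketched as a small helper. This is a minimal sketch, not the scraper's actual code: the method name `reviewKeyFor` and the `existingKeys` set (standing in for a lookup against the S3 archive) are assumptions for illustration.

```java
import java.util.Set;

public class CollisionKey {
    // Hypothetical helper: derive an S3 object key for a review, appending a
    // trailing "-1", "-2", ... when the plain key already exists in the
    // archive, so the collision is recorded instead of overwritten.
    public static String reviewKeyFor(String reviewId, Set<String> existingKeys) {
        String key = reviewId + ".json";
        if (!existingKeys.contains(key)) {
            return key;
        }
        // Mark the collision instead of overwriting or halting the scrape,
        // so duplicates can be inspected later in S3.
        int suffix = 1;
        while (existingKeys.contains(reviewId + "-" + suffix + ".json")) {
            suffix++;
        }
        return reviewId + "-" + suffix + ".json";
    }
}
```

For example, with `30322331.json` already archived, the next write would go to `30322331-1.json`.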

Info on logs

At review #400 (most-recent ordering), the scraper encountered an existing review id, even though the local review count is > 800.

Which review id was found duplicated? Are the two really the same review?

Below is the log:

```
INFO: OK, glassdoor login complete!
12:12
Scraper task started by url: https://www.glassdoor.com/Overview/Working-at-Walmart-eCommerce-EI_IE29449.11,28.htm
12:12
INFO: Basic data parsing completed, elasped time:
00 hours : 00 minutes : 06 seconds . 92 milliseconds
12:12
INFO: Local review count is 864, we will scrape within these reviews.
12:15
INFO: https://www.glassdoor.com/Reviews/Walmart-eCommerce-Reviews-E29449_P22.htm
On this page presents 10 elements
So far processed 216/864 reviews, keep processing for the next 216 reviews ... (processed page count 21)
12:18
INFO: Review already existed in our archive, will not proceed with the rest of reviews since they should already ben archived based on the most-recent ordering.
12:18
INFO: ======= Success! =======
Processed reviews count: 400
Duration: 00 hours : 05 minutes : 42 seconds . 7 milliseconds
```

Relevant S3 directory at Walmart eCommerce.

Investigation

Review #41 got a collision, id 30322331. As you can see, the last review on page 4 and the first review on page 5 are identical. So perhaps it's safe to just ignore the duplicate or overwrite it.
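If the duplicate is just a pagination overlap like this (the same review straddling a page boundary), a session-level seen-set would absorb it without halting the scrape. A minimal sketch, assuming pages arrive as lists of review ids; `dedupeAcrossPages` is a hypothetical name, not the scraper's API:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PageDedup {
    // Collect review ids across pages in order, dropping any id already
    // seen earlier in this scrape session (e.g. the last review of page 4
    // reappearing as the first review of page 5).
    public static List<String> dedupeAcrossPages(List<List<String>> pages) {
        Set<String> seen = new LinkedHashSet<>(); // preserves first-seen order
        for (List<String> page : pages) {
            seen.addAll(page);
        }
        return List.copyOf(seen);
    }
}
```

With this, the overlapping id 30322331 would be processed exactly once rather than triggering the early-stop shortcut.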

We can choose 3 strategies:

rivernews commented 4 years ago

We found a second case here:

```
INFO: OK, glassdoor login complete!
Scraper task started by url: https://www.glassdoor.com/Overview/Working-at-PayPal-EI_IE9848.11,17.htm
INFO: Basic data parsing completed, elasped time:
00 hours : 00 minutes : 04 seconds . 59 milliseconds
INFO: Local review count is 3278, we will scrape within these reviews.
INFO: Review already existed in our archive, will not proceed with the rest of reviews since they should already ben archived based on the most-recent ordering.

INFO: ======= Success! =======
Processed reviews count: 10
Duration: 00 hours : 00 minutes : 18 seconds . 487 milliseconds
```

The case is PayPal at https://www.glassdoor.com/Overview/Working-at-PayPal-EI_IE9848.11,17.htm.

The scraper reports finding an existing (duplicated) review and stops with only 10 reviews scraped, while the total local review count should be around 3278.

We re-ran to verify this. We can confirm that even if we delete all data in S3 and re-run, the result is the same.
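Since the duplicate here appears within the first 10 reviews even against an empty S3 archive, stopping on the very first duplicate is clearly too aggressive. One mitigation is to stop only after several consecutive already-seen reviews. This is a sketch of that policy, not the scraper's implementation; the method name, parameters, and threshold are assumptions:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopPolicy {
    // Process incoming review ids in most-recent order, skipping ones already
    // archived, and stop only after a run of consecutive duplicates. A single
    // repeated review (e.g. one shown twice near the top of the listing) then
    // no longer aborts the whole run. Returns the count of newly processed ids.
    public static int countNewlyProcessed(List<String> incomingIds,
                                          Set<String> archived,
                                          int stopAfterConsecutiveDupes) {
        Set<String> seen = new HashSet<>(archived);
        int processed = 0;
        int consecutiveDupes = 0;
        for (String id : incomingIds) {
            if (!seen.add(id)) {
                consecutiveDupes++;
                if (consecutiveDupes >= stopAfterConsecutiveDupes) {
                    break; // likely genuinely caught up with the archive
                }
            } else {
                consecutiveDupes = 0;
                processed++;
            }
        }
        return processed;
    }
}
```

The threshold trades off robustness against re-scraping a few already-archived reviews near the stop point.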

rivernews commented 4 years ago

Approach

rivernews commented 4 years ago

Both orgs' data is populating now; each org lost < 3% of its reviews.