rivernews / review-scraper-java-development-environment

An environment to develop review scraper

Data pipeline: review data missing. scraper ended successfully, but processed review count << local review count #23

Closed rivernews closed 4 years ago

rivernews commented 4 years ago

Slack message of the scraper log.

Org: groupon

Hint: compare the page number in the URL with the processed review count. Each page contains roughly 10 reviews.

rivernews commented 4 years ago

How to cope with this?

But why?

Eventually we want to find out why we lose so many reviews, or why the scraper did not see the next-page link and exited with code 0.

rivernews commented 4 years ago

Investigation

When we retried groupon, we got around 6xx results. The abort seems to be caused by no stdout output for 10 minutes, because we had set the log level in Travis to 2 (WARNING). Changing it to 3 (INFO) should solve this issue. The cause of the original problem is still not identified.

After running the scraper a second time, the result is now:
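As an alternative to raising the log level, a periodic heartbeat line on stdout would also keep the CI job from looking idle. Below is a minimal sketch of that idea, not what the scraper currently does. The class name `ScraperHeartbeat` and its methods are hypothetical; it only assumes the scraper can call `reviewProcessed()` once per review.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: prints a progress line at a fixed interval so that
// stdout never stays silent for long, regardless of the log level.
public class ScraperHeartbeat implements AutoCloseable {
    private final ScheduledExecutorService scheduler;
    private final AtomicInteger processed = new AtomicInteger();

    public ScraperHeartbeat(long period, TimeUnit unit) {
        // Daemon thread, so the heartbeat never keeps the JVM alive on its own.
        scheduler = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "scraper-heartbeat");
            thread.setDaemon(true);
            return thread;
        });
        scheduler.scheduleAtFixedRate(
            () -> System.out.println("[heartbeat] processed reviews so far: " + processed.get()),
            period, period, unit);
    }

    // Call once for every review the scraper has handled.
    public void reviewProcessed() {
        processed.incrementAndGet();
    }

    public int processedCount() {
        return processed.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Wrapping the scraping loop in `try (ScraperHeartbeat hb = new ScraperHeartbeat(5, TimeUnit.MINUTES)) { ... }` would emit a line every 5 minutes, comfortably inside the 10-minute limit.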

Processed reviews count: 2443/2696
Duration: 0h:32min:22s.905

Judging by the ratio of processed to local reviews, this does not seem like a big deal. However, we do want to verify that there are no further pages.

Based on the page number, we believe we retrieved everything available. This indicates that there is a gap between the displayed local count and the number of reviews actually available.
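The page-number check above can be sketched as simple arithmetic. This is only an illustration using the numbers from this run and the roughly-10-reviews-per-page observation; `ReviewCountCheck` and its method names are hypothetical and not part of the scraper.

```java
// Hypothetical sanity check: compare the last page number seen in the URL
// against the page count implied by the processed and displayed review counts.
public class ReviewCountCheck {
    // Assumption taken from the log observation: about 10 reviews per page.
    static final int REVIEWS_PER_PAGE = 10;

    // Number of pages needed to hold the given number of reviews (ceiling division).
    static int expectedPages(int reviewCount, int reviewsPerPage) {
        return (reviewCount + reviewsPerPage - 1) / reviewsPerPage;
    }

    // Share of the displayed (local) count that was actually processed, in percent.
    static double coveragePercent(int processed, int localCount) {
        return 100.0 * processed / localCount;
    }

    public static void main(String[] args) {
        int processed = 2443;   // processed reviews in the second run
        int localCount = 2696;  // review count shown on the site

        System.out.println("Pages implied by processed count: "
            + expectedPages(processed, REVIEWS_PER_PAGE));   // 245
        System.out.println("Pages implied by local count: "
            + expectedPages(localCount, REVIEWS_PER_PAGE));  // 270
        System.out.println("Missing reviews: " + (localCount - processed)); // 253
        System.out.printf("Coverage: %.1f%%%n", coveragePercent(processed, localCount));
    }
}
```

If the last page number the scraper visited is close to 245 rather than 270, that supports the conclusion that the site simply lists fewer reviews than the displayed count.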