Closed: rivernews closed this issue 4 years ago
Ideally we want session auth with social login on Express (the linked tutorial uses the EJS rendering engine). However, we want a fast way to just send data to the backend with some very basic auth, so we might just use a form with a fixed token:
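A minimal sketch of what that fixed-token guard could look like as Express middleware. The token source, header name, and route are all assumptions for illustration, not part of the repo:

```javascript
// Sketch: fixed-token auth guard for an Express route.
// SCRAPER_TOKEN and the 'x-auth-token' header name are assumptions.
const FIXED_TOKEN = process.env.SCRAPER_TOKEN || 'change-me';

function requireFixedToken(req, res, next) {
  // Accept the token either from a submitted form field or a header.
  const token = (req.body && req.body.token) || req.headers['x-auth-token'];
  if (token === FIXED_TOKEN) return next();
  res.status(401).json({ error: 'invalid token' });
}

// Hypothetical usage (assumes express and urlencoded body parsing are set up):
// const app = require('express')();
// app.use(require('express').urlencoded({ extended: false }));
// app.post('/data', requireFixedToken, (req, res) => res.json({ ok: true }));

module.exports = { requireFixedToken };
```

This is obviously weaker than session auth with social login, but it is enough to unblock sending data to the backend while we investigate.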
After we've done this, we can continue investigating the interruption issue.
Test data for apple:
orgId: 1138
orgName: Apple
lastProgress.processed: 4413
lastProgress.wentThrough: 4600
lastProgress.total: 15407
lastProgress.durationInMilli: 5430000
lastProgress.page: 460 // `lastReviewPage` is the next link, so take the page number from that URL and subtract one
lastProgress.processedSession: 3
lastReviewPage: https://www.glassdoor.com/Reviews/Apple-Reviews-E1138_P461.htm // actually the next link
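Since `lastReviewPage` stores the *next* link, `lastProgress.page` can be derived by parsing the `_P<number>.htm` suffix of that URL and subtracting one. A small sketch (the helper name is ours, not from the codebase):

```javascript
// Derive the last processed page from the stored next-page link.
// Glassdoor review URLs end with `_P<page>.htm`.
function lastProcessedPage(lastReviewPage) {
  const match = lastReviewPage.match(/_P(\d+)\.htm$/);
  if (!match) throw new Error('no page number in URL: ' + lastReviewPage);
  // The URL points at the *next* page, so subtract one.
  return parseInt(match[1], 10) - 1;
}

// With the Apple test data above:
// lastProcessedPage('https://www.glassdoor.com/Reviews/Apple-Reviews-E1138_P461.htm')
// → 460
```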
Test data for Microsoft:
orgId: 1651
orgName: Microsoft
lastProgress.processed: 13770
lastProgress.wentThrough: 14040
lastProgress.total: 20732
lastProgress.durationInMilli: 18976000 ... `(5*60+16+16/60) min` = 5 h 16 min 16 s, converted to milliseconds
lastProgress.page: 1404
lastProgress.processedSession: 8
lastReviewPage: https://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651_P1405.htm
Test data for Amazon:
orgId: 6036
orgName: Amazon
lastProgress.processed: 23157
lastProgress.wentThrough: 24130
lastProgress.total: 37231
lastProgress.durationInMilli: 29818000 ... `(8*60+16+58/60) min` = 8 h 16 min 58 s, converted to milliseconds
lastProgress.page: 2413
lastProgress.processedSession: 12
lastReviewPage: https://www.glassdoor.com/Reviews/Amazon-Reviews-E6036_P2414.htm
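The `durationInMilli` values above follow a simple h/m/s-to-milliseconds conversion; a helper (our own naming) makes the arithmetic in the notes explicit:

```javascript
// Convert an h/m/s duration to milliseconds, matching the notes above,
// e.g. Microsoft: 5 h 16 min 16 s = (5*60 + 16 + 16/60) min = 18976000 ms.
function durationToMilli(hours, minutes, seconds) {
  return ((hours * 60 + minutes) * 60 + seconds) * 1000;
}

// durationToMilli(5, 16, 16) → 18976000  (Microsoft)
// durationToMilli(8, 16, 58) → 29818000  (Amazon)
```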
After retry:
Added a new approach for capturing the next page link (https://github.com/rivernews/review-scraper-java-development-environment/pull/28); let's retry Amazon.
Got 87% on Amazon now, which is nice, but there are still some issues. After looking at the dumped HTML, we can see it is indeed another variation of the webpage.
Finally we got to 92.2%. While that's not close to 97%, when we visit the last processed review page, the next page link is indeed grayed out.
We may still encounter a case where all next-link approaches fail. When that happens, we will tackle it the same way: download the HTML from S3, inspect the structure, develop a new approach if necessary, then resume the job and check again -- using the SLK frontend to trigger the renewal job.
Action Items
Description
The scraper sometimes did not recognize the next page link, so it marked the entire scraping run as complete. This caused a large loss of data and a low processing rate (30%–66%).
An initial inspection of the current webpage by URL shows that the frontend has changed and now uses a different DOM hierarchy and class names.
We are not sure whether this new markup only appears occasionally, or whether it is a recent permanent change on Glassdoor. Either way, we should establish a way to capture the desired element. We can still leave the current capture logic as-is, because the old markup might be part of an A/B test and could still appear occasionally.
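One way to keep the old capture logic while adding the new one is a fallback chain of selectors: try each markup variant in order and take the first hit. The selectors below are placeholders, not the actual class names from either Glassdoor variant, and `query` abstracts whatever lookup the scraper uses (e.g. a Selenium/jsoup call in the Java scraper):

```javascript
// Sketch: try a list of selectors in order, return the first match.
// Both selectors are placeholders for the old and new markup variants.
const NEXT_LINK_SELECTORS = [
  'li.next a',                     // old markup (placeholder)
  'button[data-test="next-page"]', // new markup variant (placeholder)
];

function findNextPageLink(query, selectors = NEXT_LINK_SELECTORS) {
  for (const selector of selectors) {
    const el = query(selector);
    if (el) return el; // first matching variant wins
  }
  return null; // all approaches failed: dump the HTML and add a new selector
}
```

If every selector misses, we fall back to the recovery procedure above: dump the HTML to S3, inspect it, and extend the list.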
Still, do note that some orgs did give the desired result: