Closed: rivernews closed this issue 4 years ago
Ideally we want session auth with social login on Express (the linked tutorial uses the EJS rendering engine). However, we want a fast way to just send data to the backend with some very basic auth, so we might just use a form with a fixed token:
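A minimal sketch of what that fixed-token guard could look like as Express middleware. The token source, header name, and route are all assumptions for illustration, not part of the repo:

```javascript
// Sketch: fixed-token auth guard for an Express route.
// SCRAPER_TOKEN and the 'x-auth-token' header name are assumptions.
const FIXED_TOKEN = process.env.SCRAPER_TOKEN || 'change-me';

function requireFixedToken(req, res, next) {
  // Accept the token either from a submitted form field or a header.
  const token = (req.body && req.body.token) || req.headers['x-auth-token'];
  if (token === FIXED_TOKEN) return next();
  res.status(401).json({ error: 'invalid token' });
}

// Hypothetical usage (assumes express and urlencoded body parsing are set up):
// const app = require('express')();
// app.use(require('express').urlencoded({ extended: false }));
// app.post('/data', requireFixedToken, (req, res) => res.json({ ok: true }));

module.exports = { requireFixedToken };
```

This is obviously weaker than session auth with social login, but it is enough to unblock sending data to the backend while we investigate.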
After we've done this, we can continue investigating the interruption issue.
Test data for apple:
orgId: 1138
orgName: Apple
lastProgress.processed: 4413
lastProgress.wentThrough: 4600
lastProgress.total: 15407
lastProgress.durationInMilli: 5430000
lastProgress.page: 460 // `lastReviewPage` is the next link, so take the page number from that URL and subtract one
lastProgress.processedSession: 3
lastReviewPage: https://www.glassdoor.com/Reviews/Apple-Reviews-E1138_P461.htm // actually the next link
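Since `lastReviewPage` stores the *next* link, `lastProgress.page` can be derived by parsing the `_P<number>.htm` suffix of that URL and subtracting one. A small sketch (the helper name is ours, not from the codebase):

```javascript
// Derive the last processed page from the stored next-page link.
// Glassdoor review URLs end with `_P<page>.htm`.
function lastProcessedPage(lastReviewPage) {
  const match = lastReviewPage.match(/_P(\d+)\.htm$/);
  if (!match) throw new Error('no page number in URL: ' + lastReviewPage);
  // The URL points at the *next* page, so subtract one.
  return parseInt(match[1], 10) - 1;
}

// With the Apple test data above:
// lastProcessedPage('https://www.glassdoor.com/Reviews/Apple-Reviews-E1138_P461.htm')
// → 460
```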
Test data for Microsoft:
orgId: 1651
orgName: Microsoft
lastProgress.processed: 13770
lastProgress.wentThrough: 14040
lastProgress.total: 20732
lastProgress.durationInMilli: 18976000 ... `(5*60+16+16/60) min` = 5 h 16 min 16 s, converted to milliseconds
lastProgress.page: 1404
lastProgress.processedSession: 8
lastReviewPage: https://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651_P1405.htm
Test data for Amazon:
orgId: 6036
orgName: Amazon
lastProgress.processed: 23157
lastProgress.wentThrough: 24130
lastProgress.total: 37231
lastProgress.durationInMilli: 29818000 ... `(8*60+16+58/60) min` = 8 h 16 min 58 s, converted to milliseconds
lastProgress.page: 2413
lastProgress.processedSession: 12
lastReviewPage: https://www.glassdoor.com/Reviews/Amazon-Reviews-E6036_P2414.htm
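The `durationInMilli` values above follow a simple h/m/s-to-milliseconds conversion; a helper (our own naming) makes the arithmetic in the notes explicit:

```javascript
// Convert an h/m/s duration to milliseconds, matching the notes above,
// e.g. Microsoft: 5 h 16 min 16 s = (5*60 + 16 + 16/60) min = 18976000 ms.
function durationToMilli(hours, minutes, seconds) {
  return ((hours * 60 + minutes) * 60 + seconds) * 1000;
}

// durationToMilli(5, 16, 16) → 18976000  (Microsoft)
// durationToMilli(8, 16, 58) → 29818000  (Amazon)
```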
After retry:
Added a new approach for capturing the next page link (https://github.com/rivernews/review-scraper-java-development-environment/pull/28); let's retry Amazon.
Got 87% on Amazon now, which is nice, but there are still some issues. After looking at the dumped HTML, we can see it is indeed another variation of the webpage.
Finally we got to 92.2%. While that's not close to 97%, when we visit the last processed review page, the next page link is indeed grayed out.
We may still encounter a case where all next-link approaches fail. When that happens, we will tackle it the same way: download the HTML from S3, inspect the structure, develop a new approach if necessary, then resume the job and check again -- using the SLK frontend to trigger the renewal job.
Action Items
Description
The scraper sometimes did not recognize the next page link, so it marked the entire scraping run as complete. This caused a large loss of data and a low processing rate (30%–66%).
An initial inspection of the current webpage by URL shows that the frontend has changed and now uses a different DOM hierarchy and class names.
We are not sure whether this new markup only appears occasionally, or whether it is a recent permanent change on Glassdoor. Either way, we should establish a way to capture the desired element. We can still leave the current capture logic as-is, because the old markup might be part of an A/B test and could still appear occasionally.
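One way to keep the old capture logic while adding the new one is a fallback chain of selectors: try each markup variant in order and take the first hit. The selectors below are placeholders, not the actual class names from either Glassdoor variant, and `query` abstracts whatever lookup the scraper uses (e.g. a Selenium/jsoup call in the Java scraper):

```javascript
// Sketch: try a list of selectors in order, return the first match.
// Both selectors are placeholders for the old and new markup variants.
const NEXT_LINK_SELECTORS = [
  'li.next a',                     // old markup (placeholder)
  'button[data-test="next-page"]', // new markup variant (placeholder)
];

function findNextPageLink(query, selectors = NEXT_LINK_SELECTORS) {
  for (const selector of selectors) {
    const el = query(selector);
    if (el) return el; // first matching variant wins
  }
  return null; // all approaches failed: dump the HTML and add a new selector
}
```

If every selector misses, we fall back to the recovery procedure above: dump the HTML to S3, inspect it, and extend the list.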
Still, do note that some orgs did give the desired result: