rivernews / slack-middleware-server

This server acts as middleware to communicate with the Slack API.

Sometimes cannot proceed to next page #25

Closed rivernews closed 4 years ago

rivernews commented 4 years ago

Action Items

Description

The scraper sometimes fails to recognize the next-page link and therefore marks the entire scraping job as complete. This causes a large loss of data and a low processing rate (30% to 66%).

An initial inspection of the affected page by URL shows that the frontend has changed: it now uses a different DOM hierarchy and different class names.

It is unclear whether this change appears only occasionally or is a recent, permanent change on Glassdoor. Either way, we should establish a way to capture the desired element in the new layout. We can leave the current capture logic in place, because the change might be an A/B test and the old layout could still appear occasionally.
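Keeping the current capture logic while adding a new one can be sketched as an ordered list of matchers that are tried one after another. This is only an illustration in Python over raw HTML, not the scraper's actual code: the class name `pagination-next`, the attribute `data-test="next-page"`, and every function name below are hypothetical, since the real Glassdoor markup is not quoted in this issue.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional

# A matcher decides whether an <a> tag (given its attributes) is the next-page link.
Matcher = Callable[[Dict[str, str]], bool]


class _AnchorCollector(HTMLParser):
    """Collects the attributes of every <a> tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.anchors: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs) -> None:
        if tag == "a":
            # Attribute values can be None (e.g. <a disabled>); normalize to "".
            self.anchors.append({name: (value or "") for name, value in attrs})


# Hypothetical matcher for the layout the current capture logic targets.
def legacy_layout(a: Dict[str, str]) -> bool:
    return "pagination-next" in a.get("class", "").split()


# Hypothetical matcher for the changed DOM hierarchy and class names.
def new_layout(a: Dict[str, str]) -> bool:
    return a.get("data-test") == "next-page"


MATCHERS: List[Matcher] = [legacy_layout, new_layout]


def find_next_page_href(html: str, matchers: List[Matcher]) -> Optional[str]:
    """Return the href of the first anchor accepted by any matcher, else None."""
    collector = _AnchorCollector()
    collector.feed(html)
    for matcher in matchers:  # approaches are tried in priority order
        for anchor in collector.anchors:
            if matcher(anchor) and anchor.get("href"):
                return anchor["href"]
    return None
```

With this shape, supporting another page variation means appending one more matcher to `MATCHERS`, and the older approaches keep working whenever the old layout is served.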

Still, note that some orgs did give the desired result.

rivernews commented 4 years ago

Creating a simple frontend

Ideally we want session auth with social login on express (the linked tutorial uses the EJS rendering engine). However, we want a fast way to send data to the backend with only very basic auth, so we might just use a form with a fixed token.
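The fixed-token check itself is small. The sketch below shows the idea in Python, independent of any web framework; in the express server the same comparison would sit in the form handler. The function name and the way the expected token is supplied are assumptions made for illustration.

```python
import hmac


def is_authorized(submitted_token: str, expected_token: str) -> bool:
    """Return True only if the token from the form equals the fixed token."""
    # Reject empty values so a missing form field or unset config never passes.
    if not submitted_token or not expected_token:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(
        submitted_token.encode("utf-8"),
        expected_token.encode("utf-8"),
    )
```

The expected token would normally come from configuration (for example an environment variable) rather than being hard-coded in the source.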

After we've done this, we can continue investigating the interrupted issue.

Supplementary resources

rivernews commented 4 years ago

After retry:

Added a new approach for capturing the next-page link in https://github.com/rivernews/review-scraper-java-development-environment/pull/28. Let's retry Amazon.

Amazon now reaches 87%, which is nice, but there is still an issue. Looking at the dumped HTML, we see it is indeed another variation of the webpage.

Finally we got 92.2%. While that is not close to 97% or so, the next-page link is indeed grayed out on the last processed review page, which suggests the scraper did reach the final page.

We may still encounter a case where every next-link approach fails. When that happens, we will tackle it the same way: download the HTML from S3, look at the structure, develop a new approach if necessary, then resume the job and check again, using the SLK frontend to trigger the renewal job.
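One way to keep such a failure from being recorded as a finished job is to separate "the next link is grayed out" from "no approach recognized a next link". The Python sketch below illustrates that distinction only; the enum, the function names, and the action strings are all hypothetical and do not describe the scraper's current behavior.

```python
from enum import Enum
from typing import Optional


class PageOutcome(Enum):
    HAS_NEXT = "has_next"          # a usable next-page link was found
    LAST_PAGE = "last_page"        # next link exists but is grayed out
    UNRECOGNIZED = "unrecognized"  # every next-link approach failed


def classify_page(next_href: Optional[str], next_link_disabled: bool) -> PageOutcome:
    """Classify a review page from what the next-link approaches reported."""
    if next_link_disabled:
        return PageOutcome.LAST_PAGE
    if next_href:
        return PageOutcome.HAS_NEXT
    return PageOutcome.UNRECOGNIZED


def next_action(outcome: PageOutcome) -> str:
    """Map an outcome to what the job should do next (names are illustrative)."""
    if outcome is PageOutcome.HAS_NEXT:
        return "continue"
    if outcome is PageOutcome.LAST_PAGE:
        return "finish"
    # Unknown layout: keep the HTML for inspection and leave the job resumable.
    return "dump_html_and_pause"
```

Only the `LAST_PAGE` outcome ends the job; an unrecognized page leaves it in a state that can be resumed once a new approach has been added.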