rivernews / slack-middleware-server

This server acts as a middleware to communicate with the Slack API.

Some large org not splitted #81

Closed rivernews closed 4 years ago

rivernews commented 4 years ago


Supervisor job parameter

```json
{
  "splittedScraperJobRequestData": {
    "nextReviewPageUrl": "https://www.glassdoor.com/Reviews/eBay-Reviews-E7853_P253.htm",
    "pubsubChannelName": "scraperJobChannel:eBay:0:startAtPage253",
    "orgId": "7853",
    "orgName": "eBay",
    "scrapeMode": "renewal",
    "shardIndex": 9,
    "lastProgress": {
      "processed": 0,
      "wentThrough": 0,
      "total": 2787,
      "durationInMilli": "1",
      "page": 252,
      "processedSession": 0
    }
  }
}
```
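For reference, that payload maps to roughly the following shape. This is a hedged reconstruction: the field names come straight from the JSON above, but the type names are made up here.

```typescript
// Hypothetical type names; field names are taken directly from the
// supervisor job parameter JSON above.
interface ScraperJobProgress {
  processed: number;
  wentThrough: number;
  total: number;           // review count for the whole org (2787 for eBay)
  durationInMilli: string; // serialized as a string in the payload
  page: number;
  processedSession: number;
}

interface SplittedScraperJobRequestData {
  nextReviewPageUrl: string;
  pubsubChannelName: string;
  orgId: string;
  orgName: string;
  scrapeMode: string;  // "renewal" in this payload
  shardIndex: number;  // 0-based; 9 here would mean the 10th (last) split job
  lastProgress: ScraperJobProgress;
}
```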

By this point, eBay already has several review meta records in S3, so the S3 job should be able to get its review count, find that it exceeds the job size, and split it into smaller jobs.

But the went-through number, along with the 3 sessions used, suggests that it's not being split properly. Note that in this S3 session the split size is 280.

Another weird thing is that we have "shardIndex": 9, so maybe it did split properly and this is the last split job. The next review URL is indeed ..._P254.htm, so this job is really scraping reviews #2540~#2787. But it keeps scraping and eventually reaches a went-through count of 2800.

So this might be caused by the way we fixed #75, where we only check whether wentThrough > total. But that check doesn't hold for the last split job, so it may over-scrape. It does scrape all the reviews, but then it starts getting stuck. We either
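Whatever the eventual fix, a shard-aware stop condition might look like the sketch below. It assumes shards partition the org total at fixed splitSize boundaries (which may not match the actual page-based split), and the helper names are hypothetical:

```typescript
// Hypothetical helper: how many reviews this split job is responsible for.
// A middle shard's quota is exactly splitSize; the last shard's quota is
// whatever remains of the org total, which is why a bare
// `wentThrough > total` check misbehaves on the last split job.
function shardQuota(total: number, splitSize: number, shardIndex: number): number {
  const remaining = total - shardIndex * splitSize;
  return Math.min(splitSize, Math.max(remaining, 0));
}

function shouldTerminate(
  wentThrough: number,
  total: number,
  splitSize: number,
  shardIndex: number
): boolean {
  return wentThrough >= shardQuota(total, splitSize, shardIndex);
}

// Under these assumptions, for eBay: total = 2787, splitSize = 280,
// shardIndex = 9 gives shardQuota(2787, 280, 9) === 267, so the last job
// stops after its own share instead of running past the org total.
```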

rivernews commented 4 years ago

Fix and observe

We fixed the next-page logic and also fixed some of the element-locating failure handling, so that the job can be correctly finalized / terminated when no further reviews are available, which indicates no valid next page exists.

Then let's see if the issue goes away in the new benchmark S3 session; i.e., when the S3 job takes too long and the scraper job session count gets too high, it may indicate the scraper gets "stuck" in the next-page logic.
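That heuristic could even be checked automatically. A minimal sketch, with thresholds that are assumptions rather than the actual configuration:

```typescript
// Hypothetical watchdog values; the real limits would come from the
// benchmark session's configuration.
const MAX_JOB_DURATION_MS = 60 * 60 * 1000; // 1 hour, assumed
const MAX_SESSION_COUNT = 3;                // assumed

// Flag a scraper job that is probably stuck in the next-page logic:
// it has run too long AND burned through too many sessions.
function looksStuck(elapsedMs: number, sessionCount: number): boolean {
  return elapsedMs > MAX_JOB_DURATION_MS && sessionCount > MAX_SESSION_COUNT;
}
```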



This issue started coming up after the recent fix to the Java scraper. Why is this happening? It might look like a semaphore object issue, but looking further into the error, the Java scraper fails to access the webdriver: Error communicating with the remote browser. It may have died.

In terms of memory utilization, everything looks fine, so we can exclude the possibility that the failure to access the web driver is due to memory pressure.

The same issue also occurs in local dev. Looks like something's wrong between


We can now isolate and reproduce the issue when

rivernews commented 4 years ago

Benchmark

We have now fixed the logic, so that failure to locate the element is used as the signal that there is no next page. Some things to notice
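The scraper itself is Java, but the idea translates directly; here is a minimal sketch using the Node selenium-webdriver bindings (the selector is hypothetical). The point is to treat a failed element lookup as the normal "no more pages" signal and finalize, instead of letting it surface as a crash:

```typescript
import { By, WebDriver, error } from 'selenium-webdriver';

// Hypothetical selector; the real scraper's next-page locator may differ.
const NEXT_PAGE_SELECTOR = By.css('li.next a');

// Returns the next page URL, or null when the element cannot be located,
// which we now treat as "no valid next page" rather than a failure.
async function findNextPageUrl(driver: WebDriver): Promise<string | null> {
  try {
    const nextLink = await driver.findElement(NEXT_PAGE_SELECTOR);
    return await nextLink.getAttribute('href');
  } catch (e) {
    if (e instanceof error.NoSuchElementError) {
      return null; // no next page: finalize / terminate the job cleanly
    }
    throw e; // anything else is still a real error
  }
}
```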

Other than that, the current fix should resolve all the weird webdriver crash issues, which were likely caused by calling driver.get too frequently.
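If too-frequent driver.get calls are indeed the culprit, one simple guard is to enforce a minimum delay between navigations. A sketch; the interval is an assumption to be tuned, not a value from the actual fix:

```typescript
import { WebDriver } from 'selenium-webdriver';

const MIN_NAVIGATION_INTERVAL_MS = 2000; // assumed pacing, tune as needed
let lastNavigationAt = 0;

// Wrap driver.get so consecutive navigations are spaced out, giving the
// remote browser time to settle between page loads.
async function throttledGet(driver: WebDriver, url: string): Promise<void> {
  const wait = lastNavigationAt + MIN_NAVIGATION_INTERVAL_MS - Date.now();
  if (wait > 0) {
    await new Promise((resolve) => setTimeout(resolve, wait));
  }
  lastNavigationAt = Date.now();
  await driver.get(url);
}
```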

Let's closely observe the benchmark result.

Benchmark Result

Except for several timeouts, everything looks good! We haven't yet dug into the Slack log to check for any low went-through-rate warnings (where we should exclude the last split job), but otherwise things look good overall. Next steps we may