rivernews / slack-middleware-server

This server acts as a middleware to communicate with the Slack API.

Some large org not splitted #81

Closed rivernews closed 4 years ago

rivernews commented 4 years ago


Supervisor job parameter

```json
{
  "splittedScraperJobRequestData": {
    "nextReviewPageUrl": "https://www.glassdoor.com/Reviews/eBay-Reviews-E7853_P253.htm",
    "pubsubChannelName": "scraperJobChannel:eBay:0:startAtPage253",
    "orgId": "7853",
    "orgName": "eBay",
    "scrapeMode": "renewal",
    "shardIndex": 9,
    "lastProgress": {
      "processed": 0,
      "wentThrough": 0,
      "total": 2787,
      "durationInMilli": "1",
      "page": 252,
      "processedSession": 0
    }
  }
}
```
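For reference, that payload maps to roughly the following shape. This is a hedged reconstruction: the field names come straight from the JSON above, but the type names are made up here.

```typescript
// Hypothetical type names; field names are taken directly from the
// supervisor job parameter JSON above.
interface ScraperJobProgress {
  processed: number;
  wentThrough: number;
  total: number;           // review count for the whole org (2787 for eBay)
  durationInMilli: string; // serialized as a string in the payload
  page: number;
  processedSession: number;
}

interface SplittedScraperJobRequestData {
  nextReviewPageUrl: string;
  pubsubChannelName: string;
  orgId: string;
  orgName: string;
  scrapeMode: string;  // "renewal" in this payload
  shardIndex: number;  // 0-based; 9 here would mean the 10th (last) split job
  lastProgress: ScraperJobProgress;
}
```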

By this point, eBay already has several review meta records in S3, so the S3 job should be able to get its review count, find that it exceeds the job size, and split it into smaller jobs.

But the went-through number, along with the 3 sessions used, suggests that it's not being split properly. Note that in this S3 session the split size is 280.

Another weird thing is that we have "shardIndex": 9, so maybe it did split properly and this is the last split job. The next review URL is indeed ..._P254.htm, so this job is really scraping reviews #2540~#2787. But it keeps scraping and eventually reaches a went-through count of 2800.

So this might be caused by the way we fixed #75, where we only check whether wentThrough > total. But that check doesn't hold for the last split job, so it may over-scrape. It does scrape all the reviews, but then it starts getting stuck. We either
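Whatever the eventual fix, a shard-aware stop condition might look like the sketch below. It assumes shards partition the org total at fixed splitSize boundaries (which may not match the actual page-based split), and the helper names are hypothetical:

```typescript
// Hypothetical helper: how many reviews this split job is responsible for.
// A middle shard's quota is exactly splitSize; the last shard's quota is
// whatever remains of the org total, which is why a bare
// `wentThrough > total` check misbehaves on the last split job.
function shardQuota(total: number, splitSize: number, shardIndex: number): number {
  const remaining = total - shardIndex * splitSize;
  return Math.min(splitSize, Math.max(remaining, 0));
}

function shouldTerminate(
  wentThrough: number,
  total: number,
  splitSize: number,
  shardIndex: number
): boolean {
  return wentThrough >= shardQuota(total, splitSize, shardIndex);
}

// Under these assumptions, for eBay: total = 2787, splitSize = 280,
// shardIndex = 9 gives shardQuota(2787, 280, 9) === 267, so the last job
// stops after its own share instead of running past the org total.
```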

rivernews commented 4 years ago

Fix and observe

We fixed the next-page logic and also fixed some of the element-locating failure handling, so that the job can be correctly finalized / terminated when no further reviews are available, which indicates no valid next page exists.

Then let's see if the issue goes away in the new benchmark S3 session; i.e., when the S3 job takes too long and the scraper job session count gets too high, it may indicate the scraper gets "stuck" in the next-page logic.
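That heuristic could even be checked automatically. A minimal sketch, with thresholds that are assumptions rather than the actual configuration:

```typescript
// Hypothetical watchdog values; the real limits would come from the
// benchmark session's configuration.
const MAX_JOB_DURATION_MS = 60 * 60 * 1000; // 1 hour, assumed
const MAX_SESSION_COUNT = 3;                // assumed

// Flag a scraper job that is probably stuck in the next-page logic:
// it has run too long AND burned through too many sessions.
function looksStuck(elapsedMs: number, sessionCount: number): boolean {
  return elapsedMs > MAX_JOB_DURATION_MS && sessionCount > MAX_SESSION_COUNT;
}
```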



This issue started coming up after the recent fix to the Java scraper. Why is this happening? It might look like a semaphore object issue, but looking further into the error, the Java scraper fails to access the webdriver: Error communicating with the remote browser. It may have died.

In terms of memory utilization, everything looks fine, so we can exclude the possibility that the failure to access the web driver is due to memory pressure.

The same issue also occurs in local dev. Looks like something's wrong between


We can now isolate and reproduce the issue when

rivernews commented 4 years ago

Benchmark

We have now fixed the logic, so that failure to locate the element is used as the signal that there is no next page. Some things to notice
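The scraper itself is Java, but the idea translates directly; here is a minimal sketch using the Node selenium-webdriver bindings (the selector is hypothetical). The point is to treat a failed element lookup as the normal "no more pages" signal and finalize, instead of letting it surface as a crash:

```typescript
import { By, WebDriver, error } from 'selenium-webdriver';

// Hypothetical selector; the real scraper's next-page locator may differ.
const NEXT_PAGE_SELECTOR = By.css('li.next a');

// Returns the next page URL, or null when the element cannot be located,
// which we now treat as "no valid next page" rather than a failure.
async function findNextPageUrl(driver: WebDriver): Promise<string | null> {
  try {
    const nextLink = await driver.findElement(NEXT_PAGE_SELECTOR);
    return await nextLink.getAttribute('href');
  } catch (e) {
    if (e instanceof error.NoSuchElementError) {
      return null; // no next page: finalize / terminate the job cleanly
    }
    throw e; // anything else is still a real error
  }
}
```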

Other than that, the current fix should resolve all the weird webdriver crash issues, which were likely caused by calling driver.get too frequently.
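If too-frequent driver.get calls are indeed the culprit, one simple guard is to enforce a minimum delay between navigations. A sketch; the interval is an assumption to be tuned, not a value from the actual fix:

```typescript
import { WebDriver } from 'selenium-webdriver';

const MIN_NAVIGATION_INTERVAL_MS = 2000; // assumed pacing, tune as needed
let lastNavigationAt = 0;

// Wrap driver.get so consecutive navigations are spaced out, giving the
// remote browser time to settle between page loads.
async function throttledGet(driver: WebDriver, url: string): Promise<void> {
  const wait = lastNavigationAt + MIN_NAVIGATION_INTERVAL_MS - Date.now();
  if (wait > 0) {
    await new Promise((resolve) => setTimeout(resolve, wait));
  }
  lastNavigationAt = Date.now();
  await driver.get(url);
}
```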

Let's closely observe the benchmark result.

Benchmark Result

Except for several timeouts, everything looks good! We haven't yet dug into the Slack log to check for any low went-through-rate warnings (where we should exclude the last split job), but otherwise things look good overall. Next steps we may