We fixed the next page logic and also fixed some of the element-locating failure handling, so that the job can be correctly finalized / terminated when no further reviews are available, which indicates no valid next page exists.
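Roughly, the idea is to treat a failed lookup of the next-page element as a normal "no more reviews" signal instead of an error. A minimal sketch, assuming the scraper uses Selenium WebDriver; the selector and method names below are hypothetical, not the scraper's actual code:

```java
import java.util.List;
import java.util.Optional;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class NextPageCheck {
    // Hypothetical selector for the "next page" link on a review page.
    private static final By NEXT_PAGE_LINK = By.cssSelector("a.pagination__next");

    /**
     * Treat "element not found" as a normal signal that there is no next page,
     * so the caller can finalize the job instead of crashing the session.
     */
    public static Optional<WebElement> locateNextPageLink(WebDriver driver) {
        List<WebElement> links = driver.findElements(NEXT_PAGE_LINK);
        return links.isEmpty() ? Optional.empty() : Optional.of(links.get(0));
    }
}
```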
Then let's see if this issue goes away in the new benchmark s3 session, i.e. when the s3 job takes too long and the scraper job session count gets too high, it may indicate the scraper gets "stuck" in the next page logic.
This issue started coming up after the recent fix of the java scraper. Why is this happening? It might look like a semaphore object issue, but looking further into the error, it shows the java scraper failing to access the webdriver: `Error communicating with the remote browser. It may have died.`
In terms of memory utilization, it looks fine, so we can exclude the possibility that the failure to access the webdriver is due to memory pressure.
The same issue also occurs in local dev. Looks like something's wrong between the java scraper and the webdriver.
We can now isolate and reproduce the issue: job `17` will always throw an error from the java scraper first, `Error communicating with the remote browser. It may have died.` While it's hard to trace, it always happens at the last split job!
Job `17` failed. For job `17`, an error occurred saying the previous `sessionSemaphoreCollection` is not cleaned up. Is that error coming from the java scraper webdriver error, or is it the following job? Job `17` reported a timeout.

We now fix the logic so that failing to locate the element is used as the measure to detect whether a next page exists. Some things to notice:
- Watch out for the `low went through rate` warning. It means the failure-based detection fails where it should not, that is, there is still a next page following but the logic failed to detect it; there are reviews not yet scraped, and we missed them in this case.

Other than that, the current fix should resolve all the weird webdriver crash issues, which are likely caused by too-frequent calls to `driver.get`.
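As an illustration only (not the actual fix in the scraper), navigation could also be throttled so that back-to-back `driver.get` calls don't hammer the remote browser; the minimum interval here is an arbitrary assumption:

```java
import org.openqa.selenium.WebDriver;

public class ThrottledNavigator {
    private static final long MIN_INTERVAL_MS = 1_000; // arbitrary assumption

    private final WebDriver driver;
    private long lastNavigationAt = 0;

    public ThrottledNavigator(WebDriver driver) {
        this.driver = driver;
    }

    /** Navigate to url, sleeping first if the previous driver.get was too recent. */
    public void get(String url) throws InterruptedException {
        long elapsed = System.currentTimeMillis() - lastNavigationAt;
        if (elapsed < MIN_INTERVAL_MS) {
            Thread.sleep(MIN_INTERVAL_MS - elapsed);
        }
        driver.get(url);
        lastNavigationAt = System.currentTimeMillis();
    }
}
```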
Let's closely observe the benchmark result.
Except for several timeouts, everything looks good! I haven't dug into the slack log to see if there's any `low went through rate` warning (but we should exclude the last split job there); otherwise overall it looks good. Next steps we may look into:

- Supervisor job parameter
At the time being, eBay already has several review meta records in s3, so the s3 job should be able to get its review count, find that it exceeds the job size, and then split it into smaller jobs. But the went-through number, along with the 3 sessions used, kind of indicates that it's not being split properly. BTW, in this s3 session we are using a split size of 280.
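For reference, the splitting described above could look roughly like the sketch below; the actual splitting happens in the s3 job, only the split size of 280 comes from this session, and the class / field names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class JobSplitter {
    public static final int SPLIT_SIZE = 280; // split size used in this s3 session

    /** Hypothetical description of one split scraper job. */
    public record ScraperJob(int shardIndex, int startReviewIndex, int endReviewIndexExclusive) {}

    /** Split a company's total review count into shard jobs of at most SPLIT_SIZE reviews. */
    public static List<ScraperJob> splitIntoJobs(int totalReviewCount) {
        List<ScraperJob> jobs = new ArrayList<>();
        int shardIndex = 0;
        for (int start = 0; start < totalReviewCount; start += SPLIT_SIZE) {
            int end = Math.min(start + SPLIT_SIZE, totalReviewCount);
            jobs.add(new ScraperJob(shardIndex++, start, end));
        }
        return jobs;
    }

    public static void main(String[] args) {
        // e.g. 2787 reviews -> 10 shards, shardIndex 0..9, last shard smaller than SPLIT_SIZE
        splitIntoJobs(2787).forEach(System.out::println);
    }
}
```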
Another weird thing is we have `"shardIndex": 9`, so maybe it does split properly and that's the last split job. We see the next review url is indeed `..._P254.htm`, so this job is really scraping reviews #2540~#2787. But then we see it keeps scraping and eventually has a went-through of 2800. So this might be caused by the way we fixed #75, where we only check if `wentThrough > total`; that doesn't hold for the last split job, so it may over-scrape. It does scrape all the reviews, but then it starts to get stuck. We either fail at `locate review panel` or at `find link and click`:

- If `find link and click` fails, it's not a very reliable stop signal on its own; but since we put the click approach after the guess-url approach, I guess it's fine to finalize when `find link and click` fails (see the sketch below).
- If `locate review panel` fails, we do want to terminate.
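A hedged sketch of that finalize-vs-terminate decision; the selectors and method names are placeholders, and it assumes the guess-url approach has already been tried before the click approach:

```java
import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;

public class NextPageDecision {

    /** What the scraper should do after trying to move on to the next page. */
    public enum Outcome { CONTINUE, FINALIZE, TERMINATE }

    private static final By REVIEW_PANEL = By.cssSelector("#ReviewsFeed");         // hypothetical
    private static final By NEXT_PAGE_LINK = By.cssSelector("a.pagination__next"); // hypothetical

    /** Called only after the guess-url approach has already been tried for this page. */
    public static Outcome afterGuessUrlFailed(WebDriver driver) {
        // Failing to locate the review panel means the page itself is off -> terminate.
        try {
            driver.findElement(REVIEW_PANEL);
        } catch (NoSuchElementException e) {
            return Outcome.TERMINATE;
        }

        // The click approach runs after the guess-url approach, so failing to find
        // the next-page link here is taken as "no more reviews" -> finalize.
        try {
            driver.findElement(NEXT_PAGE_LINK).click();
            return Outcome.CONTINUE;
        } catch (NoSuchElementException e) {
            return Outcome.FINALIZE;
        }
    }
}
```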