ZacMilano closed this issue 4 years ago.
After some research on the issue, most of the suggested solutions involve increasing the 'UV_THREADPOOL_SIZE' environment variable. I did that; let's wait and see whether we get this issue again.
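Since hoaxy invokes the Mercury parser through a Node.js subprocess, one way to apply that setting is to pass it in the child process environment. This is only a sketch under assumptions: the size value (64) is an arbitrary illustrative choice (libuv's default is 4), and the commented script invocation is a placeholder, not hoaxy's actual code.

```python
import os

# Build the environment for the node subprocess with a larger libuv
# thread pool. The value 64 is an illustrative assumption; the default is 4.
env = dict(os.environ)
env["UV_THREADPOOL_SIZE"] = "64"

# This env dict would then be passed when spawning the parser, e.g.:
#   subprocess.run(["node", "<mercury script>", url], env=env)
print(env["UV_THREADPOOL_SIZE"])
```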
Closing for now, hoping it was resolved.
I created an issue in the mercury-parser GitHub repo but did not hear anything back. https://github.com/postlight/mercury-parser/issues/423
We discussed the issue. Until/unless we get a response from the mercury-parser developers, we can suppress the timeout warning by catching the exception and treating it as if we had failed to extract the needed fields from the parser. @chathuriw will take care of this part.
A more effective and efficient solution would be to skip the re-fetching of the page by Mercury via the URL, since the page has already been downloaded and is in our database. If Mercury has an internal function that takes the HTML source, we could call that function directly.
The same may apply to newspaper3k for cases in which Mercury fails.
@filmenczer I catch the exception and check the exit code from the Node.js script.
Great --- moving this out of the Python 3 Port milestone.
Warnings are continuing...
The warning is handled; now we need to see how often the underlying problem occurs. @chathuriw will get a rough estimate of how often the Mercury parser fails (percentage of documents per day), so we can decide whether we need to look further into the parsing errors.
Then we can decide whether to optimize to avoid repeated downloads of the same page.
It seems the events triggering the warning are rare (one every few days). @chathuriw will check the percentage and, assuming it is low, will close the issue.
Sorry, I miscounted the errors; we are actually getting more than I thought. Since 2019-07-05 we have logged 109 error messages, and the total number of articles we processed in that period is 9,068, so the failure rate is 1.2%.
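For the record, the rate quoted above checks out from the two counts:

```python
# Failure rate from the counts reported above.
errors = 109
articles = 9068
rate = errors / articles * 100
print(f"{rate:.1f}%")  # 1.2%
```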
1.2% is not huge but also not as small as we hoped... Would it be possible to look at some pages that are triggering these errors, to see if they are inevitable (problems with the pages that we cannot control) and therefore should be ignored? If instead the pages look okay, then we need to look for problems in the parser... Do the logs have the problematic URLs?
Here are some of the URLs:
http://www.newsbiscuit.com/2019/07/15/northern-irelands-bonfires-just-a-bit-of-traditional-harmless-fun-say-dup/
http://www.newsbiscuit.com/2019/07/15/tennis-courts-fully-booked-up-until-wednesday/
http://www.newsbiscuit.com/2019/07/15/gay-activist-defends-idiot-reversion-therapy/
https://www.redstate.com/setonmotley/2019/07/15/localities-shouldn%E2%80%99t-dictating-inter-national-policy/
https://stage.redstate.com/darth641/2019/07/11/president%E2%80%99s-twitter-now-immune-banning/
https://thefreethoughtproject.com/veterans-create-force-to-expose-pedophiles-and-rescue-trafficked-children/?fbclid=IwAR2y60YSq33Yy33w2dp54m_JWgDUpVEBbsuCup4gHeym4OOa-ep0rBV_iS4
Thanks -- these pages have lots of HTML errors (see validator.w3.org) and one is password-protected.
On the other hand, these issues may be common among, say, WP sites...
@shaochengcheng @ZacMonroe any ideas??
We tested the problematic articles using https://newspaper-demo.herokuapp.com/ and it seems that newspaper3k can parse them. Therefore @chathuriw will handle these errors as we do for articles where we are unable to extract fields, so that the article will then be parsed via newspaper3k.
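The fallback flow described above can be sketched generically. This is only an illustration of the control flow: the parser callables below are hypothetical stand-ins, and the field names `title`/`content` are assumptions; hoaxy's real code would call the Mercury node script and newspaper3k.

```python
def extract_article(url, primary, fallback):
    """Try the primary parser; on any failure (exception or missing
    fields), fall back to the secondary one. Both parsers are assumed
    to return a dict with 'title' and 'content' keys, or None."""
    try:
        result = primary(url)
    except Exception:
        result = None
    if result and result.get("title") and result.get("content"):
        return result
    return fallback(url)

# Stub parsers standing in for Mercury (which times out here) and
# newspaper3k (which succeeds):
def mercury_stub(url):
    raise RuntimeError("ESOCKETTIMEDOUT")

def newspaper_stub(url):
    return {"title": "t", "content": "c"}

print(extract_article("http://example.com", mercury_stub, newspaper_stub))
```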
@chathuriw implemented the fix. She will close this issue once she checks the logs and confirms that it is working as planned.
I checked the logs, and most of the time it fails with Newspaper3k as well, with empty content and title.
Okay. Nothing we can do about that, but at least we know we are making the best effort. Assuming we are dealing with those cases appropriately -- ignore if both parsers fail -- you can close the issue. Thanks!
When the cron job runs the following command:
Cron <truthy@burns> source activate hoaxy-be-py3 && hoaxy --console-log-level=critical crawl --parse-article --limit=10000
we get the error:
(node:10396) UnhandledPromiseRejectionWarning: Error: ESOCKETTIMEDOUT
    at ClientRequest.<anonymous> (/nfs/nfs7/home/truthy/node_modules/postman-request/request.js:1025:19)
    at Object.onceWrapper (events.js:277:13)
    at ClientRequest.emit (events.js:189:13)
    at TLSSocket.emitRequestTimeout (_http_client.js:662:40)
    at Object.onceWrapper (events.js:277:13)
    at TLSSocket.emit (events.js:189:13)
    at TLSSocket.Socket._onTimeout (net.js:440:8)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)
(node:10396) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:10396) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.