osome-iu / hoaxy-backend

Backend component for Hoaxy, a tool to visualize the spread of claims and fact checking
http://hoaxy.iuni.iu.edu/
GNU General Public License v3.0

Determine cause of UnhandledPromiseRejectionWarning #38

Closed: ZacMilano closed this issue 4 years ago

ZacMilano commented 5 years ago

When the cron job runs the following command:

Cron <truthy@burns> source activate hoaxy-be-py3 && hoaxy --console-log-level=critical crawl --parse-article --limit=10000

we get the error:

(node:10396) UnhandledPromiseRejectionWarning: Error: ESOCKETTIMEDOUT
    at ClientRequest.<anonymous> (/nfs/nfs7/home/truthy/node_modules/postman-request/request.js:1025:19)
    at Object.onceWrapper (events.js:277:13)
    at ClientRequest.emit (events.js:189:13)
    at TLSSocket.emitRequestTimeout (_http_client.js:662:40)
    at Object.onceWrapper (events.js:277:13)
    at TLSSocket.emit (events.js:189:13)
    at TLSSocket.Socket._onTimeout (net.js:440:8)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)
(node:10396) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:10396) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

chathuriw commented 5 years ago

After some research on the issue, most of the suggested solutions involve increasing the 'UV_THREADPOOL_SIZE' environment variable. I did that; let's wait and see if the issue recurs.
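
For reference, a minimal sketch of the workaround on the Node side (the value 64 is arbitrary, and the variable can equally be exported on the crontab line). libuv reads UV_THREADPOOL_SIZE only once, when the thread pool is first created, so it has to be set before the script schedules any file system, DNS, or crypto work:

// Very top of the Node entry script, before any I/O is scheduled.
// libuv's default pool is 4 threads; a larger pool makes it less likely
// that a few slow requests starve all other pending tasks.
process.env.UV_THREADPOOL_SIZE = '64';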

filmenczer commented 5 years ago

Closing for now, hoping it was resolved.

chathuriw commented 5 years ago

I created an issue in the Mercury GitHub repo, but have not heard anything back. https://github.com/postlight/mercury-parser/issues/423

filmenczer commented 5 years ago

We discussed the issue. Until/unless we get a response from the mercury-parser developers, we can suppress the timeout warning by catching the exception and treating it as if we had failed to extract the needed fields with the parser. @chathuriw will take care of this part.

A more effective and efficient solution would be to skip Mercury's re-fetching of the page via its URL, since the page has already been downloaded and is stored in our database. If Mercury has an internal function that takes the HTML source, we could call that function directly (see the sketch below).
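
For what it's worth, the mercury-parser README does document such an option: Mercury.parse() accepts a pre-fetched HTML string via an html field in its options object, which skips the network fetch entirely. A minimal sketch, where url and htmlFromDatabase are placeholders for the article URL and the stored page body:

const Mercury = require('@postlight/mercury-parser');

// With `html` supplied, Mercury parses the given string directly
// instead of re-fetching the URL, so the timeout path that raises
// ESOCKETTIMEDOUT is never exercised.
Mercury.parse(url, { html: htmlFromDatabase })
  .then(result => {
    // result.title, result.content, result.date_published, ...
  });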

The same may apply to paper3k for cases in which Mercury fails.

chathuriw commented 5 years ago

@filmenczer I catch the exception and check the exit code returned by the Node.js file.
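
Roughly along these lines on the Node side (a simplified sketch, not the exact script):

const Mercury = require('@postlight/mercury-parser');
const url = process.argv[2];

Mercury.parse(url)
  .then(result => {
    process.stdout.write(JSON.stringify(result));
  })
  .catch(err => {
    // A rejected promise (e.g. Error: ESOCKETTIMEDOUT) now lands here
    // instead of producing an UnhandledPromiseRejectionWarning.
    console.error(err.message);
    process.exit(1); // the caller treats a non-zero exit as a parse failure
  });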

filmenczer commented 5 years ago

Great -- moving this out of the Python 3 Port milestone.

filmenczer commented 5 years ago

Warnings are continuing...

filmenczer commented 5 years ago

The warning is resolved; now we need to see how often the underlying problem occurs. @chathuriw will get a rough estimate of how often the Mercury parser fails (percentage of documents per day), so we can decide whether we need to look further into the parsing errors.

Then we can decide whether to optimize to avoid repeated downloads of the same page.

filmenczer commented 5 years ago

It seems the events triggering the warning are rare (one every few days). @chathuriw will check the percentage and, assuming it is low, will close the issue.

chathuriw commented 5 years ago

Sorry, I miscounted the errors; it seems we are getting more than I thought. Since 2019-07-05 we have received 109 error messages, while the total number of articles processed in that period is 9068, so the error rate is about 1.2%.

filmenczer commented 5 years ago

1.2% is not huge but also not as small as we hoped... Would it be possible to look at some pages that are triggering these errors, to see if they are inevitable (problems with the pages that we cannot control) and therefore should be ignored? If instead the pages look okay, then we need to look for problems in the parser... Do the logs have the problematic URLs?

chathuriw commented 5 years ago

Here are some of the URLs:

http://www.newsbiscuit.com/2019/07/15/northern-irelands-bonfires-just-a-bit-of-traditional-harmless-fun-say-dup/
http://www.newsbiscuit.com/2019/07/15/tennis-courts-fully-booked-up-until-wednesday/
http://www.newsbiscuit.com/2019/07/15/gay-activist-defends-idiot-reversion-therapy/
https://www.redstate.com/setonmotley/2019/07/15/localities-shouldn%E2%80%99t-dictating-inter-national-policy/
https://stage.redstate.com/darth641/2019/07/11/president%E2%80%99s-twitter-now-immune-banning/
https://thefreethoughtproject.com/veterans-create-force-to-expose-pedophiles-and-rescue-trafficked-children/?fbclid=IwAR2y60YSq33Yy33w2dp54m_JWgDUpVEBbsuCup4gHeym4OOa-ep0rBV_iS4

filmenczer commented 5 years ago

Thanks -- these pages have lots of HTML errors (see validator.w3.org), and one is password-protected.

On the other hand, these issues may be common among, say, WordPress sites...

@shaochengcheng @ZacMonroe any ideas??

filmenczer commented 5 years ago

We tested the problematic articles using https://newspaper-demo.herokuapp.com/ and it seems that paper3k can parse them. Therefore @chathuriw will handle these errors the same way we handle articles where we are unable to extract fields, so that these articles will then be parsed via paper3k.

filmenczer commented 5 years ago

@chathuriw implemented the fix. She will close this issue once she checks the logs and confirms that it is working as planned.

chathuriw commented 4 years ago

I checked the logs, and most of the time these articles fail with Newspaper3k as well, returning empty content and title.

filmenczer commented 4 years ago

Okay. Nothing we can do about that, but at least we know we are making the best effort. Assuming we are dealing with those cases appropriately -- ignore if both parsers fail -- you can close the issue. Thanks!