petermr / openVirus

aggregation of scholarly publications and extracted knowledge on viruses and epidemics.

getpapers FATAL ERROR when downloading 30k articles or more #64

Open EmanuelFaria opened 4 years ago

EmanuelFaria commented 4 years ago

Hi @petermr @deadlyvices

I got the error below last night trying to download more than 30k at a pop. When I limit my download to 29k or below (-k 29000), the problem goes away.

I found a discussion online about a fix to a similar-looking error here ... but even if this is useful, I don't know how to implement it. :-/

HERE'S WHAT HAPPENS:

It seems to stall in the "Retrieving results" phase. I get these incremental updates, as below:

Retrieving results [==----------------------------] 97%
Retrieving results [==----------------------------] 98%

and then this error message:

<--- Last few GCs --->

[39301:0x110000000]   739506 ms: Mark-sweep 2056.4 (2064.7) -> 2055.5 (2064.7) MB, 835.1 / 0.0 ms (+ 0.0 ms in 4 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 837 ms) (average mu = 0.082, current mu = 0.002) allocatio
[39301:0x110000000]   740780 ms: Mark-sweep 2056.5 (2064.7) -> 2055.7 (2064.9) MB, 1134.1 / 0.0 ms (+ 0.0 ms in 15 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 1273 ms) (average mu = 0.098, current mu = 0.109) alloca

<--- JS stacktrace --->

==== JS stack trace =========================================

0: ExitFrame [pc: 0x100950919]
1: StubFrame [pc: 0x1009519b3]

Security context: 0x1e91dd3c08d1
2: write [0x1e91f4b412d9] [/usr/local/lib/node_modules/getpapers/node_modules/xml2js/node_modules/sax/lib/sax.js:~965] [pc=0x33fd2f286764] (this=0x1e91b044eec1, 0x1e9221e00119 <Very long string[8244830]>)
3: /* anonymous */ [0x1e91c4bc7169] [/usr/local/lib/node_modules/getpapers/node_modu...

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

Writing Node.js report to file: report.20200415.123321.39301.0.001.json
Node.js report completed
 1: 0x100080c68 node::Abort() [/usr/local/bin/node]
 2: 0x100080dec node::errors::TryCatchScope::~TryCatchScope() [/usr/local/bin/node]
 3: 0x100185167 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
 4: 0x100185103 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
 5: 0x10030b2f5 v8::internal::Heap::FatalProcessOutOfMemory(char const*) [/usr/local/bin/node]
 6: 0x10030c9c4 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [/usr/local/bin/node]
 7: 0x100309837 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/usr/local/bin/node]
 8: 0x1003077fd v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/local/bin/node]
 9: 0x100312fba v8::internal::Heap::AllocateRawWithLightRetry(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/local/bin/node]
10: 0x100313041 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/local/bin/node]
11: 0x1002e035b v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/usr/local/bin/node]
12: 0x100618718 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long, v8::internal::Isolate*) [/usr/local/bin/node]
13: 0x100950919 Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit [/usr/local/bin/node]
14: 0x1009519b3 Builtins_StringAdd_CheckNone [/usr/local/bin/node]
15: 0x33fd2f286764
Abort trap: 6

ziflex commented 4 years ago

It seems that you have hit NodeJS's (V8) default heap memory limit.

You can try to run the app using the following args to increase the limit:

node --max-old-space-size=8192 ./bin/getpapers.js OTHER_ARGS

Beyond that, there is either a memory leak or a design flaw: the app does not stream data, which would prevent such errors.
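(If getpapers is installed globally and run as plain `getpapers`, the same flag can instead be passed through the `NODE_OPTIONS` environment variable on recent Node.js versions; a sketch, where the query, output directory and limit are placeholders rather than values from this thread:)

NODE_OPTIONS="--max-old-space-size=8192" getpapers -q 'YOUR_QUERY' -o output_dir -x -k 30000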

petermr commented 4 years ago

A messy work-around is to search by date-slices.

search after (say) 2016
search before 2016

and combine the results.
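For example, a sketch using Europe PMC's FIRST_PDATE field inside the getpapers query (the search term, output directories and date boundaries below are placeholders, not from this thread):

getpapers -q '"viral epidemics" AND FIRST_PDATE:[2000-01-01 TO 2015-12-31]' -o slice_before2016 -x
getpapers -q '"viral epidemics" AND FIRST_PDATE:[2016-01-01 TO 2020-12-31]' -o slice_2016_onward -x

Each slice stays under the limit that triggers the crash, and the two output directories can then be merged.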


EmanuelFaria commented 4 years ago


@ziflex Thanks for the tip!

I tried the terminal command you posted, but it seemed my getpapers.js was hiding. Here's what I got:

Last login: Sun Jun 21 18:37:21 on console
Mannys-MacBook-Pro:~ emanuelfaria$ node --max-old-space-size=8192 ./bin/getpapers.js OTHER_ARGS
internal/modules/cjs/loader.js:969
  throw err;
  ^

Error: Cannot find module '/Users/emanuelfaria/bin/getpapers.js'
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:966:15)
    at Function.Module._load (internal/modules/cjs/loader.js:842:27)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:71:12)
    at internal/main/run_main_module.js:17:47 {
  code: 'MODULE_NOT_FOUND',
  requireStack: []
}

So I searched my drive and found it, and updated the snippet like so:

node --max-old-space-size=8192 /usr/local/lib/node_modules/getpapers/bin/getpapers.js OTHER_ARGS

This is the response I received... any ideas?

error: No query given. You must provide the --query argument.

ziflex commented 4 years ago

By OTHER_ARGS I implied any valid getpapers arguments :)
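So, putting the heap flag, the full path found above, and real getpapers arguments together, the invocation would look something like this (the query, output directory and limit are placeholders, not from this thread):

node --max-old-space-size=8192 /usr/local/lib/node_modules/getpapers/bin/getpapers.js -q 'viral epidemics' -o viral_epidemics -x -k 30000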

petermr commented 4 years ago

It may be that we should use `curl` or `ferret`; `ami download` has a curl wrapper. I haven't used it on EPMC, but it should be fairly straightforward to manage the cursor and do 1000 papers at a time. I don't see any obvious memory leaks with `curl`.

The point is that by the time you have got to 30,000 papers you know your query well and can afford to take the time to use another tool.

BTW make sure you know what you are going to do with 30,000 articles. You don't want to download huge amounts if the processing has problems.
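A rough sketch of that cursor-based approach against the Europe PMC REST search API with plain curl and jq (the query is a placeholder, and the stop condition assumes the Solr-style convention that the API repeats the same cursor on the last page):

CURSOR='*'
while true; do
  # Fetch one page of up to 1000 records; --data-urlencode keeps the cursor URL-safe.
  PAGE=$(curl -s -G 'https://www.ebi.ac.uk/europepmc/webservices/rest/search' \
    --data-urlencode 'query=YOUR_QUERY' \
    --data-urlencode 'format=json' \
    --data-urlencode 'pageSize=1000' \
    --data-urlencode "cursorMark=$CURSOR")
  echo "$PAGE" >> epmc_results.jsonl
  NEXT=$(echo "$PAGE" | jq -r '.nextCursorMark // empty')
  # Stop when the cursor is missing or stops advancing (no further pages).
  if [ -z "$NEXT" ] || [ "$NEXT" = "$CURSOR" ]; then
    break
  fi
  CURSOR=$NEXT
done

Each iteration holds only one page in memory, which is what avoids the heap blow-up that getpapers hits when it accumulates the whole result set.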


EmanuelFaria commented 4 years ago

By OTHER_ARGS I implied any valid getpapers arguments :)

Ooooohhhhhhh LOL. Thanks!

EmanuelFaria commented 4 years ago

do 1000 papers at a time.

@peter I don't know what curl or ferret are, but I think downloading 1000 at a time, rather than (what looks to me like) processing them 1000 at a time and then waiting until the end to download them all, is a good idea.

petermr commented 4 years ago

`curl`, `ferret` and `scrapy` are possible alternatives to `getpapers`.

They all download papers automatically (the technical details may differ), but you can say "please download this many from EPMC".

The papers are only useful if they are read or analyzed. I'm suggesting that you make sure you can do this for the first 30,000 OK before downloading more.


EmanuelFaria commented 4 years ago

ok. Thanks Peter.