jonorthwash / ud-annotatrix


File handling & issues related to uploading big files #377

Open yaskevich opened 5 years ago

yaskevich commented 5 years ago

Some tests (via `tasklist /FI "IMAGENAME eq chrome.exe"`, Win7 x64):

- Starting the browser (Chromium 77) [screenshot: start]
- Opening the main page (with the list of corpora) [screenshot: main1]
- Opening a small corpus (10 sentences) in Annotatrix [screenshot: pushkin]
- Starting the browser again [screenshot: start2]
- Main page again [screenshot: main2]
- Loading a 3 MB corpus from a GitHub repo [screenshot: french]

yaskevich commented 5 years ago

If I try to load a really big corpus from GitHub (like this):

<--- Last few GCs --->

[38012:00000000003D35C0]   746913 ms: Scavenge 1367.9 (1435.5) -> 1353.3 (1436.5) MB, 6.1 / 0.0 ms  (average mu = 0.188, current mu = 0.147)
 allocation failure
[38012:00000000003D35C0]   747018 ms: Scavenge 1368.1 (1436.5) -> 1354.1 (1438.0) MB, 6.0 / 0.0 ms  (average mu = 0.188, current mu = 0.147)
 allocation failure
[38012:00000000003D35C0]   747131 ms: Scavenge 1368.5 (1438.0) -> 1354.3 (1439.0) MB, 7.5 / 0.0 ms  (average mu = 0.188, current mu = 0.147)
 allocation failure

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x03b013e9e6e9 <JSObject>
    0: builtin exit frame: concat(this=0x03c06c2833c1 <JSArray[35707]>,0x022a823b7ba9 <NxBaseClass map = 0000006C1924DCB1>,0x03c06c2833c1 <JSArray[35707]>)

    1: /* anonymous */(aka /* anonymous */) [00000272DEE11361] [D:\dev\node\ud-annotatrix\node_modules\notatrix\src\nx\corpus.js:~310] [pc=000001F92A9B08A6](this=0x03bb868026f1 <undefined>,split=0x019004bca7f9 <String[694]\: # sent_id = 2013...

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 000000013FE8F04A v8::internal::GCIdleTimeHandler::GCIdleTimeHandler+5114
 2: 000000013FE6A0C6 node::MakeCallback+4518
 3: 000000013FE6AA30 node_module_register+2032
 4: 00000001400F20EE v8::internal::FatalProcessOutOfMemory+846
 5: 00000001400F201F v8::internal::FatalProcessOutOfMemory+639
 6: 0000000140612BC4 v8::internal::Heap::MaxHeapGrowingFactor+9556
 7: 0000000140609C46 v8::internal::ScavengeJob::operator=+24310
 8: 000000014060829C v8::internal::ScavengeJob::operator=+17740
 9: 0000000140610F87 v8::internal::Heap::MaxHeapGrowingFactor+2327
10: 0000000140611006 v8::internal::Heap::MaxHeapGrowingFactor+2454
11: 00000001401CCBE8 v8::internal::Factory::AllocateRawArray+56
12: 00000001401CD562 v8::internal::Factory::NewFixedArrayWithFiller+66
13: 00000001402B33BB v8::internal::CodeStubAssembler::ConstexprBoolNot+18955
14: 00000001402B3EFE v8::internal::CodeStubAssembler::ConstexprBoolNot+21838
15: 00000001402B3CEB v8::internal::CodeStubAssembler::ConstexprBoolNot+21307
16: 000001F92A45C721
[nodemon] app crashed - waiting for file changes before starting...
yaskevich commented 5 years ago

The reason is that the upload method calls notatrix, which tries to parse all of the received content in memory.
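
For illustration only, here is a minimal sketch (not existing ud-annotatrix/notatrix code; all names are made up) of how the upload path could split a CoNLL-U file into sentences one at a time instead of handing the whole file to the parser at once:

```js
// Hypothetical sketch: split an uploaded CoNLL-U file into sentences one at
// a time, so only a single sentence ever needs to be held in memory.
const fs = require('fs');
const readline = require('readline');

async function* readSentences(path) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path, { encoding: 'utf8' }),
    crlfDelay: Infinity,
  });
  let lines = [];
  for await (const line of rl) {
    if (line.trim() === '') {                 // a blank line ends a sentence block
      if (lines.length) yield lines.join('\n');
      lines = [];
    } else {
      lines.push(line);
    }
  }
  if (lines.length) yield lines.join('\n');   // last sentence without a trailing blank line
}

// Usage sketch: hand each sentence to the parser individually instead of
// calling it once with the entire file contents. `parseSentence` stands in
// for whatever per-sentence parser is available.
async function importCorpus(path, parseSentence) {
  let count = 0;
  for await (const conllu of readSentences(path)) {
    parseSentence(conllu);                    // parse and persist, then let it be GC'd
    count += 1;
  }
  return count;
}
```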

yaskevich commented 5 years ago

By the way, having a couple of big corpora in the Annotatrix database makes the main page quite slow, because in order to list the corpora it reads and parses all of the files just to output such fields as creation date, filename, and number of sentences.

On my test install I currently have two corpora of about 2k sentences each. The main page takes 8.89 s to load, and loading such a corpus into the editing interface takes about 6–8 s.

jonorthwash commented 5 years ago

This is a notatrix bug and shouldn't be filed against annotatrix.

yaskevich commented 5 years ago

The reason is that the upload method calls notatrix, which tries to parse all of the received content in memory.

I investigated the issue in depth, and it turned out to be more complex than it looked. The Corpus and Sentence objects are very redundant and contain a lot of circular links. As a result, a serialized corpus takes 6–7 times more space than the original file in CoNLL format, and in memory it is about ten times larger.
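
A toy illustration of the kind of problem circular parent/child links cause (this is not notatrix code, just a sketch of the general pattern):

```js
// Toy example: per-token back-references create cycles and duplicate data,
// so naive serialization either throws or balloons in size.
const corpus = { name: 'demo', sentences: [] };
const sentence = { corpus, tokens: [] };            // child -> parent link
sentence.tokens.push({ form: 'Hello', sentence });  // token -> sentence link
corpus.sentences.push(sentence);

// JSON.stringify(corpus);  // throws: "Converting circular structure to JSON"

// A common workaround is to serialize only plain data and rebuild the links
// on load, keeping the stored form close to the original CoNLL-U size.
const plain = {
  name: corpus.name,
  sentences: corpus.sentences.map(s => ({
    tokens: s.tokens.map(t => ({ form: t.form })),
  })),
};
console.log(JSON.stringify(plain));
```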

Apart from that, Corpus doesn't contain any descriptive info such as the number of sentences or errors; all of those stats can be provided to the user only by parsing the whole corpus. That's why the main page is so slow once some corpora have been added to the app.
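
One possible way to avoid re-parsing everything on every page load would be to compute the cheap listing stats once, at upload time, and store them separately. A sketch, with made-up file and field names:

```js
// Hypothetical sketch: compute listing metadata once, at upload time,
// instead of re-parsing every corpus when rendering the main page.
const fs = require('fs');

function buildCorpusMeta(path) {
  // Runs once per upload. A sentence in CoNLL-U is delimited by a blank
  // line, so counting blocks is far cheaper than building full Sentence objects.
  const text = fs.readFileSync(path, 'utf8');
  const sentences = text.split(/\n\s*\n/).filter(block => block.trim() !== '').length;
  return {
    filename: path,
    sentences,
    createdAt: new Date().toISOString(),
  };
}

// The main page would then read only these small records, e.g. from a
// `meta.json` file or a table in the database (names here are made up).
fs.writeFileSync('meta.json', JSON.stringify(buildCorpusMeta('corpus.conllu'), null, 2));
```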

While the slowness is the lesser problem, the bigger one is that loading a corpus into Annotatrix requires a disproportionate amount of RAM.

The biggest file I managed to upload is this one of 24.2 MB, containing 16,809 sentences. It is impossible to load a bigger corpus because it hits the V8 heap size limit (1.4 GB).
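
For reference, the heap ceiling the process is running with can be checked from Node itself; raising it with `--max-old-space-size` when starting the server is only a stopgap, not a fix for the underlying memory use:

```js
// Report the V8 heap ceiling for the current process. On 64-bit Node of
// that era the default is roughly 1.4-1.5 GB; it can be raised by starting
// the server as e.g. `node --max-old-space-size=4096 server.js`.
const v8 = require('v8');

const limitMb = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`V8 heap limit: ${limitMb.toFixed(0)} MB`);
```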

On my laptop (not super new, but closer to an average user's environment), after uploading it takes about a minute until the corpus is loaded into the editing interface. At peak, Node.js consumes 986 MB of RAM and Firefox 2.1 GB, and during the process the browser shows this alert several times: [screenshot: fox-close-Annotatrix]

There are a couple of options to deal with this issue:

yaskevich commented 5 years ago

Filed an issue for notatrix: https://github.com/keggsmurph21/notatrix/issues/6

keggsmurph21 commented 5 years ago

Hi, sorry to chime in at the 11th hour, but I think there's actually a fourth option: build some sort of parser/database pipeline that can parse the sentences one by one and then offload the parsed data onto disk (i.e. into the database). I actually started work on this sometime last year but didn't get all that far.

I could try to explain my thinking a bit more sometime, but I'm about to board a flight to Hong Kong. Maybe it would be helpful to take a look at this repo.
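
A rough sketch of what such a parse-then-offload pipeline could look like; `better-sqlite3` is only a stand-in for whatever store the app actually uses, and none of these names come from the repo:

```js
// Rough sketch of the parse-then-offload idea: each sentence is handled on
// its own and immediately written to disk, so the whole corpus never has to
// live in memory at once.
const Database = require('better-sqlite3');

async function offloadSentences(sentences, dbPath) {
  // `sentences` is any async iterable of CoNLL-U sentence strings, e.g. the
  // readSentences() generator sketched in an earlier comment.
  const db = new Database(dbPath);
  db.exec('CREATE TABLE IF NOT EXISTS sentences (id INTEGER PRIMARY KEY, conllu TEXT)');
  const insert = db.prepare('INSERT INTO sentences (conllu) VALUES (?)');

  let count = 0;
  for await (const conllu of sentences) {
    insert.run(conllu);   // parse/validate here if needed, then persist and forget
    count += 1;
  }
  db.close();
  return count;
}

// Usage sketch (names hypothetical):
//   await offloadSentences(readSentences('big-corpus.conllu'), 'corpus.db');
```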