node runs out of memory on very large input

akonsu commented 7 years ago

I have a script mkexpr.js that generates a very big expression:

/* global require */

const fs = require('fs');

fs.writeFile('test.wl', mkexpr(1500), err => {
    if (err) {
        console.log(err);
    } else {
        console.log('success');
    }
});

function mkexpr(count) {
    let prefix = '';
    let suffix = '';

    for (let i = 0; i < count; i++) {
        prefix += `With[{x${i} = ${i}},`;
        suffix += ']';
    }
    return `${prefix}${[...Array(count)].map((_, i) => `x${i}`).join('+')}${suffix}`;
}

My grammar wl-grammar.ne is:

@builtin "whitespace.ne"
@builtin "number.ne"

expr -> identifier _ "[" _ "{" _ identifier _ "=" _ int _ "}" _ "," _ expr _ "]"
            {% d => `${d[0]}[{${d[6]}=${d[10]}},${d[16]}]` %}
      | identifier _ "+" _ expr
            {% d => `${d[0]}+${d[4]}` %}
      | identifier
            {% id %}

identifier -> [a-zA-Z]:+ [0-9]:* {% d => `${d[0].join('')}${d[1].join('')}` %}

Then I do node mkexpr.js && nearleyc.js wl-grammar.ne -o wl-grammar.js && more test.wl | nearley-test.js wl-grammar.js

This produces an error:

...
468: {expr → identifier _ "+" _ expr ● }, from: 30465
469: {expr → identifier _ "+" _ expr ● }, from: 30460
470: {expr → identifier _ "+" _ expr ● }, from: 30455

<--- Last few GCs --->

[95384:0x102801600]   112889 ms: Mark-sweep 1400.9 (1434.9) -> 1400.9 (1433.9) MB, 1591.6 / 0.0 ms  (+ 0.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 1592 ms) last resort 
[95384:0x102801600]   114390 ms: Mark-sweep 1400.9 (1433.9) -> 1400.9 (1433.9) MB, 1501.4 / 0.0 ms  last resort 

<--- JS stacktrace --->

==== JS stack trace =========================================

    2: arguments adaptor frame: 3->1
Security context: 0x2c9ebd3a66a1 <JS Object>
    3: map [native array.js:~833] [pc=0x3bbc530dd9e2](this=0xafd6daf4861 <JS Array[5]>,bk=0x17cbb35025f1 <JS Function stringifySymbolSequence (SharedFunctionInfo 0x781fd2964e9)>,bl=0x24eaa9d82311 <undefined>)
    4: arguments adaptor frame: 1->2
    5: toString [/Users/ksushenko/projects/cloudplatform/src/webapp/node_modules/nearley/lib/nearley.js:~18] [pc=0...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [/usr/local/bin/node]
 2: node::FatalException(v8::Isolate*, v8::Local<v8::Value>, v8::Local<v8::Message>) [/usr/local/bin/node]
 3: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/usr/local/bin/node]
 4: v8::internal::Factory::NewFixedArray(int, v8::internal::PretenureFlag) [/usr/local/bin/node]
 5: v8::internal::TypeFeedbackVector::New(v8::internal::Isolate*, v8::internal::Handle<v8::internal::TypeFeedbackMetadata>) [/usr/local/bin/node]
 6: v8::internal::SharedFunctionInfo::FindOrCreateLiterals(v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Handle<v8::internal::Context>) [/usr/local/bin/node]
 7: v8::internal::JSFunction::EnsureLiterals(v8::internal::Handle<v8::internal::JSFunction>) [/usr/local/bin/node]
 8: v8::internal::Compiler::Compile(v8::internal::Handle<v8::internal::JSFunction>, v8::internal::Compiler::ClearExceptionFlag) [/usr/local/bin/node]
 9: v8::internal::Runtime_CompileLazy(int, v8::internal::Object**, v8::internal::Isolate*) [/usr/local/bin/node]
10: 0x3bbc52f043a7
Abort trap: 6

Perhaps this is related to your work on eliminating the exponential memory consumption when working with ambiguous grammars?

tjvr commented 7 years ago

Perhaps this is related to your work on eliminating the exponential memory consumption when working with ambiguous grammars?

No, sorry; the parsing algorithm in master has been stable for a while now (since I did a big push of making things faster).

I’m afraid I don’t have time to look through your grammar, but I suspect it is, in fact, ambiguous :-)

The other option is that it’s right-recursive, but I don’t think that’s the case here.

I hate to say it, but have you tried switching to a tokenizer? It tends to help with performance issues!

_{Sent with GitHawk}

akonsu commented 7 years ago

Yes, it is ambiguous, and maybe recursive. That was a part of the test. I wanted to see how much it can handle. And, yes, I am going to switch to a tokenizer. Do you mean that there is not much that can be done in the code to alleviate the memory issues? Other than being careful when writing grammars?

tjvr commented 7 years ago

Ah, your grammar is definitely right-recursive, due to the expr -> identifier _ "+" _ expr rule. Earley doesn't like right-recursive rules very much; one of the things on my to-do list is to re-implement Leo's optimisation, which should help with this. For now, you should prefer left-recursive rules where possible (e.g. expr -> expr _ "+" _ identifier), since they perform better. I think that's the problem you're having here.

I don't think it's relevant here, but a note on ambiguity anyway: While Nearley does provide all parsings for ambiguous grammars/inputs, we don't (yet) use an efficient representation (the preferred one is a "packed parse forest"). Since we end up producing an array containing each parse, performance degrades the more ambiguity you have. Generally this isn't a problem, since you want to aim to write an unambiguous grammar!

kach / nearley

node runs out of memory on very large input #312