ohmjs / ohm

A library and language for building parsers, interpreters, compilers, etc.
MIT License

Running out of RAM for large files (100k+ lines) #343

Closed · noahssarcastic closed this issue 2 years ago

noahssarcastic commented 2 years ago

I'm using Ohm for a work project where I've written a grammar for a geosciences file type called .grdecl. Files in this format commonly run to 100k+ lines. I can parse smaller dummy files, but when I try to parse large files, I run out of RAM. I've raised the process's memory cap to 8GB but have still been unable to parse a file of ~150k lines.

pdubroy commented 2 years ago

Hi @noahssarcastic! Unfortunately, memory usage is a known issue with the current Ohm implementation. In general, packrat parsing (the algorithm Ohm uses) consumes a lot of memory because intermediate results are stored in a memo table. And our main focus with Ohm has been on ease of use rather than performance.
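[Editor's note: to illustrate why the memo table dominates memory, here is a toy sketch of packrat memoization, not Ohm's actual implementation. A packrat parser caches one result per (position, rule) pair, so the table can grow on the order of input length × number of rules — which is how an 11MB input can balloon into gigabytes of memo entries.]

```javascript
// Toy packrat core: memoize each rule application by (position, rule).
// The cache guarantees linear time, but every (pos, rule) pair visited
// stays resident, which is the memory cost being discussed here.
function makePackrat(rules) {
    const memo = new Map() // key: `${pos}:${ruleName}` -> { ok, nextPos }
    function apply(ruleName, input, pos) {
        const key = pos + ':' + ruleName
        if (memo.has(key)) return memo.get(key) // cache hit: no re-parse
        const entry = rules[ruleName](input, pos, apply)
        memo.set(key, entry) // cache miss: store the result forever
        return entry
    }
    return { apply, memo }
}

// Tiny demo grammar: digits = digit+ ; digit = [0-9]
const rules = {
    digit: (input, pos) =>
        pos < input.length && input[pos] >= '0' && input[pos] <= '9'
            ? { ok: true, nextPos: pos + 1 }
            : { ok: false, nextPos: pos },
    digits: (input, pos, apply) => {
        let p = pos
        let matched = false
        for (;;) {
            const r = apply('digit', input, p)
            if (!r.ok) break
            matched = true
            p = r.nextPos
        }
        return { ok: matched, nextPos: p }
    },
}

const { apply, memo } = makePackrat(rules)
const input = '12345'
console.log(apply('digits', input, 0)) // matches all five digits
console.log(memo.size) // one entry per (pos, rule) pair visited
```

Even this five-character input leaves seven entries in the table (one `digit` attempt at each of six positions, plus `digits` at position 0); on a 100k-line file with many rules, the table is correspondingly enormous.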

That said, there are some opportunities for reducing memory usage that I tried in the past but never ended up landing in the code base. Let me see if I can get some of them working again.

pdubroy commented 2 years ago

How big are the files you're trying to parse exactly (in bytes)?

noahssarcastic commented 2 years ago

One that I just tried today is ~11kB. This one is from an open dataset so I can share it.

https://www.sintef.no/Projectweb/MatMoRA/Downloads/Johansen/

Looks like they have "smaller" versions of the dataset but it appears they just attach a text file with indices of cells to ignore.

pdubroy commented 2 years ago

Thanks! Can you also share your grammar, and give me the command line that you are using?

11kB is not very large, so if it's OOMing, the problem can probably be fixed by changes to the grammar.

noahssarcastic commented 2 years ago

Grid {
    GridFile = Block*

    Block = Title Body

    Title = upper+

    End = "/"

    Body = (~End option)+ End

    option = (digit | letter | ".")+

    // line endings
    eol = "\r\n" -- crlf
        | "\n" -- lf
        | end -- end

    comment = "--" (~eol any)* eol

    // ignore comments
    space += comment

}

One of the Blocks is commonly over half of those lines.

My script:

const fs = require('fs')
const path = require('path')

const ohm = require('ohm-js')

const parserFile = fs.readFileSync(
    path.join(__dirname, 'gridParser.ohm'),
    'utf-8'
)
const parser = ohm.grammar(parserFile)

const gridFile = fs.readFileSync(
    path.join(__dirname, '..', 'data', 'SPE9.GRDECL'),
    'utf-8'
)

const semantics = parser.createSemantics().addOperation('toJson', {
    GridFile(iter) {
        const blocks = iter.children.map((block) => block.toJson())
        return blocks.reduce(
            (prev, block) => ({ ...prev, [block.title]: block.body }),
            {}
        )
    },
    Block(title, body) {
        const titleString = title.toJson()
        return {
            title: titleString,
            body: body.toJson(),
        }
    },
    Title(upperIter) {
        const upperString = upperIter.children.map((upper) => upper.toJson())
        return upperString.join('').toLowerCase()
    },
    Body(bodyIter, _end) {
        return bodyIter.children.map((option) => option.toJson())
    },
    option(charIter) {
        const option = charIter.children
            .map((character) => character.sourceString)
            .join('')
        const asFloat = Number.parseFloat(option)
        if (Number.isNaN(asFloat)) {
            return option
        }
        return asFloat
    },
    _iter() {
        return this.children.map((child) => child.toJson())
    },
    _terminal() {
        return this.sourceString
    },
})

const match = parser.match(gridFile)
if (match.succeeded()) {
    console.log('Success')
    const grid = semantics(match).toJson()
    try {
        var json = JSON.stringify(grid)
        fs.writeFileSync(path.join(__dirname, '..', 'data', 'grid.json'), json)
        console.debug('Wrote')
    } catch (err) {
        console.error(err)
    }
} else {
    console.log('Fail')
    console.debug(match.message)
}

noahssarcastic commented 2 years ago

Also, I would be really interested in finding out more about how Ohm grammars themselves can be optimized for performance.

noahssarcastic commented 2 years ago

@pdubroy I originally said kB but I meant MB, my apologies. The file is 11MB.

pdubroy commented 2 years ago

Realistically, I don't think it will be easy to reduce Ohm's memory usage enough to parse files that large.

Given the simplicity of this grammar, I'd suggest a hand-written recursive-descent parser.
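[Editor's note: for this block structure, a hand-written parser can be a simple line/token loop with no memo table, so memory stays proportional to the output rather than the input × grammar size. The sketch below is illustrative only: the function name is invented, and real GRDECL features such as repeat counts (e.g. `3*0.25`) and INCLUDE directives are not handled.]

```javascript
// Hypothetical single-pass parser for the structure in the grammar above:
// a TITLE of uppercase letters, whitespace-separated values, a "/" block
// terminator, and "--" line comments.
function parseGrdecl(text) {
    const grid = {}
    let title = null
    let body = []
    for (const rawLine of text.split(/\r?\n/)) {
        // Strip "--" comments, mirroring the grammar's comment rule.
        const commentStart = rawLine.indexOf('--')
        const line = (commentStart >= 0
            ? rawLine.slice(0, commentStart)
            : rawLine
        ).trim()
        if (line === '') continue
        for (const token of line.split(/\s+/)) {
            if (token === '/') {
                // End of block: commit it under a lowercased title,
                // matching the toJson semantics in the script above.
                if (title !== null) grid[title.toLowerCase()] = body
                title = null
                body = []
            } else if (title === null) {
                title = token // first token of a block is its title
            } else {
                // Parse numeric values, keep everything else as a string.
                const asFloat = Number.parseFloat(token)
                body.push(Number.isNaN(asFloat) ? token : asFloat)
            }
        }
    }
    return grid
}

const grid = parseGrdecl('PERMX\n1.0 2.0 3.0 /\n')
console.log(grid.permx) // [ 1, 2, 3 ]
```

Because each line is processed and discarded, peak memory is roughly the size of the resulting object, which should comfortably handle an 11MB input.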