Closed: noahssarcastic closed this issue 2 years ago.
Hi @noahssarcastic! Unfortunately, memory usage is a known issue with the current Ohm implementation. In general, packrat parsing (the parsing algorithm Ohm uses) consumes a lot of memory because of its memo table, and our main focus with Ohm has been on ease of use rather than performance.
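To give a rough picture of why (this is a simplified sketch of packrat memoization in general, not Ohm's actual data structures): the parser records the result of applying each rule at each input position, so the table grows roughly with the number of rule applications times the length of the input.

// Simplified sketch of packrat memoization, not Ohm's real internals.
// `evalRuleBody` stands for whatever function actually matches a rule at a position.
function memoize(evalRuleBody) {
  const memo = new Map() // key: "ruleName@position" -> parse result
  return function applyRule(ruleName, pos) {
    const key = ruleName + '@' + pos
    if (memo.has(key)) {
      return memo.get(key) // reuse earlier work instead of re-parsing
    }
    const result = evalRuleBody(ruleName, pos)
    memo.set(key, result) // every (rule, position) pair stays in memory until the parse ends
    return result
  }
}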
That said, there are some opportunities for reducing memory usage that I tried in the past but never ended up landing in the code base. Let me see if I can get some of them working again.
How big are the files you're trying to parse exactly (in bytes)?
One that I just tried today is ~11kB. This one is from an open dataset, so I can share it.
https://www.sintef.no/Projectweb/MatMoRA/Downloads/Johansen/
Looks like they have "smaller" versions of the dataset, but it appears they just attach a text file with indices of cells to ignore.
Thanks! Can you also share your grammar, and give me the command line that you are using?
11kB is not very large, so if it's OOMing then it can probably be fixed by changes to the grammar.
Grid {
  GridFile = Block*

  Block = Title Body
  Title = upper+
  End = "/"
  Body = (~End option)+ End
  option = (digit | letter | ".")+

  // line endings
  eol = "\r\n" -- crlf
      | "\n"   -- lf
      | end    -- end

  comment = "--" (~eol any)* eol

  // ignore comments
  space += comment
}
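For context, a typical block that this grammar matches looks something like the following (a made-up example, not taken from the dataset): an upper-case keyword, a run of values, and a terminating "/".

PERMX
100.0 100.0 100.0
200.0 200.0 200.0
/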
One of the Blocks commonly accounts for over half of the lines in the file.
My script:
const fs = require('fs')
const path = require('path')
const ohm = require('ohm-js')

const parserFile = fs.readFileSync(
  path.join(__dirname, 'gridParser.ohm'),
  'utf-8'
)
const parser = ohm.grammar(parserFile)

const gridFile = fs.readFileSync(
  path.join(__dirname, '..', 'data', 'SPE9.GRDECL'),
  'utf-8'
)

const semantics = parser.createSemantics().addOperation('toJson', {
  GridFile(iter) {
    const blocks = iter.children.map((block) => block.toJson())
    return blocks.reduce(
      (prev, block) => ({ ...prev, [block.title]: block.body }),
      {}
    )
  },
  Block(title, body) {
    const titleString = title.toJson()
    return {
      title: titleString,
      body: body.toJson(),
    }
  },
  Title(upperIter) {
    const upperString = upperIter.children.map((upper) => upper.toJson())
    return upperString.join('').toLowerCase()
  },
  Body(bodyIter, _end) {
    return bodyIter.children.map((option) => option.toJson())
  },
  option(charIter) {
    const option = charIter.children
      .map((character) => character.sourceString)
      .join('')
    const asFloat = Number.parseFloat(option)
    if (Number.isNaN(asFloat)) {
      return option
    }
    return asFloat
  },
  _iter() {
    return this.children.map((child) => child.toJson())
  },
  _terminal() {
    return this.sourceString
  },
})

const match = parser.match(gridFile)
if (match.succeeded()) {
  console.log('Success')
  const grid = semantics(match).toJson()
  try {
    var json = JSON.stringify(grid)
    fs.writeFileSync(path.join(__dirname, '..', 'data', 'grid.json'), json)
    console.debug('Wrote')
  } catch (err) {
    console.error(err)
  }
} else {
  console.log('Fail')
  console.debug(match.message)
}
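For a hypothetical PERMX block like the one above, the toJson operation produces a plain object keyed by the lowercased block title, with each value parsed as a number, e.g. { "permx": [100, 100, 100, 200, 200, 200] }.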
Also, I would be really interested in finding out more about how Ohm grammars themselves can be optimized for performance.
@pdubroy I originally said kB but I meant MB, my apologies. The file is 11MB.
Realistically, I don't think it will be easy to reduce Ohm's memory usage enough to parse files that large.
Given the simplicity of this grammar, I'd suggest a hand-written recursive-descent parser.
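As a rough illustration of that suggestion (an untested sketch of my own, assuming the values and the "/" terminator are always whitespace-separated), the whole format can be handled in a single pass:

// Untested sketch of a hand-written parser for this format.
// Assumes values and the "/" terminator are whitespace-separated.
function parseGrdecl(text) {
  const result = {}
  let title = null
  let body = []
  for (let line of text.split(/\r?\n/)) {
    // Strip "--" comments, as the grammar's `comment` rule does.
    const commentStart = line.indexOf('--')
    if (commentStart !== -1) {
      line = line.slice(0, commentStart)
    }
    for (const token of line.trim().split(/\s+/).filter(Boolean)) {
      if (token === '/') {
        // End of the current block: store it under a lowercased key.
        if (title !== null) {
          result[title.toLowerCase()] = body
        }
        title = null
        body = []
      } else if (title === null) {
        title = token // first token of a block is the keyword
      } else {
        const asFloat = Number.parseFloat(token)
        body.push(Number.isNaN(asFloat) ? token : asFloat)
      }
    }
  }
  return result
}

Because it streams through the lines and keeps only the output object, there is no memo table to grow with the input.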
I'm using Ohm for a work project where I've written a grammar for a geosciences file type called .grdecl. Files in this format can commonly be 100k+ lines. I'm succeeding in parsing smaller dummy files, but when I try to parse large files, I run out of RAM. I've raised the process's memory cap to 8GB but have still been unable to parse a file of ~150k lines.
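(For reference, the memory cap for a Node.js process is typically raised with Node's --max-old-space-size flag, whose value is in megabytes, e.g. node --max-old-space-size=8192 parseGrid.js, where the script name is just a placeholder.)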