lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Exponentially increasing runtime and high memory usage for CSV parser #1340

Closed: jl6 closed this issue 1 year ago

jl6 commented 1 year ago

I am attempting to build a CSV parser based on RFC 4180:

import sys
from lark import Lark

f = open(sys.argv[1], "rb")

csv_parser = Lark(r"""
    file: [header CRLF] record (CRLF record)* [CRLF]
    header: name (COMMA name)*
    record: field (COMMA field)*
    name: field
    field: (escaped | nonescaped)
    escaped: DQUOTE (TEXTDATA | COMMA | CR | LF | DDQUOTE)* DQUOTE
    nonescaped: TEXTDATA*
    TEXTDATA: /[^",]/
    COMMA: ","
    DQUOTE: /"/
    DDQUOTE: /""/
    LF: "\n"
    CR: "\r"
    CRLF: CR LF | CR | LF
""", start="file", use_bytes=True).parse(f.read())

Lark quickly and successfully builds this parser, and it appears to parse small CSV files correctly. However, for larger files (still not very large in the scheme of things, say 600 KB), the runtime increases dramatically and memory usage balloons to multiple gigabytes, until memory is exhausted and the process is killed.

MegaIng commented 1 year ago

Yes, the Earley parser has O(n^3) worst-case scaling in runtime (and probably also in space, although I am not actually sure about that). Use parser="lalr" instead. If that doesn't work, you will need to rework your grammar to fit the LALR(1) algorithm, which should definitely be possible for CSV. Earley just isn't designed for such large inputs.
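
For illustration, a minimal sketch of what such a rework might look like (this grammar is not from the thread; the rule shapes and terminal names are my own). The key change is using multi-character terminals so the lexer emits whole fields as unambiguous tokens; the original single-character TEXTDATA forces the parser to consider every possible split of every field. Since a header line is syntactically identical to a record, this sketch treats every line as a record:

```python
import sys
from lark import Lark

# Hypothetical LALR(1)-friendly rework of the grammar from the report.
# Fields are tokenized as whole units instead of single characters, and
# every line is treated as a record (a header cannot be distinguished
# from a record syntactically).
csv_parser = Lark(r"""
    file: record (_NL record)* _NL?
    record: field ("," field)*
    field: ESCAPED | NONESCAPED
    ESCAPED: "\"" ("\"\"" | /[^"]/)* "\""
    NONESCAPED: /[^",\r\n]+/
    _NL: /\r?\n|\r/
""", start="file", parser="lalr")

# newline="" preserves \r\n sequences so the grammar sees them verbatim.
with open(sys.argv[1], "r", newline="") as f:
    tree = csv_parser.parse(f.read())
```

Note that this simplified version requires fields to be non-empty; allowing empty fields under LALR(1) takes a little more care.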

But actually, you should just be using a purpose-built parser for a format as simple as CSV, for example the stdlib csv module.
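
For completeness, reading the same file with the standard library would look like this (a sketch; the command-line argument follows the report's script):

```python
import csv
import sys

# The stdlib csv module handles RFC 4180 quoting, escaped quotes, and
# embedded newlines in a single pass over the input.
# newline="" is the documented way to open files for the csv module.
with open(sys.argv[1], newline="") as f:
    for row in csv.reader(f):
        print(row)
```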

jl6 commented 1 year ago

Thanks for the quick response MegaIng.