soerenmeier / parse-wiki-text-2

Parsing has worst case exponential runtime #1

Open xmakro opened 1 year ago

xmakro commented 1 year ago

Thank you for reviving this crate. I noticed that this parser has a worst-case exponential runtime of O(2^n). The following example input showcases this behavior:

use parse_wiki_text_2::Configuration;

fn main() {
    Configuration::default().parse("{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{");
}

In this example, we keep opening template tags with {{, but there is never a }} to close them. Unfortunately, all wiki markup is valid, so the parser has to handle such input. The parser implementation of this crate has backtracking logic for the unclosed tags when it reaches the end of the document (see code in parser). Since the parser has to rewind once for every opening tag, the runtime becomes exponential and it never finishes for the example input. Unfortunately, this is not a hypothetical scenario; it occurs when parsing real documents from Wikipedia.
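
To see the growth in practice, one can time the parser on inputs with an increasing number of unclosed {{ (a rough sketch using the same API as the example above; exact timings vary by machine, and for larger n the parse may effectively never finish):

use std::time::Instant;

use parse_wiki_text_2::Configuration;

fn main() {
    // Each additional unclosed "{{" roughly doubles the parse time, so
    // increase the upper bound of n with caution.
    for n in 1..=20 {
        let input = "{{".repeat(n);
        let start = Instant::now();
        let _ = Configuration::default().parse(&input);
        println!("n = {n:2}, elapsed = {:?}", start.elapsed());
    }
}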

soerenmeier commented 1 year ago

Thanks for bringing this to my attention. What would be the expected result? Some runtime error like Error::TooManyRewinds? Or should everything be treated as text?

xmakro commented 1 year ago

In this example, everything should be treated as text. You can preview the reference output by pasting the input into the official wiki markup to HTML conversion editor.

This is also what this crate does for smaller inputs. Example:

fn main() {
    let x = parse_wiki_text_2::Configuration::default().parse("[[[[[[[[[");
    println!("{x:?}");
}

Correctly outputs:

Output { nodes: [Text { end: 9, start: 0, value: "[[[[[[[[[" }], warnings: [...] }

This is thanks to this section in the code. Unfortunately, the backtracking logic takes too long if there are too many unclosed tags.

erictapen commented 10 months ago

I think I'm running into this problem when trying to parse the current Wikipedia article on Michael Bryant (actor). The article has a "Stage credits" section that contains 29 entries like this:

! scope="row" | ''[[The Homecoming]]''
| {{dts|format=dmy|1965|June}}
| {{sort|[[Aldwych Theatre]]
| Teddy
|
|-

Note the unclosed {{ in the third line.

I didn't manage to parse the full table on my CPU, but the growth in parse time is definitely exponential (y axis of the plot below in seconds). At 29 entries, the full table would take about an hour on my CPU.

[Plot: parse time in seconds against the number of table entries]

Closing the {{ tokens solves the problem, but the unclosed markup seems to be rendered correctly by MediaWiki.
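
The growth can be reproduced by repeating the row above and timing the parse for increasing row counts (a rough sketch; the real table's header and surrounding markup are reduced to a bare {| ... |} wrapper, and timings will differ by machine):

use std::time::Instant;

use parse_wiki_text_2::Configuration;

fn main() {
    // One row of the "Stage credits" table, with the unclosed {{sort template.
    let row = "! scope=\"row\" | ''[[The Homecoming]]''\n\
               | {{dts|format=dmy|1965|June}}\n\
               | {{sort|[[Aldwych Theatre]]\n\
               | Teddy\n\
               |\n\
               |-\n";

    // Parse time roughly doubles with every additional row, so stay well
    // below the 29 rows of the real article.
    for rows in 1..=15 {
        let input = format!("{{|\n{}|}}", row.repeat(rows));
        let start = Instant::now();
        let _ = Configuration::default().parse(&input);
        println!("rows = {rows:2}, elapsed = {:?}", start.elapsed());
    }
}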

gbogard commented 9 months ago

Hi! I work at Qwant, a search engine company, where we use an internal fork of this crate to parse Wikipedia pages. We have a working fix for the article Michael Bryant (actor) and we are willing to contribute it if you think it's a good idea. I have opened a pull request (#2).

This particular article has unclosed templates inside table rows, and it seems that MediaWiki is okay with that, presumably because unclosed templates are implicitly closed when their parent element closes. The fix we have implemented so far is this:

In parse.rs, when the | character is matched and the element at the top of the stack is an OpenNode of type Template, instead of immediately calling parse_template_separator(&mut state), we read the next character. If the next character is anything but -, we call parse_template_separator(&mut state) just like before; but if the character is -, the token is probably not a template separator but rather the start of a table row. In that case we call a close_template_implicitly function that pops the open template from the stack (leaving scan_position unchanged) and emits a warning about the unclosed template that needed implicit closure, and then we parse the table element using table::parse_inline_token(&mut state). A rough sketch of this logic follows below.
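
A rough sketch of what this could look like (the State struct, OpenNodeType enum, and the helper bodies below are simplified stand-ins written for this sketch, not the actual parse.rs code; only the names parse_template_separator, close_template_implicitly, and the table token parser come from the description above):

#![allow(dead_code)]

struct State<'a> {
    wiki_text: &'a str,
    scan_position: usize,
    stack: Vec<OpenNodeType>,
    warnings: Vec<String>,
}

enum OpenNodeType {
    Template,
    Table,
    // ... other node types elided
}

fn handle_pipe(state: &mut State) {
    let template_on_top = matches!(state.stack.last(), Some(OpenNodeType::Template));
    // Peek at the character after the '|' without consuming it.
    let next = state
        .wiki_text
        .get(state.scan_position + 1..)
        .and_then(|rest| rest.chars().next());

    if template_on_top && next == Some('-') {
        // "|-" most likely starts a table row: close the open template
        // implicitly and hand the token to the table parser.
        close_template_implicitly(state);
        parse_table_inline_token(state);
    } else if template_on_top {
        // Any other character keeps the old behaviour.
        parse_template_separator(state);
    }
    // A '|' with no open template is handled elsewhere in the real parser.
}

fn close_template_implicitly(state: &mut State) {
    // Pop the open template but leave scan_position unchanged, and record a
    // warning so the caller knows the template was closed implicitly.
    state.stack.pop();
    state
        .warnings
        .push("unclosed template closed implicitly at table row".to_string());
}

// Stand-ins for the real parsing routines in parse.rs and table.rs.
fn parse_template_separator(_state: &mut State) {}
fn parse_table_inline_token(_state: &mut State) {}

fn main() {
    let wiki_text = "{{sort|[[Aldwych Theatre]]\n|-\n";
    let mut state = State {
        wiki_text,
        // Position of the '|' that starts the "|-" table row token.
        scan_position: wiki_text.find("|-").unwrap(),
        stack: vec![OpenNodeType::Template],
        warnings: Vec::new(),
    };
    handle_pipe(&mut state);
    println!("warnings: {:?}", state.warnings);
}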

This fix allows us to parse the Michael Bryant article successfully in a matter of milliseconds. Presumably, there are other situations where we would need to close open templates implicitly, and other elements that should act as template boundaries.

We have only implemented this for table rows so far, and it doesn't quite solve the exponential-time issue at its root, but we think it's still an improvement.


Also related to this issue, we have capped the parsing loop in parse.rs at a configurable amount of time spent on the document. Every 500 iterations, we check how much time we have spent parsing and force a rewind whenever it exceeds the configured limit. I could also contribute this if you are interested, but frankly this is rather hacky; it would be better to fix the root cause of this exponential use of CPU time.
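
Roughly, that guard looks like this (the State type, parse_step, and force_rewind below are simplified stand-ins written for this sketch, not the actual parse.rs code):

use std::time::{Duration, Instant};

// Hypothetical stand-in for the real parser state.
struct State {
    scan_position: usize,
    input_len: usize,
    done: bool,
}

fn parse_step(state: &mut State) {
    // Stand-in for one iteration of the real parsing loop.
    state.scan_position += 1;
    if state.scan_position >= state.input_len {
        state.done = true;
    }
}

fn force_rewind(_state: &mut State) {
    // In the real parser this abandons the deepest open node so that its
    // opening token is reinterpreted as plain text.
}

fn parse_document(state: &mut State, max_duration: Duration) {
    let start = Instant::now();
    let mut iterations: u64 = 0;

    while !state.done {
        parse_step(state);
        iterations += 1;

        // Check the clock only every 500 iterations to keep the overhead low.
        if iterations % 500 == 0 && start.elapsed() > max_duration {
            force_rewind(state);
        }
    }
}

fn main() {
    let mut state = State { scan_position: 0, input_len: 10_000, done: false };
    parse_document(&mut state, Duration::from_millis(500));
    println!("finished after scanning {} positions", state.scan_position);
}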

emmalexandria commented 8 months ago

I'm having the exact same issue with the Michael Bryant (actor) article, and a few others! I made a list in #3, but if they're the same issue I might as well close that issue and post my 'research' here. Here are the articles I've found with this problem so far:

I'm currently parsing literally all of English Wikipedia, so as I find more I can update this list.

emmalexandria commented 8 months ago

While the time-limit approach is hacky, I think implementing some kind of configurable maximum parse time would be hugely helpful while a better fix is worked on. I don't know how reusable the code of the parse function is, but it would be great to have something like parse_with_timeout which takes a maximum number of milliseconds. The main issue I see is that it would be an addition to the API and a breaking change to remove once the issue is fixed.
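
For illustration, a hypothetical shape for such an API (Configuration, Output, and ParseError below are stand-in types so the sketch compiles on its own; they are not the crate's real types, and the actual design would be up to the maintainer):

#![allow(dead_code)]

use std::time::Duration;

struct Configuration;

#[derive(Debug)]
struct Output;

#[derive(Debug)]
enum ParseError {
    TimedOut,
}

impl Configuration {
    // Hypothetical API: parse, but give up once the timeout has elapsed.
    fn parse_with_timeout(&self, _wiki_text: &str, _timeout: Duration) -> Result<Output, ParseError> {
        // A real implementation would run the normal parsing loop and check
        // the elapsed time periodically, as described in the comments above.
        Err(ParseError::TimedOut)
    }
}

fn main() {
    let result = Configuration.parse_with_timeout("{{{{{{", Duration::from_millis(100));
    println!("{result:?}");
}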

soerenmeier commented 8 months ago

You're right. As a first fix the execution time should be capped.

emmalexandria commented 8 months ago

I have a basic working implementation in a fork right now. I can submit a PR tomorrow.