Closed gbogard closed 8 months ago
Thanks for opening this PR.
This improves the parsing performance of the Michael Bryant wiki page a lot.
The problem is that the following string now parses differently from Parsoid:
{|
|{{sort|Aldwych Theatre
|Name1
|-
|Name2}}
|}
Testcase passing in main:
case: {|\n|{{sort|Aldwych Theatre\n|Name1\n|-\n|Name2}}\n|}
node: [Table {attributes: [], captions: [], rows: [TableRow([], [TableCell(Ordinary, , [Template([Text(sort)], [Parameter(, [Text(Aldwych Theatre)]), Parameter(, [Text(Name1)]), Parameter(, [Text(-)]), Parameter(, [Text(Name2)])])])])]}]
warn: []
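To see why the AST above comes out this way: once `{{sort|` opens a template, every subsequent `|` is consumed as a parameter separator until the closing `}}`, so the `|-` row delimiter never reaches the table parser and instead becomes a plain parameter value. A minimal, self-contained sketch of that splitting behaviour (not the crate's actual code; `split_template_params` is a hypothetical helper for illustration):

```rust
// Illustrative only: shows how `|` tokens inside an unclosed `{{ ... }}`
// become template parameters instead of table row/cell marks.
fn split_template_params(wikitext: &str) -> Option<Vec<String>> {
    let inner = wikitext.strip_prefix("{{")?.strip_suffix("}}")?;
    Some(inner.split('|').map(str::to_string).collect())
}

fn main() {
    let tpl = "{{sort|Aldwych Theatre\n|Name1\n|-\n|Name2}}";
    let parts = split_template_params(tpl).unwrap();
    // The `|-` row delimiter ends up as the plain parameter value "-",
    // matching the Parameter(, [Text(-)]) node in the testcase output.
    println!("{:?}", parts);
}
```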
Since I currently don't fully understand the codebase, I'm not sure what the best way forward is to fix #1. If possible, I would like to stay as close as possible to the behaviour of Parsoid; I think that was Portström's intention when he created this crate.
I'll think about it and will let you know how I want to proceed. Cheers
Hi,
Thank you for taking the time to review this PR. While testing the library on a large sample of articles, I realised that there was an issue with my code that caused the `unreachable!` statement in the `get_table` function in `table.rs` to be reached at runtime, panicking the thread.
My latest commit prevents this from happening by continuing to the next loop iteration after a template is implicitly closed, instead of calling the parsing functions from `table.rs` directly. This way, the type of the element at the top of the stack is checked before calling `parse_inline_token`.
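The control-flow change described above can be sketched roughly like this. This is a hypothetical illustration, not the crate's actual code: `OpenNode` and the handler names are stand-ins. The point is only that after an implicit close, `continue` makes the loop re-inspect the (new) top of the stack before any table-parsing function runs:

```rust
// Hypothetical stand-in for the parser's open-element stack.
#[derive(Debug, PartialEq)]
enum OpenNode {
    Table,
    Template,
}

/// Returns which handler each token was dispatched to, for illustration.
fn dispatch(tokens: &[&str]) -> Vec<String> {
    // Start as in the failing testcase: a template left open inside a table.
    let mut stack = vec![OpenNode::Table, OpenNode::Template];
    let mut log = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let token = tokens[i];
        // Table-closing token while a template is still open: implicitly
        // close the template, then `continue` so the next iteration
        // re-checks the stack instead of calling table code directly.
        if token == "|}" && stack.last() == Some(&OpenNode::Template) {
            stack.pop();
            continue; // same token is re-examined with Table now on top
        }
        match stack.last() {
            Some(OpenNode::Table) => log.push(format!("table:{token}")),
            Some(OpenNode::Template) => log.push(format!("inline:{token}")),
            None => break,
        }
        i += 1;
    }
    log
}

fn main() {
    // `|Name2` is handled as inline content; `|}` first implicitly closes
    // the template, then is dispatched to the table handler.
    println!("{:?}", dispatch(&["|Name2", "|}"]));
}
```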
However, while the library no longer panics, the fact that this bug happened in the first place means that my code incorrectly assumed a table was being parsed when it was another element. I still need to figure out which articles caused this and exactly why, and add tests.
In the meantime, it is perhaps best to close this PR. I still believe we should try to automatically close the templates left open by Wikipedia authors; however, my current implementation of this idea isn't mature enough to be merged as-is. I will continue to work on a better implementation :)
Cheers
This is an improvement related to issue #1. It does not solve the exponential time problem in all cases, but allows the library to parse the article "Michael Bryant (actor)" and similarly-formatted articles. I have explained the fix proposal in more detail in a comment on the issue.