molybdenum-99 / infoboxer

Wikipedia information extraction library
MIT License

Hangs when parsing an invalid table #78

Closed robfors closed 4 years ago

robfors commented 4 years ago

When I try to parse the English Wikipedia page for Arthashastra, Infoboxer hangs and consumes memory until the system runs out and the process is killed.

This can currently be reproduced with:

Infoboxer.wikipedia.get('Arthashastra')

I have narrowed the problem down to the table in the article's Organisation section. After simplifying the table, I can still reproduce the behavior with:

{| 
 |+ ''A'' |
 ! B
|}

It does not hang if I remove the second pipe on the second line, which I think may be invalid wikitext. It also does not hang when I remove the italics. I am no wikitext expert, so I will leave further analysis to someone else.

I am using Infoboxer 0.3.2.
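For anyone reproducing a hang like this, it may be safer to bound the call with a timeout so the process is stopped before memory runs out. The following is a minimal sketch using Ruby's standard `timeout` library. The helper name `finishes_within?` is hypothetical and not part of Infoboxer, and a timeout limits only run time, not memory use.

```ruby
require 'timeout'

# Hypothetical helper (not part of Infoboxer): runs the given block and
# returns true if it completes within `seconds`, false if it times out.
def finishes_within?(seconds)
  Timeout.timeout(seconds) { yield }
  true
rescue Timeout::Error
  false
end

# Possible usage, assuming the infoboxer gem is installed and the network
# is reachable (left commented out so the sketch stays self-contained):
#
#   require 'infoboxer'
#   finishes_within?(10) { Infoboxer.wikipedia.get('Arthashastra') }
```

A result of `false` suggests the parse is stuck rather than just slow, though the right threshold depends on the page size and the connection.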

zverok commented 4 years ago

Thanks for the report! It seems that table captions never really worked properly :thinking: (though this particular example was much trickier than just making them work as expected). I think I have fixed all the problems with captions in current master, including the "broken caption" of the example page and general guarding against infinite loops. Can you please check?
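As an illustration of what guarding against infinite loops can look like in a scanner-based parser, here is a minimal sketch. It is not Infoboxer's actual fix: the names `each_step` and `NoProgressError` are hypothetical, and the only assumption is that the parser reads its input through a `StringScanner`. The idea is to check after every iteration that the scan position has advanced and to fail loudly if it has not.

```ruby
require 'strscan'

# Hypothetical error type, raised when an iteration consumes no input.
class NoProgressError < StandardError; end

# Hypothetical helper: yields the scanner to the block until the input is
# exhausted, raising if any single iteration leaves the position unchanged.
def each_step(text)
  scanner = StringScanner.new(text)
  until scanner.eos?
    before = scanner.pos
    yield scanner
    if scanner.pos == before
      raise NoProgressError, "no input consumed at offset #{before}"
    end
  end
end

# Example: split the simplified table from the report into lines.
lines = []
each_step("{|\n|+ ''A'' |\n! B\n|}") do |s|
  lines << s.scan(/[^\n]*/) # take everything up to the end of the line
  s.skip(/\n/)              # then step over the newline, if any
end
```

With a guard like this, a step that fails to advance turns into an immediate exception instead of an unbounded loop that keeps allocating memory.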

robfors commented 4 years ago

It's working great now. Thanks for the quick fix!