semanticize / semanticizest

Standalone Semanticizer
Apache License 2.0

Nested (infobox) templates result in empty pages #23

Open IsaacHaze opened 10 years ago

IsaacHaze commented 10 years ago

There are no links found for the sample pages "Andre Agassi" or "Groen (partij)".


In [1]: from semanticizest.parse_wikidump import clean_text

In [2]: clean_text("""{{ def }}abc""")
Out[2]: 'abc'

In [3]: clean_text("""{{ def {{123}} }}abc""")
Out[3]: ' }}abc'

In [4]: clean_text("""{{ def
   ...: | asd = [[34]]
   ...: | wqe = {{be|blaat}}
   ...: | vrouwen = 
   ...: }}
   ...: [[nep:perd|0px]]
   ...: abc
   ...: """)
Out[4]: '\n'

The _UNWANTED regex does not cope with nested templates and needs tweaking...
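
One possible direction, sketched here rather than taken from the semanticizest code: a single regular expression cannot match arbitrarily nested {{...}} pairs, so the templates could be stripped in a separate pass that tracks brace depth before the rest of the cleanup runs. The function name strip_templates is hypothetical, and the sketch only removes templates. It leaves [[...]] links and other markup alone and does not special-case triple-brace parameters such as {{{1}}}.

```python
def strip_templates(text):
    """Remove {{...}} templates, including nested ones, by tracking brace depth.

    Hypothetical helper for illustration; not part of semanticizest.
    """
    out = []
    depth = 0  # number of currently open "{{" pairs
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "{{":
            # Entering a (possibly nested) template.
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            # Leaving the innermost open template.
            depth -= 1
            i += 2
        else:
            # Keep characters only when outside every template.
            if depth == 0:
                out.append(text[i])
            i += 1
    return "".join(out)


print(repr(strip_templates("{{ def }}abc")))          # 'abc'
print(repr(strip_templates("{{ def {{123}} }}abc")))  # 'abc'
```

With this pass, the second example above yields 'abc' instead of ' }}abc', and text following a nested infobox is kept because the scan only resumes copying once the depth returns to zero.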

larsmans commented 10 years ago

Meh... stupid nested wikisyntax...