Open michaelsilver opened 9 years ago
From WikipediaBase/infobox.py:
# Various names under which you may find an infobox
box_rx = ur"\b(infobox|Infobox|taxobox|Taxobox)\b"
Awesome, thanks! Any reason not to include geoboxes? (As in the East River).
That's a to-do, see https://github.com/infolab-csail/WikipediaBase/issues/49
In the extract_infoboxes() function, change the regexp that matches infoboxes so that it also detects geoboxes and any other variations on the infobox theme. This directly affects which classes get stored in the WikithingsDB, so I'd like to maximize the types of 'boxes that get matched. I believe @fakedrake had the magic regexp to match all versions of infoboxes, geoboxes, etc. in WikipediaBase, but I can't seem to find it now. Does anyone know the best regexp for this?
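For discussion, here is a minimal sketch of one way the pattern could be broadened. It assumes only the three box families named in this thread (infobox, taxobox, geobox) and is not the regexp @fakedrake wrote. The names BOX_RX and find_box_kinds are hypothetical, and the snippet uses Python 3 string syntax rather than the ur"" prefix seen in infobox.py.

```python
import re

# Hypothetical broadened pattern (an assumption, not the WikipediaBase original).
# re.IGNORECASE replaces the explicit "infobox|Infobox" style alternatives.
BOX_RX = re.compile(r"\b(infobox|taxobox|geobox)\b", re.IGNORECASE)


def find_box_kinds(markup):
    """Return the lowercased box keywords found in a string of wiki markup."""
    return [m.group(1).lower() for m in BOX_RX.finditer(markup)]


if __name__ == "__main__":
    sample = "{{Geobox|River}} ... {{Infobox person}} ... {{taxobox}}"
    print(find_box_kinds(sample))
```

Further box families could be added to the alternation if they turn up in practice. Like the existing pattern, the \b anchors mean a keyword followed directly by an underscore or letter (for example "Infobox_person" or "infoboxes") will not match.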