infolab-csail / defexpand

Generalizes definitions using DBpedia ontology and WordNet

Add geoboxes, etc. as infoboxes #7

Open michaelsilver opened 9 years ago

michaelsilver commented 9 years ago

In the extract_infoboxes() function, change the regexp that matches infoboxes so that it also detects geoboxes and any other variations on the infobox theme. This directly affects which classes get stored in the WikithingsDB, so I'd like to maximize the types of boxes that get matched.

I believe @fakedrake had the magic regexp to match all versions of infoboxes, geoboxes, etc. in WikipediaBase, but I can't seem to find it now. Anyone know the best regexp for this?

alvaromorales commented 9 years ago

From WikipediaBase/infobox.py:

# Various names under which you may find an infobox
box_rx = ur"\b(infobox|Infobox|taxobox|Taxobox)\b"

michaelsilver commented 9 years ago

Awesome, thanks! Any reason not to include geoboxes (as in the East River article)?

alvaromorales commented 9 years ago

That's a to-do; see https://github.com/infolab-csail/WikipediaBase/issues/49
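
For illustration only, here is a minimal sketch of how the quoted pattern might be extended to cover geoboxes. It is not the WikipediaBase or defexpand implementation: the names BOX_RX and box_kinds are hypothetical, the sample wikitext is made up, and the use of re.IGNORECASE in place of listing both capitalizations is an assumption. It is written for Python 3, so the Python 2 `ur` prefix is dropped.

```python
import re

# Hypothetical extension of the pattern quoted from WikipediaBase/infobox.py:
# adds "geobox" and matches case-insensitively rather than spelling out
# both "infobox" and "Infobox".
BOX_RX = re.compile(r"\b(infobox|taxobox|geobox)\b", re.IGNORECASE)


def box_kinds(wikitext):
    """Return the lowercased box keyword for every match in the wikitext."""
    return [m.group(1).lower() for m in BOX_RX.finditer(wikitext)]


# Made-up sample markup, not copied from any real article.
sample = "{{Geobox|River\n| name = East River\n}}\n{{Infobox settlement}}\n{{cite web}}"
kinds = box_kinds(sample)
```

Like the original, this pattern only checks word boundaries, so it would also match the word "infobox" in running text. Whether that matters depends on how extract_infoboxes() consumes the matches.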