sfu-natlang / lensingwikipedia

Lensing Wikipedia is an interface to visually browse through human history as represented in Wikipedia. This is the source code that runs the website:
http://lensingwikipedia.cs.sfu.ca

Person names incorrect? #28

Closed anoopsarkar closed 11 years ago

anoopsarkar commented 11 years ago

The person names are sometimes incorrect in the JSON file dumped by the crawler.

For example, "Leonardo daVinci Mona Lisa" is presented as a single person name, even though the event does not contain these words in sequence anywhere. There are separate entries for "Leonardo DaVinci" and "Mona Lisa".

This seems to happen very rarely, at least in the current subset of the data in data-small.

msiahbani commented 11 years ago

This is the event in http://en.wikipedia.org/wiki/1962:

The Mona Lisa by Leonardo da Vinci Mona Lisa was assessed for insurance purposes at US$100 million, before the painting toured the United States for several months. It is the highest insurance value for a painting in history. However, the Louvre chose to spend the money that would have been spent on the insurance premium on security instead.

"Leonardo da Vinci Mona Lisa" is recognized as a single person by the NER, although the crawler found two different links, one for "Leonardo da Vinci" and one for "Mona Lisa".

I am still not sure how I should treat such cases, but I will work on it.
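One possible way to flag such cases, given here only as a minimal sketch and not as the project's actual code: treat an NER person name as a suspected merge when it can be written exactly as a sequence of two or more link texts found by the crawler. The function name `split_merged_person` and its inputs (a name string and a list of link texts) are hypothetical; the thread does not describe the real data structures.

```python
def split_merged_person(ner_name, link_texts):
    """Split an NER name into crawler link texts if they tile it exactly.

    Hypothetical helper: returns the list of link texts when two or more
    of them, placed end to end, reproduce ner_name token for token;
    otherwise returns [ner_name] unchanged.
    """
    tokens = ner_name.lower().split()
    # Map lower-cased token tuples back to the original link text.
    known = {tuple(t.lower().split()): t for t in link_texts}

    def tile(start):
        # Try to cover tokens[start:] with known link texts.
        if start == len(tokens):
            return []
        for end in range(len(tokens), start, -1):  # prefer the longest match
            piece = known.get(tuple(tokens[start:end]))
            if piece is not None:
                rest = tile(end)
                if rest is not None:
                    return [piece] + rest
        return None

    parts = tile(0)
    # A single matching link is an ordinary name, so only split on two or more.
    if parts and len(parts) >= 2:
        return parts
    return [ner_name]


links = ["Mona Lisa", "Leonardo da Vinci"]
print(split_merged_person("Leonardo da Vinci Mona Lisa", links))
```

A name that matches a single link, or no links at all, is returned unchanged, so the check only affects names that are fully explained by several separate links.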

anoopsarkar commented 11 years ago

Seems like missing punctuation. If it is an NER error, then let it be. When we scale to the full data, these cases should not matter much. Go ahead and close the issue.

(Sent by email in reply to Maryam Siahbani's comment of May 13, 2013, 11:48 PM.)