Open anoopsarkar opened 10 years ago
we should also be crawling the lists of lists on Wikipedia, e.g.
I still have not figured out how to use the "Selected anniversaries articles" category: http://en.wikipedia.org/wiki/Category:Selected_anniversaries_articles
but I noticed that this article has a link to all historical anniversaries for all days of a year: http://en.wikipedia.org/wiki/List_of_historical_anniversaries
for example "June 6": http://en.wikipedia.org/wiki/June_6
It has a complete list of events that happened on "June 6" (a superset of the selected events in the article you mentioned above: http://en.wikipedia.org/wiki/Wikipedia:Selected_anniversaries/June_6?oldid=366403568)
It is interesting that some of these events have not been mentioned in the year articles. for example:
These events can be added to the events crawled from year articles, but we still need to find a way to avoid overlaps. For example, "Gustav Vasa" on June 6 is mentioned in both the year and day articles:
http://en.wikipedia.org/wiki/1523 June 6 – Gustav Vasa is elected king of Sweden, finally establishing its full independence from Denmark, marking the end of the Kalmar Union.
http://en.wikipedia.org/wiki/June_6 1523 – Gustav Vasa, the Swedish regent, is elected King of Sweden, marking a symbolic end to the Kalmar Union. This is the Swedish national day.
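To make the overlap concrete, here is a rough sketch that parses both entry styles into a shared (year, month, day) key and groups the texts under it. The function names and the assumed line formats ("June 6 – text" on year pages, "1523 – text" on day pages) are only illustrative; real pages will have more variation than this handles. Duplicates are kept side by side rather than merged.

```python
import re
from collections import defaultdict

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Assumed formats: year pages use "June 6 – <text>", day pages use "1523 – <text>".
# Both an en dash and a plain hyphen are accepted as the separator.
YEAR_PAGE_RE = re.compile(r"^([A-Z][a-z]+) (\d{1,2}) [–-] (.+)$")
DAY_PAGE_RE = re.compile(r"^(\d{1,4}) [–-] (.+)$")


def parse_year_page_entry(year, line):
    """Parse one entry from a year article; the year comes from the page title."""
    m = YEAR_PAGE_RE.match(line.strip())
    if m is None or m.group(1) not in MONTHS:
        return None
    key = (year, MONTHS.index(m.group(1)) + 1, int(m.group(2)))
    return key, m.group(3)


def parse_day_page_entry(month, day, line):
    """Parse one entry from a day article; month and day come from the page title."""
    m = DAY_PAGE_RE.match(line.strip())
    if m is None:
        return None
    return (int(m.group(1)), month, day), m.group(2)


def group_by_date(parsed_entries):
    """Collect every text seen for the same (year, month, day); nothing is dropped."""
    groups = defaultdict(list)
    for entry in parsed_entries:
        if entry is not None:
            key, text = entry
            groups[key].append(text)
    return dict(groups)


from_year = parse_year_page_entry(
    1523,
    "June 6 – Gustav Vasa is elected king of Sweden, finally establishing its "
    "full independence from Denmark, marking the end of the Kalmar Union.")
from_day = parse_day_page_entry(
    6, 6,
    "1523 – Gustav Vasa, the Swedish regent, is elected King of Sweden, marking "
    "a symbolic end to the Kalmar Union. This is the Swedish national day.")
events = group_by_date([from_year, from_day])
```

Both Gustav Vasa entries end up under the key (1523, 6, 6), so the overlap is visible in the data without having to decide which wording to keep.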
I would say don't worry about overlaps. Just include duplicates into the data for now. The examples you mention above are not quite identical anyway. As we expand the data we will see the same event multiple times and that is ok I think.
if you go to the main English Wikipedia page, you will see on the lower right part of the screen a section called "On this day ..."
This section, and the archive associated with it, contains historical information about each day in history. Note that this gives us a date as well as a year. It also has some links in bold, signifying the topic of that event. This could be useful as training data. It also links to the page for each year (which we have been crawling already).
A few clicks from the above archive you get this:
http://en.wikipedia.org/wiki/Category:Selected_anniversaries_articles
which leads to many pages that have the year as part of the title. It seems that we could crawl pages with the year in the title in order to augment our data. The first paragraph on this page would be a good summary of the article.
http://en.wikipedia.org/wiki/1982_British_Army_Gazelle_friendly_fire_incident
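A minimal sketch of picking the year out of a title like the one above. The helper names are made up for illustration, and treating any standalone four-digit number as a year is an assumption that will misfire on titles containing other four-digit numbers.

```python
import re
from urllib.parse import unquote, urlparse

# A run of exactly four digits, not embedded in a longer number.
YEAR_IN_TITLE_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def title_from_url(url):
    """Turn an article URL into a readable title (underscores become spaces)."""
    path = urlparse(url).path
    return unquote(path.rsplit("/", 1)[-1]).replace("_", " ")


def year_from_title(title):
    """Return the first four-digit number in the title, or None if there is none."""
    m = YEAR_IN_TITLE_RE.search(title)
    return int(m.group(1)) if m else None


title = title_from_url(
    "http://en.wikipedia.org/wiki/1982_British_Army_Gazelle_friendly_fire_incident")
year = year_from_title(title)
```

Titles without a four-digit number, such as "June 6", simply return None and can be skipped.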
We can also generalize to any article with 'Date' in the infobox, and 'Location' as well. The article itself might contain lat-long as the above one does.
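As a rough illustration of the infobox idea, the sketch below pulls "date" and "location" out of raw wikitext with a line-based regex. The sample wikitext, the template name and the exact field names are invented for the example; real infoboxes use varying field names (and nested templates), which this does not handle.

```python
import re

# One "| name = value" line per field; only spaces and tabs are skipped so that
# an empty value does not swallow the following line.
FIELD_RE = re.compile(
    r"^\|[ \t]*([A-Za-z_ ]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def infobox_fields(wikitext, wanted=("date", "location")):
    """Return the first non-empty value found for each wanted field name."""
    fields = {}
    for name, value in FIELD_RE.findall(wikitext):
        key = name.strip().lower()
        if key in wanted and value:
            fields.setdefault(key, value)
    return fields


# Hypothetical wikitext, not copied from any real article.
SAMPLE = """{{Infobox event
| title    = Example incident
| date     = 6 June 1982
| location = Example location
| notes    =
}}"""

fields = infobox_fields(SAMPLE)
```

An article would qualify for crawling when both keys are present in the returned dictionary.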
In the following page:
http://en.wikipedia.org/wiki/Wikipedia:Selected_anniversaries/June_6?oldid=366403568
notice that the link to the above article is in bold face, "friendly fire incident".
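If we work from the wikitext of that page, the bold links can be found with a small regex. This sketch assumes bold links are written as a wiki link wrapped in triple apostrophes, which should be checked against the real page source; the sample line is invented.

```python
import re

# Matches '''[[Target]]''' and '''[[Target|label]]'''.
BOLD_LINK_RE = re.compile(r"'''\[\[([^\]|]+)(?:\|([^\]]+))?\]\]'''")


def bold_links(wikitext):
    """Return (target, label) pairs for bold links; label falls back to target."""
    return [(target.strip(), (label or target).strip())
            for target, label in BOLD_LINK_RE.findall(wikitext)]


# Hypothetical entry line, written only to exercise the regex.
SAMPLE = ("* [[1982]] – An example entry about a "
          "'''[[1982 British Army Gazelle friendly fire incident|friendly fire incident]]''' "
          "with a plain [[Some other article]] link.")

links = bold_links(SAMPLE)
```

Only the bold link is returned; the plain links on the same line are ignored, which is the distinction we would want for marking the topic of each event.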
Perhaps we can extend our data gathering by including the above sources of information.