infolab-csail / wikithingsdb

A DB of Synonyms, Paraphrases, and Hypernyms for all Wiki Things (Articles)

Use tsv file instead of excel for infobox data #1

Closed alvaromorales closed 9 years ago

alvaromorales commented 9 years ago

I noticed you added the infoboxes.xlsx spreadsheet to the repo. I created that file manually a few months ago for quick analysis of infobox frequency. I wasn't really intending for it to be machine-readable.

Now that the data has turned out to be useful for wikithingsdb and wikimap, I'll make elasticstart generate this file automatically on every reindex. For simplicity, I'll output a tab-separated file with the following format:

class               count
settlement          354090
person              102384

The tsv format should simplify the way you read in data.
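
As a rough illustration of how such a file could be read, here is a minimal Python 3 sketch. It assumes a header row followed by `class<TAB>count` rows, as in the sample above; the function name `read_infobox_counts` is hypothetical and is not part of wikimap or wikithingsdb.

```python
import csv
import io


def read_infobox_counts(lines):
    """Parse 'class<TAB>count' rows into a dict, skipping the header row.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be: class, count)
    counts = {}
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        infobox_class, count = row
        counts[infobox_class] = int(count)
    return counts


# In-memory sample mirroring the format above; a real caller would pass
# an open file handle for the generated TSV instead.
sample = io.StringIO("class\tcount\nsettlement\t354090\nperson\t102384\n")
counts = read_infobox_counts(sample)
```

Using the standard `csv` module with a tab delimiter avoids the spreadsheet-reading dependency that the xlsx file requires.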

michaelsilver commented 9 years ago

Thank you! This data is really useful, so glad to hear it will be auto-generated with each re-index. TSV should be more machine-readable.

This issue should probably be in WikiMap, because that is where the library for reading the Excel file (or TSV) lives. I import WikiMap into WikiThingsDB for this feature.

We should probably discuss (maybe in a separate issue) where best to store data that is used in code, for example the HTML dumps in defexpland and this infoboxes file. They should be stored and downloadable from some place, not stored in the GitHub repo, I think. Thoughts?

[Sent as an email reply to Alvaro Morales's opening comment of Aug 11, 2015, 10:37 PM, which was quoted in full.]

alvaromorales commented 9 years ago

Aren't you going to import WikiThingsDB into WikiMap? That would be the circular dependency issue again. Perhaps it would be better to move the reading code here. My understanding of WikiThingsDB is that it will be a repository that provides access to all data related to Wikipedia. Applications such as WikiMap can build on top of WikiThingsDB.

It will get tricky to manage multiple repositories, though.

michaelsilver commented 9 years ago

I no longer plan to import WikiMap, since I have no good reason to. I can get all infoboxes straight from the WikiExtractor's output. This should now be an issue in WikiMap instead.

michaelsilver commented 9 years ago

This issue was moved to infolab-csail/wikimap#7

michaelsilver commented 9 years ago

FYI, I moved the GitHub issue using https://github-issue-mover.appspot.com/