alvaromorales closed this issue 9 years ago
Thank you! This data is really useful, so glad to hear it will be auto-generated with each re-index. TSV should be more machine-readable.
This issue should probably be in WikiMap, because that is where the library for reading the Excel file (or TSV) lives. I import WikiMap into WikiThingsDB for this feature.
We should probably discuss (maybe in a separate issue) where best to store data that is used in code, for example the HTML dumps in defexpland and this infoboxes file. They should be stored and downloadable from some place, not stored in the GitHub repo, I think. Thoughts?
On Aug 11, 2015, at 10:37 PM, Alvaro Morales notifications@github.com wrote:
I noticed you added the infoboxes.xlsx spreadsheet to the repo. I created that file manually a few months ago for quick analysis of infobox frequency. I wasn't really intending for it to be machine-readable.
Now that the data has turned out to be useful for wikithingsdb and wikimap, I'll make elasticstart generate this file automatically on every reindex. For simplicity, I'll output a tab-separated file with the following format:
    class	count
    settlement	354090
    person	102384

The TSV format should simplify the way you read in data.
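For illustration, reading a file in that format could look roughly like the sketch below. It assumes a header row of `class` and `count` followed by one infobox class per line; the function name `read_infobox_counts` is hypothetical and is not taken from wikimap, wikithingsdb, or elasticstart.

```python
import csv
import io


def read_infobox_counts(tsv_file):
    """Return a dict mapping infobox class -> count.

    Assumes (not confirmed by the thread) a header row 'class<TAB>count'.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row["class"]: int(row["count"]) for row in reader}


# In-memory sample using the two rows quoted above; a real caller would
# pass an open file handle to the generated TSV instead.
sample = io.StringIO("class\tcount\nsettlement\t354090\nperson\t102384\n")
counts = read_infobox_counts(sample)
print(counts["settlement"])
```

Compared with the spreadsheet, this needs only the standard library, so no Excel-reading dependency is required.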
Aren't you going to import WikiThingsDB in WikiMap? Circular dependency issue again. Perhaps it would be better to move it here. My understanding of WikiThingsDB is that it will be a repository that provides access to all data related to Wikipedia. Applications such as WikiMap can build on top of WikiThingsDB.
It will get tricky to manage multiple repositories, though.
I no longer plan to import WikiMap, since I have no good reason to: I can get all infoboxes straight from the WikiExtractor's output. This should now be an issue in WikiMap instead.
This issue was moved to infolab-csail/wikimap#7
FYI, I moved the GitHub issue using https://github-issue-mover.appspot.com/