rloganiv / kglm-data

Code used to create the Linked WikiText-2 dataset

Question about the dataset statistics #1

Closed: shuxiaobo closed this issue 5 years ago

shuxiaobo commented 5 years ago

Hi @rloganiv: Thank you for your contribution to this data-collection repository; it's very helpful for language model research. I now want to build a larger dataset to train my model, but I don't know how much data (in terms of unique entities or mention spans) a complete Wikipedia dump (e.g. enwiki-20190520-pages-articles-multistream.xml.bz2, 15.9 GB) would yield, or how long processing it would take. The dataset in [1] appears to be a post-processed version of WikiText-2, which is not a full Wikipedia dump. At your convenience, could you please share the dataset statistics and time costs for a complete Wikipedia dump, if you have ever processed one? Thank you for your assistance. Looking forward to your reply. Thanks.

[1] Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling

rloganiv commented 5 years ago

Hi @shuxiaobo,

I have not attempted to run the dataset processing code on a full Wikipedia dump, so I cannot provide exact numbers. There are a couple of issues I expect you would encounter:

  1. The Wikipedia dump you posted above is an XML dump. When creating the Linked WikiText-2 dataset I found these dumps extremely difficult to work with, since the articles are written in a markup language (confusingly also called wikitext) which allows for arbitrary macros that are difficult to render as text. Instead of processing the wiki markup we opted to process the article HTML, which you can obtain using the Wikipedia API (see this script). Querying the API for every Wikipedia article would, however, be very time consuming (a rough sketch of such a call is at the end of this comment).

  2. The number of unique entities would be quite large, and accordingly the amount of the Wikidata knowledge graph you would need access to is also quite large - probably too large to fit in memory, which means you would likely need to store/access the knowledge graph on disk. This would also greatly increase the amount of time required to annotate the data.

You will probably need to modify our codebase in order to address these issues (particularly the second one).
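
For reference, here is a rough sketch of the kind of API call I mean, using the requests library. This is not the script linked above; the fetch_article_html helper, the parameters, and the error handling are just illustrative.

    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"  # English Wikipedia endpoint

    def fetch_article_html(title, session=None):
        """Return the rendered HTML of a Wikipedia article via the parse API."""
        session = session or requests.Session()
        params = {
            "action": "parse",
            "page": title,
            "prop": "text",       # only the rendered article body
            "format": "json",
            "formatversion": 2,   # flat JSON instead of the legacy {"*": ...} shape
        }
        response = session.get(API_URL, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        if "error" in data:
            raise ValueError("API error for {}: {}".format(title, data["error"]))
        return data["parse"]["text"]  # HTML string of the rendered article

    html = fetch_article_html("Toni_Morrison")

Reusing a single session and adding a delay between calls keeps the load on the API reasonable, but fetching every article this way is still what makes the process so time consuming.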

shuxiaobo commented 5 years ago

Hi @rloganiv, thanks for your reply.

  1. About the script: I have no idea what the content and format of the input file should be. If we extract the entities from Wikipedia passages, must we use the HTML dump (i.e. recognize entities by their hyperlinks)? I've found the HTML dumps [here].

  2. I've found that the Wikidata KG is a huge monster (54 GB compressed; I think it will be more than 200 GB uncompressed). I have some preliminary ideas for dealing with this. There is a lot of information in the Wikidata dump that we have no need to load into RAM. For example, an entity record looks like this:

    {
    "id": "Q60",
    "type": "item",
    "labels": {},
    "descriptions": {},
    "aliases": {},
    "claims": {},
    "sitelinks": {},
    "lastrevid": 195301613,
    "modified": "2015-02-10T12:42:02Z"
    }

    I'm not sure whether we need the descriptions or claims, but can we discard the sitelinks? We could also discard some kinds of entities, or low-frequency entities, from the Wikidata KG. That's all. What is your opinion on this plan? Is it feasible? Looking forward to your reply. Thanks.

rloganiv commented 5 years ago

  1. Yes. My annotation scripts can only process the article HTML. The link you posted above is broken, but if you plan on using those dumps, be aware that they are quite outdated (> 10 years old).

  2. You do not need descriptions, but you do need:
     - claims, which specify the edges between entities;
     - sitelinks, which map Wikidata entities to Wikipedia articles (you only need the enwiki field, though);
     - aliases, which give the set of strings that refer to the entity.
     A rough sketch of stripping the dump down to just these fields is at the end of this comment.

In my opinion, annotating all of Wikipedia is ambitious. Dealing with the large knowledge graph will be difficult; the existing code in this repository is probably not efficient enough to make it feasible.
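
To make that concrete, here is a rough sketch of how you could stream the compressed dump and keep only those fields in a disk-backed store. None of this is in the repository; the shelve store, the file names, and the English-only aliases are illustrative choices, and something like SQLite or LMDB would work just as well.

    import bz2
    import json
    import shelve

    DUMP_PATH = "latest-all.json.bz2"   # full Wikidata JSON dump (illustrative filename)
    DB_PATH = "wikidata_slim.shelve"    # disk-backed key-value store (illustrative)

    def slim(entity):
        """Keep only what annotation needs: claims, the enwiki sitelink, English aliases."""
        return {
            "claims": entity.get("claims", {}),
            "enwiki": entity.get("sitelinks", {}).get("enwiki", {}).get("title"),
            "aliases": [a["value"] for a in entity.get("aliases", {}).get("en", [])],
        }

    with bz2.open(DUMP_PATH, "rt", encoding="utf-8") as dump, shelve.open(DB_PATH) as db:
        for line in dump:
            line = line.strip().rstrip(",")    # one entity per line inside a big JSON array
            if line in ("[", "]", ""):         # skip the array brackets and blank lines
                continue
            entity = json.loads(line)
            db[entity["id"]] = slim(entity)    # e.g. db["Q60"] -> the slimmed record

A single pass like this still has to decompress the whole dump, so expect it to take many hours, but afterwards entity lookups only touch the disk-backed store rather than RAM.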

rloganiv commented 5 years ago

Oh, and I forgot to answer your question about the input file structure. It is a JSON lines file where each line contains an object with a Wikipedia article title, e.g.

{"title": "Toni_Morrison"}
{"title": "Alan_Turing"}
...
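
If you want to build that file for every article, one rough option is to enumerate titles with the MediaWiki allpages API. Again, this is not part of the repo; the output path and the underscore replacement are just illustrative.

    import json
    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"

    def iter_all_titles(session):
        """Yield every main-namespace article title via the MediaWiki allpages API."""
        params = {"action": "query", "list": "allpages", "apnamespace": 0,
                  "aplimit": "500", "format": "json"}
        while True:
            data = session.get(API_URL, params=params, timeout=30).json()
            for page in data["query"]["allpages"]:
                yield page["title"].replace(" ", "_")   # match the underscore style above
            if "continue" not in data:
                break
            params["apcontinue"] = data["continue"]["apcontinue"]

    with requests.Session() as session, open("titles.jsonl", "w", encoding="utf-8") as f:
        for title in iter_all_titles(session):
            f.write(json.dumps({"title": title}) + "\n")

Enumerating the roughly six million article titles takes on the order of ten thousand requests, which is small compared to the cost of fetching and annotating the article HTML itself.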
rloganiv commented 5 years ago

Hi @shuxiaobo, has this discussion resolved your issue?

shuxiaobo commented 5 years ago

Yes, thanks for your help! @rloganiv