kermitt2 / entity-fishing

A machine learning tool for fishing entities
http://nerd.readthedocs.io/
Apache License 2.0

Slow loading of the Wikidata .bz2 dump #105

Open kermitt2 opened 4 years ago

kermitt2 commented 4 years ago

The Wikidata dump has become very large, with 1.2 billion statements, which makes the initial loading of the .bz2 dump into lmdb particularly slow.
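For context, this loading step streams the compressed JSON dump entity by entity and writes the extracted statements into the key-value store, so the whole pass is bound by decompressing and parsing every entity. The snippet below is only an illustrative sketch of such a streaming pass (assuming Apache Commons Compress on the classpath), not the actual entity-fishing loader:

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class WikidataDumpStreamSketch {
    public static void main(String[] args) throws IOException {
        String dumpPath = args[0]; // path to the latest-all.json.bz2 dump
        long count = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new BZip2CompressorInputStream(
                        new BufferedInputStream(new FileInputStream(dumpPath)), true),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // each line except the opening "[" and closing "]" holds one entity as JSON;
                // the loader parses it and writes its statements into the store
                if (line.length() < 2)
                    continue;
                count++;
            }
        }
        System.out.println("Entities streamed: " + count);
    }
}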

To speed up this step, we could try:

kermitt2 commented 4 years ago

Complementary info:

The good news is that the increase in Wikidata volume does not affect runtime, only the storage size.

oterrier commented 4 years ago

Hi Patrice,

According to https://www.wikidata.org/wiki/Wikidata:Statistics, one of the main reasons for the explosion of the statement db is that most published scientific articles now have an entry in Wikidata. They currently represent more than 22M concepts out of 71M.

I understand the interest of being able to build graphs between authors and articles, but it is not very interesting for entity-fishing, given that these scholarly articles have no associated Wikipedia pages and have long titles that cannot be recognized by the current EF mention recognizers. Take the "Attention Is All You Need" paper for example: https://www.wikidata.org/wiki/Q30249683

So one possible optimization of the statement db size would be to be able to filter out some classes ("scholarly article" being one of them) when initially building the lmdb database. Let's imagine you can define such a filtering constraint somewhere (or hard-code it?), for example in the kb.yaml file:

#dataDirectory: /home/lopez/resources/wikidata/

# Exclude scholarly articles from statement db
excludedConceptStatements:
  - conceptId:
    propertyId: P31
    value: Q13442814
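To illustrate how such an entry could be read, here is a minimal, hypothetical sketch (assuming SnakeYAML; the class name and loading code are illustrative, not the existing kb.yaml machinery), where a null field acts as a wildcard:

import org.yaml.snakeyaml.Yaml;

import java.util.List;
import java.util.Map;

public class ExcludedStatementConfigSketch {
    public static void main(String[] args) {
        // same structure as the proposed kb.yaml entry above
        String yaml =
            "excludedConceptStatements:\n" +
            "  - conceptId:\n" +
            "    propertyId: P31\n" +
            "    value: Q13442814\n";

        Map<String, Object> config = new Yaml().load(yaml);
        @SuppressWarnings("unchecked")
        List<Map<String, String>> filters =
                (List<Map<String, String>>) config.get("excludedConceptStatements");

        for (Map<String, String> f : filters) {
            // a null conceptId means "any concept"; here every concept with
            // P31 ("instance of") = Q13442814 ("scholarly article") would be excluded
            System.out.println("conceptId=" + f.get("conceptId")
                    + " propertyId=" + f.get("propertyId")
                    + " value=" + f.get("value"));
        }
    }
}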

When filling the statement db, if I detect a concept meeting the constraint ("instance of" "scholarly article", for example), then I skip this concept and do not store its statements:

        // 'exclude' is assumed to be declared and reset to false for each item earlier in the loop
        if ((propertyId != null) && (value != null)) {
            if (excludedConceptStatements != null) {
                for (Statement excludedConceptStatement : excludedConceptStatements) {
                    // a null field in the exclusion constraint acts as a wildcard
                    exclude = (excludedConceptStatement.getConceptId() == null || excludedConceptStatement.getConceptId().equals(itemId)) &&
                            (excludedConceptStatement.getPropertyId() == null || excludedConceptStatement.getPropertyId().equals(propertyId)) &&
                            (excludedConceptStatement.getValue() == null || excludedConceptStatement.getValue().equals(value));
                    if (exclude)
                        break;
                }
            }
            Statement statement = new Statement(itemId, propertyId, value);
            //System.out.println("Adding: " + statement.toString());
            if (!statements.contains(statement))
                statements.add(statement);
        }
...
...
        // statements of an excluded concept are never written to the db
        if (statements.size() > 0 && !exclude) {
            try {
                db.put(tx, KBEnvironment.serialize(itemId), KBEnvironment.serialize(statements));
                nbToAdd++;
                nbTotalAdded++;
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

With this, I think we can considerably reduce the size of the statement db.

I can even propose a PR for such a mechanism.

Best regards, Olivier

lfoppiano commented 1 day ago

Hi @oterrier,

I understand the interest of being able to build graphs between authors and articles, but it is not very interesting for entity-fishing, given that these scholarly articles have no associated Wikipedia pages and have long titles that cannot be recognized by the current EF mention recognizers.

There is an option in version 0.6 of entity-fishing that should ignore Wikidata concepts that have no associated Wikipedia page, in kb.yaml:

# if true, the statements are loaded only for concepts having at least one
# Wikipedia page in a supported language
restrictConceptStatementsToWikipediaPages: true
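As a rough picture of what that option does, a concept's statements are kept only when it has a Wikipedia page in at least one supported language. The helper below is just a hypothetical illustration (sitelinks simplified to a site-to-title map), not the actual entity-fishing code:

import java.util.List;
import java.util.Map;

public class WikipediaPageFilterSketch {
    // keep a concept's statements only if its sitelinks include a Wikipedia page
    // in one of the supported languages (e.g. "enwiki", "frwiki", ...)
    static boolean keepStatements(Map<String, String> sitelinks, List<String> supportedLanguages) {
        for (String lang : supportedLanguages) {
            if (sitelinks.containsKey(lang + "wiki"))
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // "Attention Is All You Need" (Q30249683) now has an enwiki sitelink,
        // so with this restriction its statements would still be loaded
        Map<String, String> sitelinks = Map.of("enwiki", "Attention Is All You Need");
        System.out.println(keepStatements(sitelinks, List.of("en", "fr", "de")));
    }
}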

I suppose you are using 0.6; did you notice any improvement in size?

As for the paper you cited, this has probably changed after this ticket was created, but it now has a Wikipedia page. Still, the possibility of ignoring some specific categories could be offered as a configuration option, as you proposed.

oterrier commented 20 hours ago

Hi @lfoppiano, I didn't know about this new feature, and it is of course of interest. Nevertheless, I think filtering targeted concepts is key too.