maxlath / wikibase-dump-filter

Filter and format a newline-delimited JSON stream of Wikibase entities

certain entities aren't filtered for unknown reason #27

Closed Selina-Mutz closed 4 years ago

Selina-Mutz commented 4 years ago

I tried this claim: --claim 'P31:Q515,Q7930989,Q15284' to get all cities and municipalities from around the world. I found that "Frankfurt am Main" isn't in the resulting file, even though it is an instance of "city" (Q515). "Frankfurt am Main" also has other items in the "instance of" property, but they shouldn't affect the outcome, right? Similar entities like "Munich", which also have multiple items in that property next to the "city" item, are in the resulting file.

I also noticed that the filter shows this after finishing: in: 1736 | total: 9762074 | last entity in: Q84908318. If I understand it correctly, this means 1736 entities out of 9.7 million passed the filter. However, the resulting file has over 14 000 lines, each of which is an entity, right? How does this fit together?

maxlath commented 4 years ago

The problem was that the P31:Q515 claim for "Frankfurt am Main" (Q1794) has a normal rank, while others of its P31 claims have a preferred rank, so the non-preferred claims are considered non-truthy (see https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Truthy_statements). Since the behavior you describe is probably what is expected in most cases, I made a patch (b13c7ca) to include non-truthy claims in the filter test. It is now published as wikibase-dump-filter@v5.0.1 (beware of the module name change).
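To make the truthy rule concrete, here is a minimal Python sketch of how rank affects a claim filter. It is not the module's actual code: the function names are made up for illustration, the entity is a stripped-down stand-in for Q1794, and `Q_PREFERRED` is a placeholder rather than a real item ID. It also assumes that "including non-truthy claims" means testing every claim regardless of rank.

```python
def truthy_claims(claims):
    """Return the truthy subset of one property's claims.

    If any claim has preferred rank, only the preferred ones are truthy;
    otherwise the normal-rank ones are. Deprecated claims are never truthy.
    """
    preferred = [c for c in claims if c["rank"] == "preferred"]
    if preferred:
        return preferred
    return [c for c in claims if c["rank"] == "normal"]


def claim_value_ids(claims):
    """Collect the item IDs the claims point to (skips novalue/somevalue snaks)."""
    ids = set()
    for claim in claims:
        datavalue = claim["mainsnak"].get("datavalue")
        if datavalue is not None:
            ids.add(datavalue["value"]["id"])
    return ids


def matches(entity, prop, wanted_ids, truthy_only):
    """Hypothetical filter test: does the entity have prop set to any wanted ID?"""
    claims = entity.get("claims", {}).get(prop, [])
    if truthy_only:
        claims = truthy_claims(claims)
    # Assumption: with truthy_only=False, claims of every rank are tested.
    return bool(claim_value_ids(claims) & set(wanted_ids))


def make_claim(item_id, rank):
    """Build a claim in the shape used by the Wikibase JSON dump."""
    return {"rank": rank, "mainsnak": {"datavalue": {"value": {"id": item_id}}}}


# Illustrative entity: "city" (Q515) at normal rank next to a preferred claim.
entity = {
    "id": "Q1794",
    "claims": {
        "P31": [
            make_claim("Q515", "normal"),
            make_claim("Q_PREFERRED", "preferred"),  # placeholder ID
        ]
    },
}

wanted = ["Q515", "Q7930989", "Q15284"]
print(matches(entity, "P31", wanted, truthy_only=True))   # False
print(matches(entity, "P31", wanted, truthy_only=False))  # True
```

With the truthy-only test, the normal-rank Q515 claim is hidden by the preferred claim, which is why the entity was dropped; testing all claims brings it back.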

As for the counting problem, there have been some fixes and improvements in the latest versions. Please retry with the latest version and open a dedicated issue if that kind of problem is still happening.
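When retrying, one way to check the reported count against the output is to count the entities in the resulting file directly. The following is only a sketch: the function name and the file name in the usage comment are assumptions, and it relies solely on the output being newline-delimited JSON with one entity (and its "id") per line.

```python
import json


def summarize_ndjson(lines):
    """Return (number of entity lines, number of distinct entity IDs)."""
    total = 0
    ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        entity = json.loads(line)
        total += 1
        ids.add(entity["id"])
    return total, len(ids)


# Usage, assuming the filtered output was saved as "filtered.ndjson":
#
#     with open("filtered.ndjson", encoding="utf-8") as f:
#         print(summarize_ndjson(f))
```

If the two numbers differ, some entities appear more than once in the file; if both are far from the "in:" value, that is worth including in a dedicated issue.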