Open Daniel-Mietchen opened 7 years ago
As a starting point, it would make sense to go for papers that are already on Wikidata and have a P932 (PMCID) statement.
The query for that is:
SELECT ?item ?pmcid WHERE {
?item wdt:P31 wd:Q13442814;
wdt:P932 ?pmcid.
}
#LIMIT 100
Without the LIMIT clause (commented out above), this took just 6 s and returned 334628 results, which sounds like a good maximal size for a test set.
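For anyone who wants to pull that list programmatically, here is a minimal sketch using only the Python standard library. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results layout; the function names and the User-Agent string are made up for illustration and are not part of any existing tool.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

# Same query as above: scholarly articles (Q13442814) with a PMCID (P932).
QUERY = """
SELECT ?item ?pmcid WHERE {
  ?item wdt:P31 wd:Q13442814;
        wdt:P932 ?pmcid.
}
"""


def build_request(query, limit=None):
    """Build an HTTP request for the query, optionally appending a LIMIT."""
    if limit is not None:
        query = f"{query.rstrip()}\nLIMIT {int(limit)}"
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    # Hypothetical User-Agent; replace it with something identifying your tool.
    headers = {"User-Agent": "pmcid-corpus-sketch/0.1 (example)"}
    return urllib.request.Request(f"{ENDPOINT}?{params}", headers=headers)


def parse_bindings(payload):
    """Turn SPARQL JSON results into a list of (QID, PMCID) tuples."""
    rows = []
    for binding in payload["results"]["bindings"]:
        # The item comes back as an entity URI; keep only the trailing Q-number.
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        rows.append((qid, binding["pmcid"]["value"]))
    return rows


def fetch(limit=100):
    """Run the query against the live endpoint (needs network access)."""
    with urllib.request.urlopen(build_request(QUERY, limit)) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch(limit=100)` corresponds to uncommenting the LIMIT line; `fetch(limit=None)` would request the full result set.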
Daniel-Mietchen and I discussed this with the possible outcomes of:
High-level strategy
Collect a corpus of Open articles and carry out supervised term analysis of the content, supported by #wikidata-enhanced dictionaries. Articles with a "main topic" which maps onto #Wikidata items (Q\d+) are likely to have many mentions of the main topic. For example, the article http://europepmc.org/articles/PMC2491585 mentions
and the most common terms (Bag of words) are:
We can infer that the main topic of the article is Dengue Virus and antigenicity. This is consistent with the title:
Conservation and variability of dengue virus proteins: implications for vaccine design.
The term "vaccine" occurs 16 times in the main text, whereas "HLA" and "peptide" (the mechanism of vaccination) are emphasised.
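The bag-of-words step above can be sketched roughly as follows. This is only an illustration, not ContentMine's actual pipeline: the tokenizer regex, the tiny stopword list and the function names are all assumptions chosen to keep the example short.

```python
import re
from collections import Counter

# Assumption: a deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}


def term_counts(text, stopwords=STOPWORDS):
    """Count lower-cased word tokens in the text, skipping stopwords."""
    tokens = re.findall(r"[a-z][a-z0-9-]*", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


def top_terms(text, n=10):
    """Return the n most frequent terms as (term, count) pairs."""
    return term_counts(text).most_common(n)
```

Running `top_terms(full_text, 20)` on the body of an article would give a frequency list of the kind discussed above, with `term_counts(full_text)["vaccine"]` giving the count for a single term.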
Corpus of articles:
OK, I've added these to https://www.wikidata.org/wiki/Q24288762#P921 .
How can we scale that up? Can you provide a list of the following kind?
Reopening this, as we are still working on it.
A side project could be to identify the main subject(s) for journals: currently, ca. 40k instances of scientific journal do not have any main subject set in Wikidata. Query:
SELECT ?item ?itemLabel WHERE {
?item wdt:P31 wd:Q5633421.
MINUS {?item wdt:P921 ?mainsubject.}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
As per https://www.wikidata.org/wiki/Property_talk:P921 , P921 is for the "primary topic of a work", which should have a Wikidata entry. It doesn't have to be just one - for instance, the article Causal or not: applying the Bradford Hill aspects of evidence to the association between Zika virus and microcephaly (Q24261170) currently has Zika virus, microcephaly and Bradford Hill criteria.
On Sun, Mar 19, 2017 at 4:41 PM, Stefan Kasberger notifications@github.com wrote:
What exactly do you mean by the term "main subject"?
ContentMine can analyze papers in various ways, including as to what the most salient terms are, e.g. via https://en.wikipedia.org/wiki/Tf%E2%80%93idf .
It would be nice to harvest that to annotate Wikidata items about papers with the property P921 "main subject".
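To make the tf-idf idea concrete, here is a minimal pure-Python sketch. It uses the plain textbook weighting (relative term frequency times log of inverse document frequency); that choice, the tokenizer and the function names are assumptions for illustration and not necessarily what ContentMine does.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z][a-z0-9-]*", text.lower())


def tf_idf(docs):
    """Score terms per document; docs maps a document id to its text."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for term_counter in counts.values():
        doc_freq.update(term_counter.keys())
    scores = {}
    for doc_id, term_counter in counts.items():
        total = sum(term_counter.values())
        scores[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_counter.items()
        }
    return scores


def salient_terms(docs, doc_id, n=5):
    """Return the n highest-scoring terms for one document."""
    doc_scores = tf_idf(docs)[doc_id]
    # Sort by descending score, breaking ties alphabetically.
    return sorted(doc_scores, key=lambda term: (-doc_scores[term], term))[:n]
```

Terms that occur in every document of the corpus score zero, so only terms specific to a paper surface. The resulting strings are still just candidates: each would need to be matched to a Wikidata item before it could be used as a P921 value.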