sparcopen / open-research-doathon

Open Research Data do-a-thon in London & Virtual - March 4th & 5th

Build ContentMine-based workflow for "main subject" of papers in Wikidata #51

Open Daniel-Mietchen opened 7 years ago

Daniel-Mietchen commented 7 years ago

ContentMine can analyze papers in various ways, including identifying their most salient terms, e.g. via tf–idf (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

It would be nice to harvest those terms to annotate Wikidata items about papers with the property P921 "main subject".
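
To make the tf–idf idea concrete, here is a minimal sketch in plain Python. It is not ContentMine's implementation; the function names `tf_idf` and `top_terms` are made up for illustration, and it assumes the papers have already been tokenized into lists of lowercase words. It uses one common variant of the formula (length-normalised term frequency times `log(N / df)`).

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs: list of token lists (assumed pre-tokenized and lowercased).
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty docs
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(score, k=3):
    """The k highest-scoring terms, best first (ties broken alphabetically)."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

A term that appears in every paper of the corpus gets a score of zero, so only terms that distinguish a paper from the rest would surface as candidates for P921.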

Daniel-Mietchen commented 7 years ago

As a starting point, it would make sense to go for papers that are already on Wikidata and have a P932 (PMCID) statement.

The query for that is

SELECT ?item ?pmcid WHERE {
  ?item wdt:P31 wd:Q13442814;
        wdt:P932 ?pmcid.  
}
#LIMIT 100

Without the LIMIT clause (commented out above), this took just 6 s and returned 334,628 results, which sounds like a good maximum size for a test set.
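
For scripting this, a rough sketch of running the same query from Python with only the standard library follows. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and its standard SPARQL JSON result layout; the helper names (`build_query`, `parse_bindings`, `fetch`) and the User-Agent string are placeholders, not part of any existing tool.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"  # public query service (assumed)

QUERY_TEMPLATE = """
SELECT ?item ?pmcid WHERE {{
  ?item wdt:P31 wd:Q13442814;
        wdt:P932 ?pmcid.
}}
{limit_clause}
"""


def build_query(limit=None):
    """Return the query text, with a LIMIT clause only if a limit is given."""
    clause = f"LIMIT {int(limit)}" if limit is not None else ""
    return QUERY_TEMPLATE.format(limit_clause=clause).strip()


def parse_bindings(data):
    """Turn a SPARQL JSON result into a list of (QID, PMCID) pairs."""
    pairs = []
    for row in data["results"]["bindings"]:
        # ?item comes back as a full entity URI; keep only the trailing QID.
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        pairs.append((qid, row["pmcid"]["value"]))
    return pairs


def fetch(limit=100, user_agent="main-subject-sketch/0.1 (example)"):
    """Run the query over HTTP and return (QID, PMCID) pairs."""
    params = urllib.parse.urlencode(
        {"query": build_query(limit), "format": "json"}
    )
    req = urllib.request.Request(
        ENDPOINT + "?" + params, headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return parse_bindings(json.load(resp))
```

Calling `fetch(limit=100)` would give a small test set; the resulting PMCIDs could then be used to retrieve the full texts for term analysis.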

petermr commented 7 years ago

Daniel-Mietchen and I discussed this, with the following possible outcomes:

High-level strategy

Collect a corpus of Open articles and carry out supervised term analysis of the content, supported by #wikidata-enhanced dictionaries. Articles with a "main topic" which maps onto #Wikidata items (Q\d+) are likely to have many mentions of the main topic. For example article http://europepmc.org/articles/PMC2491585 mentions

and the most common terms (Bag of words) are:

We can infer that the main topics of the article are dengue virus and antigenicity. This is consistent with the title:

Conservation and variability of dengue virus proteins: implications for vaccine design.

The term "vaccine" occurs 16 times in the main text (whereas "HLA" and "peptide" - the mechanism of vaccination is emphasised.
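
The dictionary-supported counting described above could look roughly like the sketch below. It is not the ContentMine code: the function names, the `min_mentions` threshold, and the shape of the dictionary (surface term mapped to a Wikidata item ID) are assumptions made for illustration. Synonyms that map to the same item are pooled, so the ranking is per item and not per spelling.

```python
import re
from collections import Counter


def count_dictionary_terms(text, dictionary):
    """Count case-insensitive whole-word mentions of each dictionary term.

    dictionary: {surface term: item ID}. Returns a Counter keyed by
    item ID, so several terms for one item add up.
    """
    counts = Counter()
    lowered = text.lower()
    for term, item_id in dictionary.items():
        # \b keeps "vaccine" from matching inside "vaccines" or similar.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        counts[item_id] += len(re.findall(pattern, lowered))
    return counts


def candidate_subjects(text, dictionary, min_mentions=2):
    """Items mentioned at least min_mentions times, most frequent first."""
    counts = count_dictionary_terms(text, dictionary)
    return [(item, n) for item, n in counts.most_common() if n >= min_mentions]
```

The items at the top of `candidate_subjects` would be suggestions for P921 that still need human review, since a frequently mentioned term is not necessarily the main subject.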

Corpus of articles:

Daniel-Mietchen commented 7 years ago

OK, I've added these to https://www.wikidata.org/wiki/Q24288762#P921 .

How can we scale that up? Can you provide a list of the following kind?

Daniel-Mietchen commented 7 years ago

Reopening this, as we are still working on it.

Daniel-Mietchen commented 7 years ago

A side project could be to identify the main subject(s) for journals: currently, ca. 40k instances of scientific journal do not have any main subject set in Wikidata. Query:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5633421.
  MINUS {?item wdt:P921 ?mainsubject.}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Daniel-Mietchen commented 7 years ago

As per https://www.wikidata.org/wiki/Property_talk:P921 , P921 is for the "primary topic of a work", which should have a Wikidata entry. It doesn't have to be just one - for instance, the article Causal or not: applying the Bradford Hill aspects of evidence to the association between Zika virus and microcephaly (Q24261170) currently has Zika virus, microcephaly and Bradford Hill criteria.

On Sun, Mar 19, 2017 at 4:41 PM, Stefan Kasberger notifications@github.com wrote:

What do you exactly mean with the term "main subject"?