Pre-populate degree dictionary

cthoyt commented 2 years ago

Using a SPARQL query to get all subclasses of academic title (Q3529618) would be a nice way to pre-populate degrees.json. The following SPARQL query (run at https://w.wiki/5o9H) gets the job done:

SELECT ?itemLabel ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Caveats:

This should be extended to multiple languages
Some labels are empty; those should be filtered out either in SPARQL or in post-processing (I realize this was likely due to there not being English labels)
There might be other terms besides academic title that are relevant, but this seems like a pretty good start
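A minimal sketch of the post-processing option for the empty-label caveat, in Python. It assumes the query results have already been turned into (label, QID) pairs; the function name `drop_unlabeled` and the sample rows are hypothetical. As far as I know, the label service falls back to the bare identifier when no label exists in the requested languages, so the sketch also drops labels that look like a QID.

```python
import re

# Wikidata item identifiers look like "Q" followed by digits.
QID_PATTERN = re.compile(r"^Q\d+$")


def drop_unlabeled(rows):
    """Keep only (label, qid) pairs whose label is real text.

    `rows` is a hypothetical list of (label, qid) tuples already
    extracted from the query results.
    """
    kept = []
    for label, qid in rows:
        label = label.strip()
        # Skip empty labels and labels that merely repeat an identifier.
        if not label or QID_PATTERN.match(label):
            continue
        kept.append((label, qid))
    return kept


rows = [
    ("Master of Arts", "Q2091008"),
    ("Q12345678", "Q12345678"),  # made-up identifier standing in for a missing label
    ("", "Q87654321"),           # made-up identifier with an empty label
]
print(drop_unlabeled(rows))
```

Filtering in Python rather than in SPARQL keeps the query simple, at the cost of downloading rows that are thrown away.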

Alternate Multi-lingual SPARQL

SELECT DISTINCT ?label ?item
WHERE {
  ?item wdt:P279* wd:Q3529618 .
  ?item rdfs:label ?label .
}

Note that DISTINCT doesn't collapse entries that have the same text but are tagged with different languages.
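One way to collapse those rows after the fact is to drop the language tag and deduplicate on the text alone. The sketch below assumes the results arrive in the standard SPARQL JSON layout (a list of bindings with `value` and `xml:lang` keys); the function name `collapse_languages` and the sample bindings are made up for illustration.

```python
def collapse_languages(bindings):
    """Return sorted unique (label, qid) pairs, ignoring the language tag."""
    pairs = set()
    for binding in bindings:
        label = binding["label"]["value"]
        # The item value is an entity URI; keep only the trailing QID.
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        pairs.add((label, qid))
    return sorted(pairs)


# Hypothetical bindings: the same text appears under two language tags.
bindings = [
    {"label": {"value": "Master of Arts", "xml:lang": "en"},
     "item": {"value": "http://www.wikidata.org/entity/Q2091008"}},
    {"label": {"value": "Master of Arts", "xml:lang": "en-gb"},
     "item": {"value": "http://www.wikidata.org/entity/Q2091008"}},
    {"label": {"value": "Master of Arts", "xml:lang": "en"},
     "item": {"value": "http://www.wikidata.org/entity/Q6785149"}},
]
print(collapse_languages(bindings))
```

The two language variants for Q2091008 collapse into one row, while the row for Q6785149 stays separate because it points to a different item.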

lubianat commented 2 years ago

We need a way to deal with duplicates, e.g.: "Master of Arts": "Q6785149", "Master of Arts": "Q2091008",

Both are valid, by the way

cthoyt commented 2 years ago

are there meaningful differences between these?

if not, they can be merged in wikidata
if they do have differences, then how do we decide which is right? Maybe coming up with a way of pruning country-specific duplicates (as https://www.wikidata.org/wiki/Q6785149 appears to be) would be helpful in making this list smaller

lubianat commented 2 years ago

@cthoyt this is only an example, there are many cases like that. They are different, yes, as one is specific to Scotland.

They are not duplicates but specializations, and we can always use the more general term. That is what I do when manually curating these keys.

Pruning them automatically may prove to be an endless task due to the variety of possible items.

Until we have a workflow for curating these duplicates, I'd rather roll back to the manually curated version of the file.

cthoyt commented 2 years ago

It's fine with me if you want to roll back, but I am optimistic that creating rules for processing the data would be possible. Maybe you can start by assessing how big the overlap really is by changing the returned data structure from a dict to something more TSV-like (one row per label and item), so that repeated labels are not lost.
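As a rough sketch of that assessment, assuming the rows are available as (label, QID) tuples, something like the following would list every label that maps to more than one item. The function name `find_ambiguous_labels` is hypothetical, and the last sample row uses a made-up identifier.

```python
from collections import defaultdict


def find_ambiguous_labels(rows):
    """Map each label that points at several items to its sorted QIDs."""
    by_label = defaultdict(set)
    for label, qid in rows:
        by_label[label].add(qid)
    # Keep only the labels attached to more than one item.
    return {
        label: sorted(qids)
        for label, qids in by_label.items()
        if len(qids) > 1
    }


rows = [
    ("Master of Arts", "Q6785149"),
    ("Master of Arts", "Q2091008"),
    ("Example Degree", "Q0000001"),  # made-up identifier, unambiguous label
]
ambiguous = find_ambiguous_labels(rows)
print(len(ambiguous), "ambiguous label(s):", ambiguous)
```

The size of the returned dict relative to the number of distinct labels gives a first estimate of how much curation the overlap would need.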

lubianat commented 2 years ago

@cthoyt actually I think the duplicates appeared when I merged my curations with the automatic dict. The current code overrides the "Master of Arts" entry and keeps only the Scottish version. It should be kept in a development branch, as it is dangerous as-is.
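To illustrate the override problem (this is not the actual merge code in the repository, only a sketch under the assumption that both sources are plain label-to-QID dicts): a naive merge lets whichever dict is applied last win silently. The hypothetical `merge_degrees` below gives the curated entries precedence and reports every conflict so it can be reviewed by hand.

```python
def merge_degrees(curated, automatic):
    """Merge two label -> QID dicts, letting curated entries win.

    Returns the merged dict plus the conflicts as
    {label: (automatic_qid, curated_qid)}.
    """
    merged = dict(automatic)
    conflicts = {}
    for label, qid in curated.items():
        if label in merged and merged[label] != qid:
            conflicts[label] = (merged[label], qid)
        # Curated value always replaces the automatic one.
        merged[label] = qid
    return merged, conflicts


curated = {"Master of Arts": "Q2091008"}
automatic = {"Master of Arts": "Q6785149"}

# A naive merge keeps the automatic value because it is applied last.
naive = {**curated, **automatic}
print(naive)

merged, conflicts = merge_degrees(curated, automatic)
print(merged)
print(conflicts)
```

Giving the curated dict precedence is an assumption here; the point is mainly that conflicts are surfaced instead of being overwritten without notice.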

lubianat / pyorcidator

Pre-populate degree dictionary #34
