Open cthoyt opened 2 years ago
We need a way to deal with duplicates, e.g.: "Master of Arts": "Q6785149", "Master of Arts": "Q2091008",
Both are valid, by the way
Are there meaningful differences between these?
@cthoyt this is only an example, there are many cases like that. They are different, yes, as one is specific to Scotland.
No duplicates, but specializations, and we can always use a more general term. That is what I do when manually curating these keys.
Pruning it automatically may prove an endless task due to the variety of possible items.
While we don't have a workflow for curating these duplicates, I'd rather roll back to the manually-curated-only version of the file.
It's fine with me if you want to roll back, but I am optimistic that it would be possible to create rules for processing the data. Maybe you can start by assessing how big the overlap really is, by changing the returned data structure from a dict to something more like TSV data.
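As a minimal sketch of that suggestion: if the data is kept as (label, QID) rows rather than a dict, repeated labels are preserved and the overlap can be counted. The function name `find_ambiguous_labels` and the row layout are assumptions made for illustration, not part of the existing code; the two QIDs are the ones from the example at the top of this issue.

```python
from collections import defaultdict

# TSV-like (label, QID) rows. Unlike dict keys, repeated labels are all kept.
rows = [
    ("Master of Arts", "Q6785149"),
    ("Master of Arts", "Q2091008"),
]


def find_ambiguous_labels(rows):
    """Return {label: sorted QIDs} for every label that maps to more than one QID."""
    qids_by_label = defaultdict(set)
    for label, qid in rows:
        qids_by_label[label].add(qid)
    return {
        label: sorted(qids)
        for label, qids in qids_by_label.items()
        if len(qids) > 1
    }


ambiguous = find_ambiguous_labels(rows)
total_labels = len({label for label, _ in rows})
print(f"{len(ambiguous)} of {total_labels} labels map to more than one QID")
for label, qids in ambiguous.items():
    # One tab-separated line per ambiguous label, ready for manual review.
    print(label, *qids, sep="\t")
```

The tab-separated output could be reviewed by hand to decide, per label, whether the more general item should win.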
@cthoyt actually I think the duplicates appeared when I merged my curations with the automatic dict. The current code overrides the "Master of Arts" and adds only the Scottish version. It should be kept in a development branch, as it is dangerous as-is
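To illustrate the overriding, here is a hedged sketch, assuming the two sources are merged as plain Python dicts (the actual merge code is not shown in this thread). The helper `merge_keeping_conflicts` is hypothetical, and which QID came from which source is assumed purely for the example.

```python
def merge_keeping_conflicts(curated, automatic):
    """Merge two label -> QID dicts; curated entries win, and clashes are reported."""
    merged = dict(curated)
    conflicts = {}
    for label, qid in automatic.items():
        if label in merged and merged[label] != qid:
            # Same label, different QID: record both instead of overwriting.
            conflicts[label] = (merged[label], qid)
        else:
            merged[label] = qid
    return merged, conflicts


# Assumption for illustration only: the assignment of QIDs to sources is made up.
curated = {"Master of Arts": "Q6785149"}
automatic = {"Master of Arts": "Q2091008"}

# A plain dict merge silently keeps only the last value for a repeated key.
overwritten = {**curated, **automatic}
print(overwritten)

merged, conflicts = merge_keeping_conflicts(curated, automatic)
print(merged)
print(conflicts)
```

Reporting the clashes separately keeps the curated file stable while still surfacing the labels that need a decision.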
Using a SPARQL query to get all subclasses of academic title (Q3529618) would be a nice way to pre-populate `degrees.json`. The following SPARQL query (run at https://w.wiki/5o9H) gets the job done:

Caveats:
Alternate Multi-lingual SPARQL
Note that `DISTINCT` doesn't collapse entries that are tagged with different languages but have the same text.
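One possible workaround is to collapse such rows after the query has run. The sketch below assumes the results arrive in the standard SPARQL JSON format and that the query binds variables named `item` and `label`; those variable names, the function `collapse_labels`, and the sample rows are assumptions for illustration, not real query output.

```python
def collapse_labels(bindings):
    """Group SPARQL JSON bindings by (QID, label text), ignoring the language tag.

    Returns {(qid, text): sorted language tags that carried that text}.
    """
    languages = {}
    for binding in bindings:
        # Entity URIs end in the QID, e.g. http://www.wikidata.org/entity/Q6785149
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        text = binding["label"]["value"]
        lang = binding["label"].get("xml:lang", "")
        languages.setdefault((qid, text), set()).add(lang)
    return {key: sorted(langs) for key, langs in languages.items()}


# Made-up sample: the same text tagged with two languages, which DISTINCT
# treats as two different literals.
sample_bindings = [
    {
        "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q6785149"},
        "label": {"type": "literal", "xml:lang": "en", "value": "Master of Arts"},
    },
    {
        "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q6785149"},
        "label": {"type": "literal", "xml:lang": "en-gb", "value": "Master of Arts"},
    },
]

collapsed = collapse_labels(sample_bindings)
for (qid, text), langs in collapsed.items():
    print(qid, text, ",".join(langs), sep="\t")
```

Keeping the list of languages alongside each collapsed entry makes it possible to check later which tags contributed the label.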