Analyse openaire similarity mechanism

pvgenuchten commented 3 months ago

OpenAIRE has a mechanism to identify which repositories include a resource. The origins of the record are stored with the record, including a string identifier of the platform. That identifier should be used to retrieve the URL of the record, so users can click through from the record in SoilWise to one of its origins.

Examine how OpenAIRE advertises the origins
Identify whether OpenAIRE has a mechanism to find a URL by ID
Define a mapping between ID and URL for popular platforms (can also be based on DOI)

BerkvensNick commented 2 months ago

Documentation from OpenAIRE on their similarity mechanism: docs

BerkvensNick commented 2 months ago

OpenAIRE assigns internal identifiers for each object it collects. By default, the internal identifier is generated as sourcePrefix::md5(localId) where:

sourcePrefix is a namespace prefix of 12 chars assigned to the data source at registration time
localId is the identifier assigned to the object by the data source

docs
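
To make the identifier scheme quoted above concrete, here is a minimal sketch of building and splitting such an identifier. The function names are hypothetical, and the UTF-8 encoding and lowercase hex digest are assumptions; the quoted docs only state the `sourcePrefix::md5(localId)` pattern.

```python
import hashlib
from typing import Tuple


def make_openaire_id(source_prefix: str, local_id: str) -> str:
    """Build an identifier of the form sourcePrefix::md5(localId).

    Assumption: localId is hashed as UTF-8 and rendered as a hex digest.
    """
    digest = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{source_prefix}::{digest}"


def split_openaire_id(obj_identifier: str) -> Tuple[str, str]:
    """Split an identifier into (sourcePrefix, md5 part) at the first '::'."""
    prefix, _, digest = obj_identifier.partition("::")
    return prefix, digest
```

Because md5 is one-way, the localId cannot be recovered from the identifier itself; only the sourcePrefix part tells us anything about the origin.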

BerkvensNick commented 2 months ago

@pvgenuchten, I had a look at this issue. I am not sure I fully understand it, but this is what I did: I used the API "https://api.openaire.eu/search/datasets?format=json" to extract 900 records and then extracted certain fields from each record:

obj_identifier = item['header']['dri:objIdentifier']['$']
collectedfrom_1 = item['metadata']['oaf:entity']['oaf:result']['collectedfrom']['@name']
original_id = item['metadata']['oaf:entity']['oaf:result']['originalId']
collectedfrom_2 = item['metadata']['oaf:entity']['oaf:result']['children']['instance']['collectedfrom']['@name']
hostedby = item['metadata']['oaf:entity']['oaf:result']['children']['instance']['hostedby']['@name']
webresource = item['metadata']['oaf:entity']['oaf:result']['children']['instance']['webresource']['url']['$']
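
A slightly more defensive variant of the lookups above, as a sketch only. It assumes that some nodes (instance, originalId, webresource) can be either a single object or a list, and that text values sit under a '$' key as in the lines above; the helper names are hypothetical.

```python
from typing import Any, Dict, List


def as_list(value: Any) -> List[Any]:
    """Wrap a single object in a list; map None to an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def text_of(node: Any) -> Any:
    """Return the '$' text of a node, or the node itself if it is not a dict."""
    return node.get("$") if isinstance(node, dict) else node


def extract_fields(item: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the fields used in this analysis from one API result item."""
    result = item["metadata"]["oaf:entity"]["oaf:result"]
    children = result.get("children") or {}
    instances = as_list(children.get("instance"))
    return {
        "obj_identifier": text_of(item["header"]["dri:objIdentifier"]),
        "collectedfrom": [c.get("@name") for c in as_list(result.get("collectedfrom"))],
        "original_ids": [text_of(o) for o in as_list(result.get("originalId"))],
        "hostedby": [h.get("@name") for i in instances for h in as_list(i.get("hostedby"))],
        "urls": [text_of(w.get("url")) for i in instances for w in as_list(i.get("webresource"))],
    }
```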

the results are in the attached csv soilwise_openaire_doi.xls

The origin of the record is the 'sourcePrefix' (a namespace prefix of 12 chars) of the obj_identifier, e.g. for OmicsDI::47167d2e7a363dcb907e77d4a5c948d7 the 'sourcePrefix' is 'OmicsDI'.

This 'sourcePrefix' does not always seem to be unique to one source, e.g. 'doi_____' is used for Datacite, Crossref and Zenodo (see more info and examples at the bottom of the following webpage).

In some cases the sourcePrefix can be used to generate the URL from the id. For the prefix 'doi_____', the recurring pattern is 'https://doi.org/' + original_id. The originalId field in the API response holds 2 values; the one without the '50|' string pattern is the one to use when constructing the URL.
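
The DOI case can be sketched as follows. The function name is hypothetical, and it assumes the originalId values have already been read into a plain list of strings.

```python
from typing import List, Optional


def doi_url(original_ids: List[str]) -> Optional[str]:
    """Return a doi.org URL built from the first originalId without '50|'."""
    candidates = [oid for oid in original_ids if "50|" not in oid]
    if not candidates:
        return None  # nothing usable, leave the record without a link
    return "https://doi.org/" + candidates[0]
```

For example, `doi_url(["50|doi_____::abc", "10.1029/2018WR024608"])` gives `https://doi.org/10.1029/2018WR024608`.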

But this is not always the case. See the first 3 records in the attached csv: _____OmicsDI is linked to the following record URLs: https://www.omicsdi.org/dataset/gpmdb/GPM11210027561, https://www.omicsdi.org/dataset/omics_ena_project/PRJNA267992, https://www.omicsdi.org/dataset/geo/GSE63974. The path after "https://www.omicsdi.org/dataset" seems to be determined by the hostedby information: 'The Global Proteome Machine Database' (/gpmdb/), 'European Nucleotide Archive' (/omics_ena_project/), 'Gene Expression Omnibus' (/geo/).

For the mapping from id to URL, I think this will be a combination of 'sourcePrefix' (+ hostedby) + originalId. I think we can deduce this info by extracting multiple records and applying rule-based logic keyed on the 'sourcePrefix'?
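
A rough sketch of what such rule-based logic could look like, covering only the two cases discussed above. The function and table names are hypothetical, and it assumes that the OmicsDI originalId equals the accession at the end of the record URL, which has not been verified across records.

```python
from typing import Optional

# hostedby name -> OmicsDI path segment, taken from the three example records
OMICSDI_SEGMENTS = {
    "The Global Proteome Machine Database": "gpmdb",
    "European Nucleotide Archive": "omics_ena_project",
    "Gene Expression Omnibus": "geo",
}


def resolve_url(source_prefix: str, original_id: str,
                hostedby: Optional[str] = None) -> Optional[str]:
    """Map (sourcePrefix, originalId, hostedby) to a record URL, or None."""
    prefix = source_prefix.strip("_")  # ignore the underscore padding
    if prefix == "doi":
        return f"https://doi.org/{original_id}"
    if prefix == "OmicsDI":
        segment = OMICSDI_SEGMENTS.get(hostedby or "")
        if segment is None:
            return None  # unknown host, no rule yet
        # Assumption: originalId is the accession used in the OmicsDI URL
        return f"https://www.omicsdi.org/dataset/{segment}/{original_id}"
    return None  # no rule defined for this prefix
```

Returning None for unknown prefixes keeps the mapping conservative: a record gets a link only where a rule has been checked against real records.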

I have added the code I use in the notebook 'soilwise_openaire' in the github repository.

I hope I understood the question correctly, if not let me know.

pvgenuchten commented 2 months ago

Thank you for the work. The main question is: for any harvested record, can we capture the platforms on which it has been found and, on those platforms, the URL to access it? See for example this record. I think you are on the right track, introducing a concatenation pattern based on sourcePrefix. My suggestion would be to try it out for some of the popular platforms (CORDIS, Zenodo, Dataverse, GBIF), then evaluate whether it makes sense.

https://api.openaire.eu/search/publications?format=json&page=16&size=200

It is apparently available in the collectedfrom and originalId sections, but without a direct URL. I wonder if we can derive the direct URL for popular platforms from those 2 properties.

pvgenuchten commented 2 months ago

on the other hand, i like what openaire states here:

it's probably best to only link to formal pids, because localid seems unstable over time

pvgenuchten commented 2 months ago

my suggestion would be to close this issue, but document its findings (as evidence in our reports)

BerkvensNick commented 2 months ago

Hi @pvgenuchten, I probably don't fully understand the issue, but there seems to be a direct URL in the JSON file for each record in the fields "children.result.instance.url" or "children.result.instance.webresources"?

In some cases you can also derive the direct URL from the collectedfrom and originalId fields. For the example you provided in the comments above, sourcePrefix = "openaire____" -> doi.org/10.1029/2018WR024608 ('sourcePrefix' + correct originalId). But as mentioned, in some cases the direct URL will be based on 'sourcePrefix' (+ hostedby) + originalId, and it will take some effort to determine this. As you mentioned, we could do this for the more popular platforms.

Fine for me to close issue

soilwise-he / similarity-finder

Analyse openaire similarity mechanism #9