ror-community / ror-roadmap

Central information about what is happening at ROR and how to contribute feedback

[FEATURE] Central dataset[s] for training & testing methods of matching affiliation strings to ROR IDs, plus performance metrics #147

Open amandafrench opened 1 year ago

amandafrench commented 1 year ago

Describe the problem you would like to solve

Publishers, funders, and other entities who use the ?affiliation parameter and third-party ML tools to match affiliation strings to ROR IDs would like centrally hosted training and testing datasets so that they can train and test their tools. Performance statistics on speed, accuracy, recall, precision, and F1 score would also help organizations decide which tools to use for this purpose.
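For reference, a minimal sketch of what a single lookup against the ?affiliation parameter might look like. The endpoint URL and response fields are taken from the current public ROR API and should be treated as illustrative rather than definitive:

```python
# Sketch: match one affiliation string via the ROR API's ?affiliation parameter.
# Assumes the v2 endpoint; the response's "items" list marks at most one
# candidate as "chosen" when the service is confident in the match.
import requests

def match_affiliation(affiliation):
    resp = requests.get(
        "https://api.ror.org/v2/organizations",
        params={"affiliation": affiliation},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    chosen = [item for item in items if item.get("chosen")]
    return chosen[0]["organization"]["id"] if chosen else None

print(match_affiliation("Department of Physics, University of Oxford, Oxford, UK"))
```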

Describe the solution you'd like

ROR should develop and host training and testing datasets for this purpose and maintain and update them as needed. ROR should also encourage those who build these matching tools to open source them and/or to publish their performance metrics with ROR.
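To make the published metrics comparable, evaluation against the shared dataset could be as simple as the sketch below. The gold data and the `match` callable are placeholders for whatever ROR ends up hosting, not an existing resource:

```python
# Hypothetical sketch: score a matcher against a gold dataset of
# (affiliation string, expected ROR ID or None) pairs.
def evaluate(gold, match):
    tp = fp = fn = 0
    for affiliation, expected in gold:
        predicted = match(affiliation)
        if predicted is not None and predicted == expected:
            tp += 1
        elif predicted is not None:      # wrong ID, or gold says "no match"
            fp += 1
            if expected is not None:
                fn += 1                  # the correct ID was also missed
        elif expected is not None:       # matcher returned nothing but should not have
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```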

Who would benefit from this feature?

CZI and Wiley have both explicitly requested these metrics, and OA Switchboard would benefit as well. Both OpenAlex and Semantic Scholar have published performance metrics for their tools, but they too expressed a desire for standard curated datasets to use. Anyone who faces the task of matching affiliation strings to ROR IDs would welcome an evidence-based way of choosing an existing tool or of testing their own custom tools.

Additional information

Manuscript submission systems that extract metadata from PDFs or documents, rather than asking authors to enter it in structured fields, would also benefit from these datasets and measurements.

aanastasiou commented 1 year ago

I am dealing with a transfer of existing models from GRID to ROR and would be interested in working on this.

This can be achieved by attaching a "modification risk" score to each organisation. The modification risk is defined as the ratio of the number of releases in which an organisation's record was changed to the total number of releases in which the organisation appears.

For example: Organisation A has appeared in 20 releases of GRID/ROR. Out of those 20 releases, its data were amended 5 times, so its risk is 5/20 = 1/4. Organisation B has appeared in 4 releases and its data were never amended, so its risk is 0.
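A rough sketch of how this could be computed, assuming each release has been loaded as a mapping from ROR ID to its record and the releases are in chronological order (the data structures are assumptions, not part of the published dumps):

```python
# Sketch of the proposed "modification risk": for each organisation, the share
# of releases in which its record differs from its previous appearance,
# relative to the total number of releases it appears in.
def modification_risk(releases):
    appearances = {}   # ror_id -> number of releases the organisation appears in
    amendments = {}    # ror_id -> number of releases in which its record changed
    previous = {}      # ror_id -> record as seen in its most recent appearance
    for release in releases:
        for ror_id, record in release.items():
            appearances[ror_id] = appearances.get(ror_id, 0) + 1
            if ror_id in previous and record != previous[ror_id]:
                amendments[ror_id] = amendments.get(ror_id, 0) + 1
            previous[ror_id] = record
    return {
        ror_id: amendments.get(ror_id, 0) / appearances[ror_id]
        for ror_id in appearances
    }
```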

The benchmark dataset would include the top N organisations that appear in at least M successive releases, have status:active, and have a risk below some threshold T.

This would provide the N most stable organisations in the dataset.

This could be made simpler or more sophisticated as required (for example, certain changes to a record might be deemed trivial and therefore not count towards the risk).
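Continuing the sketch above, the selection itself might look like the following; M, N and T are illustrative knobs, and the "status" field is read from each organisation's latest record:

```python
# Sketch: pick the benchmark set from the risk scores above. Keep active
# organisations seen in at least M releases with risk below T, then take the
# N most stable (lowest risk, most appearances as tie-breaker).
def benchmark_ids(releases, M=10, N=1000, T=0.1):
    risk = modification_risk(releases)
    latest = releases[-1]
    appearances = {}
    for release in releases:
        for ror_id in release:
            appearances[ror_id] = appearances.get(ror_id, 0) + 1
    candidates = [
        ror_id
        for ror_id, record in latest.items()
        if record.get("status") == "active"
        and appearances[ror_id] >= M
        and risk[ror_id] < T
    ]
    candidates.sort(key=lambda r: (risk[r], -appearances[r]))
    return candidates[:N]
```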

Since all GRID and ROR datasets are publicly available, I can contribute this simple analysis, provided it would be of interest.