internetarchive / fatcat

Perpetual Access To The Scholarly Record
https://guide.fatcat.wiki

ISSN-L matching for JURN index #46

Open bnewbold opened 5 years ago

bnewbold commented 5 years ago

JURN is "An organised links directory for the arts & humanities, listing selected open access or otherwise free ejournals." They list 3000-4000 such journals by name, URL, and category at http://www.jurn.org/directory/, and an additional 800 ecology titles at https://jurnsearch.wordpress.com/titles-indexed-ecology-related/.

It would be great to include these in fatcat (probably via chocula first, though they could go in directly via the API as well), and mark them as open so they will be included in broad IA crawls for preservation. However, JURN doesn't link any persistent identifiers (e.g., Wikidata QID or ISSN/ISSN-L), which makes it hard to reference them anywhere without duplication.

Some brainstorms of how to go about this:

Phu2 commented 4 years ago

I scraped the journal names from the JURN directory website, loaded them into OpenRefine, and ran the reconciliation service against Wikidata. By automatically matching the best candidates and doing some manual matching, I got 990 matches out of 3311 journals. For these I tried to add the Wikidata ID, ISSN, and ISSN-L. Here is the comma-separated file exported from OpenRefine: jurn-directory-csv.txt. Can you put this to good use?

bnewbold commented 4 years ago

Hi @Phu2, sorry for the slow reply on this. It is helpful!

I wonder what we could do to increase the match rate, or to confirm that the unmatched journals really have no ISSN. Could we have OpenRefine try to reconcile against the fatcat container list instead of Wikidata? There are JSON dumps here:

https://archive.org/details/fatcat_bulk_exports_2020-08-05

or I could supply a .csv file if you let me know which column fields to include.
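
As a rough illustration of matching against the container list outside of OpenRefine, here is a minimal Python sketch. It assumes the container dump is one JSON object per line with `name` and `issnl` fields; those field names are an assumption and should be checked against the actual export. The helper names (`normalize_title`, `build_index`, `match_titles`) are made up for this example, and exact matching on a normalized title is only a crude stand-in for OpenRefine's fuzzy reconciliation.

```python
import json
import re
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9]+", " ", no_accents.lower())
    return cleaned.strip()


def build_index(container_lines):
    """Map normalized container name -> list of container entities.

    Assumes each line is a JSON object with a "name" field
    (an assumption about the dump format, not a confirmed schema).
    """
    index = defaultdict(list)
    for line in container_lines:
        line = line.strip()
        if not line:
            continue
        entity = json.loads(line)
        name = entity.get("name")
        if name:
            index[normalize_title(name)].append(entity)
    return index


def match_titles(jurn_titles, index):
    """Split titles into unique matches, ambiguous matches, and misses."""
    matched, ambiguous, unmatched = {}, {}, []
    for title in jurn_titles:
        candidates = index.get(normalize_title(title), [])
        if len(candidates) == 1:
            # "issnl" is the assumed field name for the linking ISSN
            matched[title] = candidates[0].get("issnl")
        elif candidates:
            # Several containers share this name: leave for manual review
            ambiguous[title] = [c.get("issnl") for c in candidates]
        else:
            unmatched.append(title)
    return matched, ambiguous, unmatched
```

To use it, pass an open file handle for the container dump to `build_index` and the list of scraped JURN titles to `match_titles`. Titles that land in the ambiguous bucket are the ones where a name-only match risks creating duplicates, so they would still need a manual check or a second signal such as the journal URL.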