ourresearch / journalsdb

Open database of scholarly journals
https://journalsdb.org
MIT License

Standardized publisher names? #31

Open sckott opened 2 years ago

sckott commented 2 years ago

Hi Casey. I'm working on cleaning up the publisher field in an Unsub database table for journal prices, and we talked about maybe using the publisher names that journalsdb uses. However, looking at the data we ingest from journalsdb, I'm not sure whether the names are standardized in journalsdb. For example, searching for the big five publisher names in the journalsdb data we ingest, I see Wiley and Taylor & Francis are all set, but there are a few variants for Elsevier, SAGE, and Springer.

| Publisher | Rows |
| --- | --- |
| Elsevier | 4170 |
| Elsevier- Churchill Livingstone | 1 |
| SAGE | 1464 |
| SAGE Publications | 3 |
| Sage Publications (Prufrock Press, Inc.) | 1 |
| Springer (Biomed Central Ltd.) | 1 |
| Springer Nature | 4045 |
| Springer Publishing Company | 26 |
| Springer-Verlag | 3 |
| Taylor & Francis | 3663 |
| Wiley | 2363 |

It appears "Elsevier- Churchill Livingstone" is part of Elsevier, since it is listed among the names for Crossref member 78:

curl https://api.crossref.org/members/78 | jq .message.names | grep Living

"Elsevier- Churchill Livingstone",

Some of the more interesting publisher names: "tanzilmultazam@umsida.ac.id", "10.15653 (Tierarztl Prax Ausg G Grosstiere Nutztiere)", "10.35977"
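Entries like these can be flagged mechanically before any name matching is attempted. The following is only a rough sketch: the two patterns (an email address, or a name that starts with a DOI prefix such as `10.35977`) are assumptions based on the three examples above, and `looks_suspicious` is a hypothetical helper, not something that exists in journalsdb or Unsub.

```python
import re

# Assumed patterns, derived only from the examples quoted in this comment.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DOI_PREFIX_RE = re.compile(r"^10\.\d{4,9}\b")


def looks_suspicious(name):
    """Return True if a publisher name is an email or starts with a DOI prefix."""
    name = name.strip()
    return bool(EMAIL_RE.match(name) or DOI_PREFIX_RE.match(name))


names = [
    "tanzilmultazam@umsida.ac.id",
    "10.15653 (Tierarztl Prax Ausg G Grosstiere Nutztiere)",
    "10.35977",
    "Taylor & Francis",
]
flagged = [n for n in names if looks_suspicious(n)]
print(flagged)  # the first three entries; "Taylor & Francis" is not flagged
```

Flagged rows would still need a manual look, since a DOI prefix followed by a journal title may be recoverable while a bare email address probably is not.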

Currently, there's a total of 16,366 publisher names from journalsdb.

I think publisher names in journalsdb are not taken straight from Crossref; as I recall, Heather said that you've done some standardizing. To what extent are they cleaned up after you get them from Crossref?
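One way to check whether a given name comes straight from Crossref is to repeat the member lookup from the `curl` command above in Python. This is a minimal sketch: it assumes only that the `/members/{id}` response carries the names under `message.names`, as the `jq` filter above implies. `fetch_member` and `matching_names` are hypothetical helper names, and unlike `grep`, the match here is case-insensitive.

```python
import json
import urllib.request

CROSSREF_MEMBER_URL = "https://api.crossref.org/members/{member_id}"


def fetch_member(member_id):
    """Fetch one Crossref member record (makes a network call)."""
    url = CROSSREF_MEMBER_URL.format(member_id=member_id)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def matching_names(payload, needle):
    """Return the names under message.names that contain `needle`."""
    names = payload.get("message", {}).get("names", [])
    return [n for n in names if needle.lower() in n.lower()]


# Offline usage with a hand-written sample payload (not real API output):
sample = {"message": {"names": ["Elsevier- Churchill Livingstone", "Some Other Imprint"]}}
print(matching_names(sample, "Living"))

# Live usage would look like: matching_names(fetch_member(78), "Living")
```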

I'm curious about your thoughts: if we wanted to use standardized publisher names, what would be the best source for them?

caseydm commented 2 years ago

Hi Scott. Yes, we are doing some very basic standardization of publisher names, which you can see here: https://github.com/ourresearch/journalsdb/blob/main/ingest/journals/journals_new_journal.py#L154

I was told that Springer Publishing Company is separate from Springer Nature, so they are split on purpose. The others you mentioned are outliers that should have been formatted, except for the Elsevier variant; it looks like I need to add that one to the list. I just sent you an email discussion we had on this a while back. It includes Richard's method for normalizing publisher names along with some caveats.
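To illustrate the kind of variant list described above, here is a rough sketch of a lookup-based normalizer. It is not the code at the linked line and not Richard's method. The `CANONICAL` mapping and `normalize_publisher` are hypothetical, and the targets chosen for the SAGE and Springer variants are assumptions based only on the names quoted in this thread. Springer Publishing Company is deliberately left unmapped so it stays separate from Springer Nature.

```python
import re

# Hypothetical variant -> canonical mapping; keys are lowercased with
# whitespace collapsed. The targets are assumptions, not journalsdb's list.
CANONICAL = {
    "elsevier- churchill livingstone": "Elsevier",
    "sage publications": "SAGE",
    "sage publications (prufrock press, inc.)": "SAGE",
    "springer (biomed central ltd.)": "Springer Nature",
    "springer-verlag": "Springer Nature",
}


def normalize_publisher(name):
    """Map a known variant to its canonical name; otherwise return it trimmed."""
    key = re.sub(r"\s+", " ", name).strip().lower()
    return CANONICAL.get(key, name.strip())


for raw in ["Elsevier- Churchill Livingstone", "Springer Publishing Company", "Wiley"]:
    print(raw, "->", normalize_publisher(raw))
```

An explicit lookup table is conservative: names that are not listed pass through unchanged, so an unrelated publisher that happens to share a word with a big one is never merged by accident.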

sckott commented 2 years ago

Thanks, and thanks for forwarding the email.

Okay, I'll assess our needs and see what further standardizing is needed and where to do it.