ourresearch / journalsdb

Open database of scholarly journals
https://journalsdb.org
MIT License

Field for date of the last DOI? #34

Open sckott opened 2 years ago

sckott commented 2 years ago

hi @caseydm

In Unsub, we exclude journals from user dashboards if the journal is not publishing anymore. The way it's currently done is not ideal. It would be helpful to have a new field in the API that indicates the date of the last DOI for the journal. e.g.,

{
  "date_last_doi": "2020-01-04"
}

We could then use that date to calculate time since last DOI and determine whether it's publishing anymore based on whatever criteria we choose.
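A minimal sketch of that calculation, assuming the API returns `date_last_doi` as an ISO date string or null. The function name and the default one-year cutoff are only illustrative; the actual criteria are still to be chosen.

```python
from datetime import date, timedelta
from typing import Optional


def is_currently_publishing(
    date_last_doi: Optional[str],
    today: date,
    max_gap: timedelta = timedelta(days=365),  # assumed cutoff, not a decided rule
) -> Optional[bool]:
    """Return True/False from the time since the last DOI, or None if unknown."""
    if date_last_doi is None:
        return None  # null date: no decision can be made from this field alone
    last = date.fromisoformat(date_last_doi)
    return (today - last) <= max_gap
```

Passing `today` in explicitly keeps the check reproducible, and returning `None` for a null date keeps "unknown" separate from "not publishing".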

Do you think this is feasible?

If it is, Heather said it's okay to do this in a month or so, I imagine after the OpenAlex stuff is wrapped up.

caseydm commented 2 years ago

Hi @sckott. Sure I can work on this and get it implemented. Hopefully this week.

sckott commented 2 years ago

Thanks!

sckott commented 2 years ago

Thanks for getting this implemented Casey!

Still seeing a number of nulls in the date_last_doi field. What leads to a null? What does it mean exactly?

caseydm commented 2 years ago

Hi Scott. The way we get the data is by calling the Crossref API against each ISSN-L like this:

https://api.crossref.org/journals/2406-7768/works?sort=published&rows=1&mailto=team@ourresearch.org

Then we store the created date from the response. I tested quite a few nulls and most give a response like this:

https://api.crossref.org/journals/2052-3963/works?sort=published&rows=1&mailto=team@ourresearch.org
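The lookup described above can be sketched roughly as follows. This is not the actual journalsdb script: the helper names are hypothetical, and treating an empty `items` list as a null is an assumption about what the null cases look like.

```python
from typing import Optional
from urllib.parse import urlencode


def crossref_latest_work_url(issn: str, mailto: str) -> str:
    """Build the Crossref query for the most recently published work of a journal."""
    query = urlencode({"sort": "published", "rows": 1, "mailto": mailto})
    return f"https://api.crossref.org/journals/{issn}/works?{query}"


def first_item_field(response: dict, field: str) -> Optional[dict]:
    """Pull one field (e.g. "created") from the first work in a parsed response."""
    items = response.get("message", {}).get("items", [])
    if not items:
        return None  # assumed null case: Crossref returned no works for this ISSN
    return items[0].get(field)
```

The parsed JSON from the URL would be passed to `first_item_field(response, "created")`; the HTTP call itself is left out so the sketch stays independent of any particular client library.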

What I will try to do is modify the script to try all of the ISSNs associated with the ISSN-L if there is nothing found in the first pass. Maybe that will decrease the number of nulls.
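A rough sketch of that fallback, assuming a `lookup` callable (hypothetical) that returns a date string for one ISSN, or `None` when Crossref has nothing for it:

```python
from typing import Callable, Iterable, Optional


def date_last_doi_with_fallback(
    issn_l: str,
    other_issns: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try the ISSN-L first, then each remaining ISSN, and stop at the first hit."""
    candidates = [issn_l] + [i for i in other_issns if i != issn_l]
    for issn in candidates:
        found = lookup(issn)
        if found is not None:
            return found
    return None  # still null: no ISSN for this journal returned a date
```

Stopping at the first hit keeps the number of extra requests low, since only journals that were null on the first pass cost additional calls.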

caseydm commented 2 years ago

Ok that script is running now and it's helping a lot so far! There were 13,500 journals with null date_last_doi and now there are 10,102. Hopefully it keeps going down.

sckott commented 2 years ago

Thanks for clarifying! And for making an improvement.

I still haven't incorporated it into Unsub yet, but it's on my to-do list. I'll have more feedback then, I'm sure.

sckott commented 2 years ago

I've got the Xplenty job updated to incorporate the new field.

Working on integrating into the unsub backend code.

There are still ~10K records with null for date_last_doi. One thing I notice looking at the nulls is that a lot of them are book series. Do you happen to have a type field like Crossref has that's not exposed in the API?

sckott commented 2 years ago

Casey, how often is date_last_doi updated? Just so I know

sckott commented 2 years ago

Running through some examples to make sure I'm getting this as right as possible. See the spreadsheet https://docs.google.com/spreadsheets/d/1ub9pwULU0S6GIkv_yQG4LmPgc7rij5An4AuTQGQtHaM/edit#gid=169462891

The spreadsheet has all publishers, but I've filtered to just the big 5 we support in Unsub for now. There is a column for date_last_doi, a column for the is_currently_publishing field as it's currently determined using dois_by_issued_year, and an is_currently_publishing_new column using the date_last_doi field (if the date in date_last_doi is less than 1 year from today, then it's currently publishing). The "diff" column is whether the old and new methods differ. The "change correct?" column is whether the change seems correct or not. And there's a "notes" column.

I've investigated the top 13 rows so far.

Seems that you're pulling from the created field in Crossref API response - via https://github.com/ourresearch/journalsdb/blob/f020956faa117e7c78e21d36097b71c78f521723/operations/status/status_date_last_doi.py#L27

Maybe we should reconsider that choice? That is, from what I've seen, when articles are given a DOI long after publication date, we're going to get false positives for still publishing. The spreadsheet has a number of examples. E.g., papers published in 1984, 1985, and even in 1910 - have created years of 2021 or 2022.

Should we use issued or published instead?

Or maybe I'm missing a good reason to use created?

caseydm commented 2 years ago

Hi Scott. The script is set to run every 24 hours. I'm going to check on it shortly and make sure it's running well.

As to using the created field, yes I agree that those two options are better. From looking at a few examples they are typically the same date. So I will change it to published.

caseydm commented 2 years ago

@sckott there is one wrinkle we need to work through by using the published date. I noticed for a lot of these the published dates are older than the created, like this:

https://api.crossref.org/journals/0956-5000/works?sort=published&rows=1&mailto=team@ourresearch.org

But we set the script to only update the last DOI date if the date is newer, due to setting some of the fields manually. I think the most accurate way to do this is to set all the dates to the current published date, then go back and manually set those dates again for the journals that you sent me. Do you have that initial list? Then we can start the script again, only updating if a newer date is found. But now it will be using the more accurate published field rather than the created field. What do you think? It's a bit confusing so we can do a Google Meet sometime if needed.
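To make the two phases concrete, here is a small sketch of the update rule. It assumes a one-time reset pass followed by the usual "only if newer" passes; the function and the `force_reset` flag are hypothetical and not taken from the script.

```python
from datetime import date
from typing import Optional


def next_date_last_doi(
    stored: Optional[date],
    fetched: Optional[date],
    force_reset: bool = False,
) -> Optional[date]:
    """Decide which date to keep after one Crossref lookup."""
    if fetched is None:
        return stored  # nothing found, so keep whatever is stored
    if force_reset or stored is None or fetched > stored:
        return fetched  # reset pass, first value, or a genuinely newer date
    return stored  # normal pass: never move the date backwards
```

With `force_reset=True` an older published date can overwrite a newer created date. Once that pass is done and the manual values are re-applied, running with the default keeps them from being replaced by anything older.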

sckott commented 2 years ago

Okay, every 24 hrs, thanks.

Thanks for digging into this! Taking care of our kid today - I'll reply on this tomorrow morning ...

sckott commented 2 years ago

I see what you mean about using published, and having to restart it since some published dates are older than created.

I'll see if I can dig up the ones I asked you to set manually. I don't have a list of them, but I'll see if I can find them.

caseydm commented 2 years ago

Ok great. I paused the update for now so we can get this implemented. I’ll see if I can find that list as well.

sckott commented 2 years ago

I updated the spreadsheet - now we have about 50 rows filled out for whether the change for still publishing is correct or not using the date_last_doi field. Could be useful for you in terms of highlighting different scenarios.

sckott commented 2 years ago

For date_last_doi set manually, the only one I see is for Scientific American via this thread

caseydm commented 2 years ago

Ok sounds good! The updated script is running now. So last known dates are being set to the published date.

sckott commented 2 years ago

Thanks!

sckott commented 2 years ago

Casey, what's the status of the updated script? I checked and date_last_doi has changed for some titles but not for others - I assume it hasn't run through all titles yet?

caseydm commented 2 years ago

Hi Scott. It's run through several times. I think we need to add a "current as of" field to see progress, but we already have a "current as of" for the other publishing status field. I'll email you to set up a Zoom meeting so we can discuss.

sckott commented 2 years ago

I pulled the status column into our unsub database table. The distinct values for status are: incorporated, unknown, renamed, publishing, ceased

What do incorporated and renamed represent and when are they assigned?

sckott commented 2 years ago

hi again. Looked at the updated data for date_last_doi and it looks like there's a lot of entries for the date 2021-01-01 (367) and 37 for 2022-01-01, see tab v3 in the spreadsheet

If this is the code you're using https://github.com/ourresearch/journalsdb/blob/a489f44d3f5d5ac113ceeaabdc49f8d1b6a7b09d/operations/status/status_date_last_doi.py#L33-L46 it looks like if you don't get a 3rd element (the day) you fail out and just go with year. Could we add another try at getting the year and month alone without day? That would get us from 2021-01-01 to 2021-12-01 if the published date in Crossref is e.g., {'date-parts': [[2021, 12]]} - which is much more accurate.
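Something along these lines is what I have in mind. It's only a sketch and not a patch against the linked file; the function name is made up, and filling a missing month or day with 1 is the assumption being proposed.

```python
from datetime import date
from typing import Optional


def date_from_parts(published: Optional[dict]) -> Optional[date]:
    """Convert a Crossref partial date like {'date-parts': [[2021, 12]]} to a date."""
    try:
        parts = published["date-parts"][0]
        year = int(parts[0])
        # Use whatever precision is present; a missing month or day becomes 1.
        month = int(parts[1]) if len(parts) > 1 else 1
        day = int(parts[2]) if len(parts) > 2 else 1
        return date(year, month, day)
    except (TypeError, KeyError, IndexError, ValueError):
        return None  # missing, empty, or malformed date-parts
```

So `[[2021, 12]]` gives 2021-12-01 rather than 2021-01-01, and a year-only value still falls back to January 1.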

What do you think?

sckott commented 2 years ago

Another issue: the Crossref API is returning different results for different ISSNs for the same title. e.g, compare these two

https://api.crossref.org/journals/1687-8329/works?sort=published&rows=1&mailto=scott@ourresearch.org

https://api.crossref.org/journals/1110-1083/works?sort=published&rows=1&mailto=scott@ourresearch.org

The former giving 2022 and the latter giving 2016.

I don't know how widespread this issue is. Would it be possible, for each title, to make a request for every associated ISSN and then take the most recent published date from those requests? Or does that not make sense?
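Roughly what I mean, as a sketch. The `lookup` callable is hypothetical and stands in for one Crossref request per ISSN, returning a date or `None`:

```python
from datetime import date
from typing import Callable, Iterable, Optional


def latest_published_date(
    issns: Iterable[str],
    lookup: Callable[[str], Optional[date]],
) -> Optional[date]:
    """Query every ISSN for a title and keep the most recent date found."""
    dates = [d for d in (lookup(issn) for issn in issns) if d is not None]
    return max(dates, default=None)  # None when no ISSN returned a date
```

This costs one request per ISSN for every title instead of stopping at the first hit, so it's slower, but for the pair above it would return the 2022 date regardless of which ISSN is the ISSN-L.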