Open sckott opened 2 years ago
Hi @sckott. Sure I can work on this and get it implemented. Hopefully this week.
Thanks!
Thanks for getting this implemented Casey!
Still seeing a number of nulls in the date_last_doi field. What leads to a null? What does it mean exactly?
Hi Scott. The way we get the data is by calling the crossref API against each ISSN-L like this:
https://api.crossref.org/journals/2406-7768/works?sort=published&rows=1&mailto=team@ourresearch.org
Then store the created date in the response. I tested quite a few nulls and most give a response like this:
https://api.crossref.org/journals/2052-3963/works?sort=published&rows=1&mailto=team@ourresearch.org
What I will try to do is modify the script to try all of the ISSNs associated with the ISSN-L if there is nothing found in the first pass. Maybe that will decrease the number of nulls.
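A minimal sketch of that fallback, not the actual journalsdb script. The URL shape follows the requests above; `fetch_items` is a hypothetical callable that returns the list of works Crossref sends back for one ISSN, and the sketch assumes a journal with no match produces an empty list.

```python
from typing import Callable, Iterable, Optional, Tuple
from urllib.parse import urlencode

CROSSREF_JOURNALS = "https://api.crossref.org/journals"


def latest_work_url(issn: str, mailto: str) -> str:
    """Build the 'most recently published work' URL for one ISSN."""
    # safe="@" keeps the mailto address readable, as in the URLs in this thread.
    query = urlencode({"sort": "published", "rows": 1, "mailto": mailto}, safe="@")
    return f"{CROSSREF_JOURNALS}/{issn}/works?{query}"


def first_issn_with_works(
    issn_l: str,
    other_issns: Iterable[str],
    fetch_items: Callable[[str], list],
) -> Optional[Tuple[str, dict]]:
    """Try the ISSN-L first, then every other ISSN, until one returns a work.

    fetch_items is a hypothetical helper (an assumption, not part of any
    library): it performs the HTTP request and returns the list of items.
    """
    for issn in [issn_l, *other_issns]:
        items = fetch_items(issn)
        if items:
            return issn, items[0]
    return None  # still null after trying every ISSN
```

Passing the fetcher in as an argument keeps the fallback logic testable without hitting the API.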
Ok that script is running now and it's helping a lot so far! There were 13,500 journals with null date_last_doi and now there are 10,102. Hopefully it keeps going down.
Thanks for clarifying! And for making an improvement.
I still haven't incorporated it into Unsub yet, but it's on my to-do list. I'll have more feedback then, I'm sure.
I've got the Xplenty job updated to incorporate the new field.
Working on integrating into the unsub backend code.
There's still ~10K records with null for date_last_doi. One thing I notice looking at the nulls is that a lot of them are book series. Do you happen to have a type field like Crossref has that's not exposed in the API?
Casey, how often is date_last_doi updated? Just so I know.
Running through some examples to make sure I'm getting this as right as possible. See the spreadsheet https://docs.google.com/spreadsheets/d/1ub9pwULU0S6GIkv_yQG4LmPgc7rij5An4AuTQGQtHaM/edit#gid=169462891
The spreadsheet has all publishers, but I've filtered to just the big 5 we support in Unsub for now. There is a column for date_last_doi, as well as the is_currently_publishing field as it's currently determined using dois_by_issued_year, and an is_currently_publishing_new column using the date_last_doi field (where, if the date in date_last_doi is less than 1 year from today, the journal is currently publishing). The "diff" column is whether the old and new method differ. The "change correct?" column is whether the change seems correct or not. And there's a "notes" column.
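The new rule can be sketched roughly like this. The function name and the 365-day window are illustrative, and returning None for a null date_last_doi is an assumption (the thread hasn't settled how nulls should be treated).

```python
from datetime import date, timedelta
from typing import Optional


def is_currently_publishing_new(
    date_last_doi: Optional[date],
    today: date,
    window_days: int = 365,  # "less than 1 year from today"
) -> Optional[bool]:
    """True if the last DOI is less than window_days old, False otherwise.

    A null date_last_doi yields None (unknown); that choice is an assumption.
    """
    if date_last_doi is None:
        return None
    return (today - date_last_doi) < timedelta(days=window_days)
```

Taking `today` as a parameter makes it easy to re-check spreadsheet rows against a fixed reference date.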
The top 13 rows I've investigated so far.
Seems that you're pulling from the created field in Crossref API response - via https://github.com/ourresearch/journalsdb/blob/f020956faa117e7c78e21d36097b71c78f521723/operations/status/status_date_last_doi.py#L27
Maybe we should reconsider that choice? That is, from what I've seen, when articles are given a DOI long after publication date, we're going to get false positives for still publishing. The spreadsheet has a number of examples. E.g., papers published in 1984, 1985, and even in 1910 - have created years of 2021 or 2022.
Should we use issued or published instead?
Or maybe I'm missing a good reason to use created?
Hi Scott. The script is set to run every 24 hours. I'm going to check on it shortly and make sure it's running well.
As to using the created field, yes I agree that those two options are better. From looking at a few examples they are typically the same date. So I will change it to published.
@sckott there is one wrinkle we need to work through by using the published date. I noticed for a lot of these the published dates are older than the created, like this:
https://api.crossref.org/journals/0956-5000/works?sort=published&rows=1&mailto=team@ourresearch.org
But we set the script to only update the last DOI date if the date is newer, due to setting some of the fields manually. I think the most accurate way to do this is to set all the dates to the current published date, then go back and manually set those dates again for the journals that you sent me. Do you have that initial list? Then we can start the script again, only updating if a newer date is found. But now it will be using the more accurate published field rather than the created. What do you think? It's a bit confusing so we can do a google meet sometime if needed.
Okay, every 24 hrs, thanks.
Thanks for digging into this! Taking care of our kid today - I'll reply on this tomorrow morning ...
I see what you mean about using published, and having to restart it since some published dates are older than created.
I'll see if I can dig up the ones I asked you to set manually. I don't have a list of them, but I'll see if I can find them.
Ok great. I paused the update for now so we can get this implemented. I’ll see if I can find that list as well.
I updated the spreadsheet - now we have about 50 rows filled out for whether the change for still publishing is correct or not using the date_last_doi field. Could be useful for you in terms of highlighting different scenarios.
For date_last_doi set manually, the only one I see is for Scientific American via this thread
Ok sounds good! The updated script is running now. So last known dates are being set to the published date.
Thanks!
Casey, what's the status of the updated script running? I checked and date_last_doi has changed for some titles but not for others - I assume it's not run through all titles yet?
Hi Scott. It's run through several times. I think we need to put a current as of to see progress, but we already have a current as of for the other publishing status field. I'll email you to set up a zoom meeting so we can discuss.
I pulled the status column into our unsub database table. The distinct values for status are: incorporated, unknown, renamed, publishing, ceased
What do incorporated and renamed represent and when are they assigned?
hi again. Looked at the updated data for date_last_doi and it looks like there's a lot of entries for the date 2021-01-01 (367) and 37 for 2022-01-01, see tab v3 in the spreadsheet
If this is the code you're using https://github.com/ourresearch/journalsdb/blob/a489f44d3f5d5ac113ceeaabdc49f8d1b6a7b09d/operations/status/status_date_last_doi.py#L33-L46 it looks like if you don't get a 3rd element (the day) you fail out and just go with year. Could we add another try at getting the year and month alone without day? That would get us e.g., from 2021-01-01 to 2021-12-01 if the published date in Crossref is e.g., {'date-parts': [[2021, 12]]} - which is much more accurate.
What do you think?
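Something along these lines is what I have in mind. It's only a sketch, not a patch against the linked script: the function name is made up, and defaulting a missing month or day to 1 is my assumption about what we'd want.

```python
from datetime import date
from typing import Optional, Sequence


def date_from_parts(date_parts: Sequence[Sequence[Optional[int]]]) -> Optional[date]:
    """Convert Crossref-style date-parts into a date.

    Accepts [[year, month, day]], [[year, month]] or [[year]].
    A missing month or day defaults to 1 (an assumption, not Crossref behavior).
    """
    if not date_parts or not date_parts[0]:
        return None
    parts = list(date_parts[0])[:3]
    year = parts[0]
    if year is None:
        return None
    month = parts[1] if len(parts) > 1 and parts[1] else 1
    day = parts[2] if len(parts) > 2 and parts[2] else 1
    return date(year, month, day)
```

With this, `date_from_parts([[2021, 12]])` gives 2021-12-01 rather than falling back to 2021-01-01.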
Another issue: the Crossref API is returning different results for different ISSNs for the same title. e.g, compare these two
https://api.crossref.org/journals/1687-8329/works?sort=published&rows=1&mailto=scott@ourresearch.org https://api.crossref.org/journals/1110-1083/works?sort=published&rows=1&mailto=scott@ourresearch.org
The former giving 2022 and the latter giving 2016.
I don't know how widespread this issue is. Would it be possible, for each title, to make a request for every associated ISSN and then take the most recent published date from those requests? Or does that not make sense?
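Roughly, as a sketch: `fetch_published` is a hypothetical callable that returns the latest published date Crossref reports for one ISSN (or None when there is no result), so none of this is tied to how the real script makes its requests.

```python
from datetime import date
from typing import Callable, Iterable, Optional


def most_recent_published(
    issns: Iterable[str],
    fetch_published: Callable[[str], Optional[date]],
) -> Optional[date]:
    """Query every ISSN for a title and keep the newest published date.

    fetch_published is a hypothetical helper; ISSNs with no result
    (None) are ignored rather than treated as an error.
    """
    dates = [d for d in (fetch_published(issn) for issn in issns) if d is not None]
    return max(dates) if dates else None
```

For the two ISSNs above, this would keep the 2022 date instead of the 2016 one.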
hi @caseydm
In Unsub, we exclude journals from user dashboards if the journal is not publishing anymore. The way it's currently done is not ideal. It would be helpful to have a new field in the API that indicates the date of the last DOI for the journal. e.g.,
We could then use that date to calculate time since last DOI and determine whether it's publishing anymore based on whatever criteria we choose.
Do you think this is feasible?
If this is done, Heather said okay to do this in a month or so, I imagine after Openalex stuff is wrapped up.