psu-libraries / researcher-metadata

Penn State University's faculty and research metadata repository
https://metadata.libraries.psu.edu/
MIT License

Publications in ScholarSphere do not have OA Locations with source = ScholarSphere #907

Open anaelizabethenriquez opened 1 year ago

anaelizabethenriquez commented 1 year ago

I noticed a few articles in the AI OA Workflow that had already been deposited to ScholarSphere. It looks like the reason for this is that they do not have an OA Location with ScholarSphere as the source (even though they have OA Locations from other sources, e.g. Unpaywall or OA Button, with ScholarSphere URLs). This is impacting 53 publications that have Activity Insight OA Files, which we could end up depositing in ScholarSphere a second time. (I don't know if they are all in the workflow, but at least some are because that's where I found them.)

Looking at some of those, in most cases there is a DOI in the ScholarSphere record, but it's not formatted the way DOIs are formatted in RMD. Is that preventing those from getting imported? If there is a way to clean up the DOI in bulk in RMD (in consultation with the ScholarSphere team @bdezray), that would help us avoid making duplicate deposits.

I suspect this is also impacting records without AI OA files. It would be nice to clean those at the same time, since missing ScholarSphere OA URLs will cause issues with the standard workflow as well.

Here are the 53 publications: 33204 266805 275969 278420 280443 280760 282034 286177 288245 294215 294542 295499 329503 337814 402470 402706 403522 408923 409017 409101 409384 410098 410343 410876 411787 415909 418356 420151 422417 429798 429995 430018 430113 430268 431197 431874 433300 435134 436020 437960 439562 451227 453322 456473 456779 458199 459062 470628 470740 509313 509348 517769 520215

ajkiessl commented 1 year ago

@anaelizabethenriquez It looks like when we built the ScholarSphere importer, we only accounted for DOIs coming from ScholarSphere that start with doi: or https://doi.org/. So, yes, the formatting is the issue here. We do some sanitizing when storing DOIs in RMD to ensure the DOIs always start with https://doi.org/. Applying this sanitizer to the DOI we get from ScholarSphere before trying to match it with something in RMD should fix the issue in the importer. I have a PR open for this change here: #910.
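
For readers following along, here is a minimal sketch of the kind of sanitizing described above: reducing any incoming DOI string to one form that starts with https://doi.org/. It is not the RMD sanitizer itself. The function name `normalize_doi` and the regular expression are assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern for the core of a DOI ("10.<registrant>/<suffix>").
# This is an illustrative assumption, not the rule RMD actually uses.
_DOI_CORE = re.compile(r"10\.\d{4,9}/\S+")


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Return the DOI as 'https://doi.org/10.xxxx/...' or None if none is found."""
    if not raw:
        return None
    # Find the DOI core regardless of what prefix (doi:, resolver URL) precedes it.
    match = _DOI_CORE.search(raw.strip())
    if match is None:
        return None
    return "https://doi.org/" + match.group(0)
```

With this sketch, `doi:10.1234/abc`, `10.1234/abc`, and `https://doi.org/10.1234/abc` all normalize to the same string, so they can be compared directly.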

anaelizabethenriquez commented 1 year ago

@ajkiessl Awesome, thanks. Once your PR is in place, will the URLs for these publications come into RMD on the next import? Or is there some retroactive clean-up we need to do?

ajkiessl commented 1 year ago

@anaelizabethenriquez We shouldn't need to do anything. Those URLs should come in during the next import.

anaelizabethenriquez commented 1 year ago

@ajkiessl I just checked on these and they don't have OA URLs with a source of ScholarSphere yet. Is that what you would expect? I thought it would have been fixed in last night's import.

ajkiessl commented 1 year ago

@anaelizabethenriquez Sorry, I did not deploy this yet to production. I have some other bug fixes I'm waiting to get a review on that I want to deploy with this.

anaelizabethenriquez commented 1 year ago

@ajkiessl No problem! Sorry to bother you. Can you ping me once it's in production?

ajkiessl commented 1 year ago

@anaelizabethenriquez No problem. Will do!

ajkiessl commented 1 year ago

@anaelizabethenriquez This has been deployed to production. Those ScholarSphere URLs should be added this evening with the next import. Let me know if this doesn't work.

anaelizabethenriquez commented 1 year ago

@ajkiessl I don't think it worked. Of the publications that have an Activity Insight postprint status, I still see 48 that have an OA URL containing the string "scholarsphere" but don't have an OA URL with the source "scholarsphere". (It's down to 48 from 53 because I fixed a few manually that I didn't think would get fixed automatically because the ScholarSphere metadata didn't have a DOI at all.)
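
The check described above (an OA URL containing "scholarsphere" but no OA Location whose source is ScholarSphere) can be sketched as a small filter. The record shape used here (`id`, `oa_locations`, `url`, `source`) is an assumption for illustration and may not match an actual RMD export.

```python
from typing import Dict, Iterable, List


def missing_scholarsphere_source(publications: Iterable[Dict]) -> List[int]:
    """Return IDs of publications that link to ScholarSphere without a ScholarSphere source."""
    flagged = []
    for pub in publications:
        locations = pub.get("oa_locations", [])
        # True if any OA URL mentions ScholarSphere, whatever its source is.
        has_url = any(
            "scholarsphere" in (loc.get("url") or "").lower() for loc in locations
        )
        # True if any OA Location was actually imported from ScholarSphere.
        has_source = any(
            (loc.get("source") or "").lower() == "scholarsphere" for loc in locations
        )
        if has_url and not has_source:
            flagged.append(pub["id"])
    return flagged


# Made-up sample records, not real RMD data.
sample = [
    {"id": 1, "oa_locations": [
        {"url": "https://scholarsphere.example/resources/abc", "source": "Unpaywall"}]},
    {"id": 2, "oa_locations": [
        {"url": "https://scholarsphere.example/resources/def", "source": "ScholarSphere"}]},
    {"id": 3, "oa_locations": []},
]
flagged_ids = missing_scholarsphere_source(sample)
```

Only publication 1 in the sample is flagged: it has a ScholarSphere URL, but the location came from another source.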

Also, at 9:30 a.m. on November 15, I recorded the number of publications with a ScholarSphere OA URL, using the view in the admin console. At that time it was 2768. Now it is 2774. I would expect a much bigger jump. A couple of the 6 new ones are new ScholarSphere deposits by users with the Standard OA workflow. The others are probably the ones I fixed manually.

Hoping it's just that the import didn't run for some reason, but if there's other troubleshooting to be done, please let me know how I can help.

ajkiessl commented 1 year ago

@anaelizabethenriquez I dug deeper into the way the data comes out of the ScholarSphere API and found that I was wrong about the doi: / https://doi.org/ formatting. Everything comes out of the API with doi: prepended, followed by the rest of the DOI starting with 10. So, I reverted the changes I made. After taking a look at a larger subset of the records you provided above, I found that most of them have uppercase characters in their DOIs in RMD that are lowercase in ScholarSphere. This appears to have been what was preventing them from matching. I changed the importer to do a case-insensitive match when importing. I deployed those changes, ran the import, and that seems to have fixed the issue. I did come across one where the DOI is different between RMD and ScholarSphere (record 453322 in RMD and https://scholarsphere.psu.edu/resources/e53dd903-6836-4d5b-b8a9-d259af4cd76c in ScholarSphere). There may be more like that. I'm not sure what the cause is for those.
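
As an illustration of the case-insensitive match described above, here is a minimal sketch. It is not the importer's actual code. The names `doi_key` and `match_publication`, the prefix list, and the sample DOIs are all assumptions.

```python
from typing import Dict, Optional


def doi_key(doi: str) -> str:
    """Strip a leading 'doi:' or resolver prefix and lowercase the remainder."""
    value = doi.strip()
    lowered = value.lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    # DOI names are case-insensitive, so compare them in a single case.
    return value.lower()


def match_publication(scholarsphere_doi: str, rmd_dois: Dict[int, str]) -> Optional[int]:
    """Return the ID of the RMD publication whose DOI matches, ignoring case."""
    index = {doi_key(doi): pub_id for pub_id, doi in rmd_dois.items()}
    return index.get(doi_key(scholarsphere_doi))


# Made-up DOIs and IDs for illustration only.
rmd = {1: "https://doi.org/10.1000/ABC.Def", 2: "https://doi.org/10.1000/xyz"}
matched = match_publication("doi:10.1000/abc.def", rmd)
```

In the sample, the mixed-case DOI stored under ID 1 matches the lowercase `doi:`-prefixed value. A DOI that genuinely differs, as with the record mentioned above, still returns no match.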

anaelizabethenriquez commented 1 year ago

Thanks, @ajkiessl! That seems to have done the trick. Of the 48 from this morning, only 5 remain. For at least one of those (your example above), our system is correctly distinguishing between two publications with similar titles that are not the same and have different DOIs. I'll see if I can do anything about that and also review the other 4 manually. But the programmatic stuff is working great.

anaelizabethenriquez commented 1 year ago

@ajkiessl I posted that too soon. There is one publication for which I can't figure out why the link isn't being imported: 517769 and https://scholarsphere.psu.edu/resources/7e9d0f5b-5e62-4c7d-92e5-84be9f65b315/

Please feel free to say I should just fix this manually. Passing it along just in case you notice something that could be affecting more records.

ajkiessl commented 1 year ago

@anaelizabethenriquez That one was a Solr indexing issue in ScholarSphere. From what I can tell, the DOI was updated on November 13th, but for some reason the record was not reindexed. So, it was not available in the Solr search and thus not available via the DOI API endpoint. I manually reindexed that record, so it should get imported this evening during the ScholarSphere import. I'll have to look into what's up with Solr indexing in ScholarSphere sometime next week.

anaelizabethenriquez commented 9 months ago

@ajkiessl Did the Solr issue mentioned above get fixed? We ran into an issue with duplicate deposits in ScholarSphere that I thought might be related to this. I just checked the 2019-2023 pubs export that you shared a few weeks back, and I see 156 publications in that export that have an OA URL containing "scholarsphere" but no "source" of ScholarSphere. I'm attaching a list of the pub IDs: Missing-ScholarSphere-source_publications_2019-2023_2024-01-18-09h33m59.txt Those are from the 18th, so some may no longer be an issue. I will fix a few manually now for the user who is having the problem with duplicates.

ajkiessl commented 9 months ago

@anaelizabethenriquez The fix for this is not yet deployed to production. At the time of this happening I don't think we fully understood what was causing it. Looks like this issue is very likely to be the cause: https://github.com/psu-libraries/scholarsphere/issues/1440 . I just finished up some code that fixes this. It will be deployed to production early next week.

anaelizabethenriquez commented 9 months ago

@ajkiessl Awesome. Thank you!