psu-libraries / researcher-metadata

Penn State University's faculty and research metadata repository
https://metadata.libraries.psu.edu/
MIT License

Duplicate deposits to ScholarSphere #755

Open anaelizabethenriquez opened 1 year ago

anaelizabethenriquez commented 1 year ago

Paulina has reported that ScholarSphere is getting duplicate deposits of articles from RMD. The example I have right now is RMD publication 533969, which has the following two deposits:

That publication has only one import associated with it, so I don't think it's the product of a duplicate merge. It's not in a duplicate group now. And I don't see any other records in RMD that look like they ought to be merged with it.

Both ScholarSphere records are linked from this publication. I don't understand how the user was able to deposit the article a second time, since it was already in ScholarSphere.

Update: Here are some additional examples.

473264 (This one does have two imports associated with it.)

437923 (This one has four imports associated with it. It has four ScholarSphere links associated in RMD, but I wonder if some of these are work IDs versus version IDs, because this is an older one.)

ajkiessl commented 1 year ago

@anaelizabethenriquez The title of the first one seemed familiar to me, so I checked the scholarsphere work deposit for that publication. It was one of the records affected by this issue: https://github.com/psu-libraries/researcher-metadata/issues/737. I think what happened was that, despite the ScholarSphere API throwing an error, it was still creating a record. RMD then set the scholarsphere work deposit to "Failed". When I fixed the bug, I resent the work deposit, creating another record in ScholarSphere.

The second one I'm less sure of, but it does have two scholarsphere work deposits with errors, so it may have been a similar problem where it created a record despite the error. Then, it was uploaded again and it created another record.

Edit: After thinking through the second one again, I'm fairly sure that the error was caused by a bug in RMD. The upload was likely tried twice, since the first one didn't work. Then when I fixed the bug, I accidentally resent both to ScholarSphere.

I'll have to check the file ingest code for ScholarSphere. It probably shouldn't be creating records if it's returning anything other than a 200 (success) response from the API endpoint. The RMD is designed to believe that nothing was created during deposit unless the ScholarSphere deposit returns a 200.

The last one is a little trickier. It looks like they are actually two pairs of different submissions. The title, the publication date, and the spelling of one of the creators' names differ. There are only two scholarsphere work deposits in RMD for this publication, but four different imports. Looking at the authorships associated with both scholarsphere work deposits, it looks like the titles match the two different titles in ScholarSphere. So I'm assuming these were duplicates that weren't grouped last year, the user uploaded them to ScholarSphere, and then they were grouped later. This could have happened with the new grouping logic added in the Fall. I'm not too sure, though, how the other two records were created. There are no errors associated with their scholarsphere work deposits.

ajkiessl commented 1 year ago

Okay, I figured out the rest of it. Judging by the ScholarSphere ingest endpoint code, we aren't stopping the creation of a record if there's a PsuIdentity::SearchService::NotFound error like we encountered with the first example above. I can add an issue to the ScholarSphere backlog to handle this differently (here: https://github.com/psu-libraries/scholarsphere/issues/1411). We should either create the record and return a 200 with some kind of warning message, or return a 400 and not create the record.
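To make the two options concrete, here is a minimal Python sketch of an ingest handler. It is not the ScholarSphere endpoint code: `ingest`, `IngestResponse`, `lookup_depositor`, and `store` are hypothetical names, and `DepositorNotFound` is only a stand-in for PsuIdentity::SearchService::NotFound. The point is that the status code and the side effect always agree.

```python
from dataclasses import dataclass, field


class DepositorNotFound(Exception):
    """Hypothetical stand-in for PsuIdentity::SearchService::NotFound."""


@dataclass
class IngestResponse:
    status: int
    message: str
    warnings: list = field(default_factory=list)


def ingest(metadata, lookup_depositor, store, strict=True):
    """Create a record in `store` and report what happened.

    strict=True  -> respond 400 and create nothing.
    strict=False -> create the record and respond 200 with a warning.
    """
    warnings = []
    try:
        depositor = lookup_depositor(metadata["depositor"])
    except DepositorNotFound as err:
        if strict:
            # Fail before anything is written, so a non-200 really means
            # "nothing was created".
            return IngestResponse(400, f"Depositor lookup failed: {err}")
        depositor = None
        warnings.append(f"Depositor lookup failed: {err}")

    # The write happens only after every failure path has returned.
    store.append({"title": metadata["title"], "depositor": depositor})
    return IngestResponse(200, "Record created", warnings)
```

Either branch keeps RMD's assumption valid: a record exists if and only if the response was a 200.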

For the last example above, those are in fact work:work_version pairs. These must've gotten around our attempts to clean these out of our OpenAccessLocations. Not too sure how, though.

anaelizabethenriquez commented 1 year ago

Thanks @ajkiessl! Sounds like with the issue you added to the ScholarSphere backlog, we should be covered on this. Looping in @PaulinaKrys so she can see all this. Paulina, if you see any more examples of this, please let me know.

PaulinaKrys commented 1 year ago

Thanks @ajkiessl and @anaelizabethenriquez! I actually went through some more deposits and found a couple of additional example pairs for you to take a look at. All cases appear to be more similar to the last example from earlier where the metadata varied between submissions (formatting of the author names, etc.):

ajkiessl commented 1 year ago

@PaulinaKrys Correct, these are all a product of duplicates in RMD that were not grouped at some point. 2 and 4 have since been grouped and merged. 1 and 3 have not yet been grouped. I'm not sure why those two are still ungrouped. They seem to fit the criteria: similar titles, similar publication dates, and the same DOI. I'll have to investigate this further.
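For reference while investigating, the criteria above can be sketched roughly as follows. This is an illustrative Python approximation, not RMD's actual grouping logic: `normalize_title`, `looks_like_duplicate`, the `year` field, and the 0.9 threshold are all assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase the title and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def looks_like_duplicate(a, b, threshold=0.9):
    """Return True when two publication dicts plausibly describe one work."""
    # Two different DOIs rule a pair out; a missing DOI does not.
    if a.get("doi") and b.get("doi"):
        if a["doi"].lower() != b["doi"].lower():
            return False
    if a["year"] != b["year"]:
        return False
    ratio = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return ratio >= threshold
```

Stripping punctuation and whitespace before comparing is what lets titles that differ only in hyphenation or spacing compare as equal.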

avshoff commented 11 months ago

@anaelizabethenriquez I will be adding a few duplicate deposits in ScholarSphere from RMD. I'll drop the links, but please let me know if there's any other information I can provide that would be helpful.

Lagoa - Parsimonious System Identification from Quantized Observations

avshoff commented 11 months ago

Lagoa - HOLNET: A Holistic Traffic Control Framework for Datacenter Networks

avshoff commented 11 months ago

Lagoa - Probabilistic Discrete Time Robust H2 Controller Design

anaelizabethenriquez commented 11 months ago

Related issues #504 #386

avshoff commented 11 months ago

McCrudden - Effects of emotions, topic beliefs, and task instructions on the processing and recall of a dual-position text

avshoff commented 10 months ago

Bakis - Effect of Moisture on the Tensile Properties of Composites with Bio-based Fibers and Matrix

avshoff commented 9 months ago

Rosenbaum - Coping with childhood maltreatment: Avoidance and eating disorder symptoms (deposited 8/1/22)

avshoff commented 9 months ago

Albert - synergy: A Python library for calculating, analyzing, and visualizing drug combination synergy (deposited 8/1/22)

avshoff commented 8 months ago

Price: Fine-structure-resolved rovibrational transitions for SO+H2 collisions/Fine-structure resolved rovibrational transitions for so + H2collisions (deposited July 25, 2022)

avshoff commented 7 months ago

Muir: Initial development of the Dance Imagery Questionnaire for Children (DIQ-C): establishing content validity (deposited December 6, 2023)

avshoff commented 6 months ago

Pincus: Examining the Structure and Validity of Self-Report Measures of Dsm-5 Alternative Model for Personality Disorders Criterion A (deposited July 20, 2022)

avshoff commented 5 months ago

Case: Curved Versions of the Ovsienko–Redou Operators (deposited January 23, 2024)

anaelizabethenriquez commented 5 months ago

@ajkiessl Want to make sure you see @avshoff's comment above about Case: Curved Versions of the Ovsienko–Redou Operators (deposited January 23, 2024).

This is a weird one, because both deposits were made with the AI OA Workflow, using the same file, on the same publication record. I didn't think that was possible.

ajkiessl commented 5 months ago

@anaelizabethenriquez My assumption here is that when depositing, an admin opened the deposit page twice for the same record in separate tabs and then submitted each one. While I'm not 100% certain, this is the only thing I could think of that could lead to this. I was also able to recreate this behavior on QA. Ideally there should be something in the code to stop this from happening, but there currently is not. I think that should be an easy fix, though, so I can look into patching it ASAP.

EricDurante commented 5 months ago

Since I'm here lurking...

What about a double-click on the submit button? Could that have caused it also? Either way, the solution is probably the same.

ajkiessl commented 5 months ago

Ah, thanks Eric. Yes, that must be it, and it's more likely than my scenario. I tested that out in QA and it created duplicates. I should have a PR ready for this fix soon.
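One way to guard against both the two-tab and the double-click case is to refuse a second submission on the server while a deposit for the same publication is pending or has already succeeded. The Python sketch below only illustrates that idea and is not the code in the PR: `DepositRegistry` and `submit_deposit` are hypothetical, and a real application would more likely enforce this with a database uniqueness constraint than with an in-memory lock.

```python
import threading


class DepositRegistry:
    """In-memory stand-in for a deposits table with a uniqueness rule."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}  # publication_id -> "pending" | "success" | "failed"

    def try_start(self, publication_id):
        """Claim the publication; return False if a deposit already exists."""
        with self._lock:
            if self._status.get(publication_id) in ("pending", "success"):
                return False
            self._status[publication_id] = "pending"
            return True

    def finish(self, publication_id, succeeded):
        with self._lock:
            self._status[publication_id] = "success" if succeeded else "failed"


def submit_deposit(registry, publication_id, send):
    """Send the deposit at most once per publication."""
    if not registry.try_start(publication_id):
        return "duplicate"
    try:
        send(publication_id)
    except Exception:
        # A failed attempt releases the claim so the deposit can be retried.
        registry.finish(publication_id, succeeded=False)
        raise
    registry.finish(publication_id, succeeded=True)
    return "deposited"
```

Disabling the submit button after the first click helps too, but only a server-side check also covers the separate-tabs scenario.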

ajkiessl commented 5 months ago

@anaelizabethenriquez The double-clicking issue has been resolved and deployed to production.