Open anaelizabethenriquez opened 1 year ago
> Do we need to delete files in order to ensure that, or will something else take care of that?
I don't think there's anything in the automation/processing that could ensure a file in Reason 2 doesn't make it all the way through the workflow. That's assuming it's something that fits all the criteria to go into ScholarSphere. There's a chance the Activity Insight user deletes the file before we've downloaded and stored the PDF, in which case we would end up with a download error and the file record would not proceed in the workflow.
> Is there a way for the admin to tell either which file was added most recently or which file (if any) was in Activity Insight at the time of the most recent import? Or conversely, to see that a file is no longer in Activity Insight (maybe from https://github.com/psu-libraries/researcher-metadata/pull/821)?
If the publication has multiple ActivityInsightOAFiles, you could compare the created_at timestamps of each record in the Rails Admin interface to see which is the newest. If we wanted something automated, we could potentially take the data we are getting from #821 and send a request out to the Activity Insight API looking for that file record. This could be something we implement in the metadata detail page (#715). If it's found, we could display something on that page to indicate the file is still present in Activity Insight. If it's not found, then indicate otherwise.
If we were to delete files in RMD during import that we have determined no longer have a file in Activity Insight, would there be instances where we wouldn't want to delete the file? Like if a file went all the way through the workflow, was deposited in ScholarSphere, then for some reason a user deletes the file in Activity Insight. Would we still want that file record in RMD? Another option could be to set an is_active (or something like that) boolean on the file record.
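To make the flag option concrete, here is a minimal plain-Ruby sketch of what reconciling during import might look like. Everything in it is an assumption for illustration: the `OAFile` struct stands in for the real file record, `location` is a guess at how a file would be matched against Activity Insight, and `imported_locations` is a hypothetical list of the file locations seen in the current import. It is not the actual importer code.

```ruby
# Hypothetical stand-in for a file record in RMD; field names are assumptions.
OAFile = Struct.new(:id, :location, :is_active, keyword_init: true)

# Sets is_active on each file depending on whether its location still
# appears in the current import. Nothing is deleted, so a file that was
# already deposited keeps its record even if it later disappears from
# Activity Insight.
def reconcile_files(files, imported_locations)
  files.each do |file|
    file.is_active = imported_locations.include?(file.location)
  end
  files
end

files = [
  OAFile.new(id: 1, location: "abc/old.pdf", is_active: true),
  OAFile.new(id: 2, location: "abc/new.pdf", is_active: true)
]

# Only "abc/new.pdf" was present in this (made-up) import.
reconcile_files(files, ["abc/new.pdf"])
inactive_ids = files.reject(&:is_active).map(&:id)
```

With the deletion approach, the same comparison would decide which records to destroy instead of which to flag.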
Digging into #715 a bit more. Looks like we had planned to display the date of the most recent uploaded file in the metadata review list:
> show the PSU user ID of the person who added the publication to Activity Insight, the publication's title, and the date when the most recent file for the publication was imported from Activity Insight
Then, I'm assuming we'd want the most recent file to be the one that gets sent to ScholarSphere if there are multiple files ready to be sent to ScholarSphere for that publication.
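As a rough illustration of that selection rule, the sketch below picks the newest file among those ready to be sent. The `OAFileRecord` struct and the `ready_for_deposit` flag are hypothetical names used only for this example; the real workflow presumably tracks readiness differently.

```ruby
# Hypothetical stand-in for a file record; field names are assumptions.
OAFileRecord = Struct.new(:id, :created_at, :ready_for_deposit, keyword_init: true)

# Returns the most recently created file that is ready for deposit,
# or nil when no file is ready.
def file_to_deposit(files)
  files.select(&:ready_for_deposit).max_by(&:created_at)
end

# Made-up example data.
candidates = [
  OAFileRecord.new(id: 10, created_at: Time.utc(2023, 1, 5), ready_for_deposit: true),
  OAFileRecord.new(id: 11, created_at: Time.utc(2023, 3, 2), ready_for_deposit: true),
  OAFileRecord.new(id: 12, created_at: Time.utc(2023, 4, 1), ready_for_deposit: false)
]

chosen = file_to_deposit(candidates)
```

Here the newest file overall (id 12) is skipped because it is not ready, so the newest ready file is chosen instead.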
> If we were to delete files in RMD during import that we have determined no longer have a file in Activity Insight, would there be instances where we wouldn't want to delete the file? Like if a file went all the way through the workflow, was deposited in ScholarSphere, then for some reason a user deletes the file in Activity Insight. Would we still want that file record in RMD?
I think it would be fine if such files got deleted. ScholarSphere should be the place where we're really trying to save/preserve these.
> Another option could be to set an is_active (or something like that) boolean on the file record.
This other approach seems fine to me too.
> Then, I'm assuming we'd want the most recent file to be the one that gets sent to ScholarSphere if there are multiple files ready to be sent to ScholarSphere for that publication.
Yes, I agree. I saw your comment about this in #716.
To recap from our discussion at stand-up this morning, we'll try to implement this with the approach of deleting the file from RMD during import if the file has been removed from Activity Insight (or creating the is_active field as discussed above -- whatever you think is best).
We can think of two instances where a file that's been removed from Activity Insight could still get deposited. We might try to fix these later:
- If the entire publication record has been deleted from Activity Insight, RMD won't have a way to find out about that.
- If the file has been removed from Activity Insight since the most recent successful import.
Per @ajkiessl in #834:
I can think of a few reasons we'll want to remove files that have been removed from Activity Insight from RMD:
If this were just about Reason 1 or even Reason 3, this would be low priority and we could hold off on this until the next development round. Reason 2 is pretty important, though. Do we need to delete files in order to ensure that, or will something else take care of that? Related to that is how an admin chooses which file to deposit, when there are multiple files associated with a publication. (Those files could come from other Activity Insight user coauthors, but they could also be replacements.) Is there a way for the admin to tell either which file was added most recently or which file (if any) was in Activity Insight at the time of the most recent import? Or conversely, to see that a file is no longer in Activity Insight (maybe from #821)?