psu-libraries / researcher-metadata

Penn State University's faculty and research metadata repository
https://metadata.libraries.psu.edu/
MIT License

When an Activity Insight user removes a file from Activity Insight, remove it from RMD #835

Open anaelizabethenriquez opened 1 year ago

anaelizabethenriquez commented 1 year ago

Per @ajkiessl in #834:

> As for deleting files that no longer exist in Activity Insight, we don't have anything implemented to do this yet. Earlier in development we didn't really have a good way to identify what has been deleted in Activity Insight. We should be able to at least identify when a file has been deleted once this PR is merged: https://github.com/psu-libraries/researcher-metadata/pull/821/files since we'll be storing the Activity Insight ID of the publication with the file record in RMD. We can use this data to determine during import if that publication no longer has a file in Activity Insight and delete it in RMD. We still wouldn't have a way to remove file records when an entire publication record is deleted from Activity Insight. I'm not too sure what the automated solution to that would look like.

I can think of a few reasons we'll want to remove files that have been removed from Activity Insight from RMD:

  1. Storage costs/efficiency
  2. Ensuring that we don't end up depositing a file that the user does not want deposited
  3. Not wasting time manually reviewing metadata for these publications/files (since we probably don't want to deposit these files)

If this were just about Reason 1 or even Reason 3, this would be low priority and we could hold off on this until the next development round. Reason 2 is pretty important, though. Do we need to delete files in order to ensure that, or will something else take care of that? A related question is how an admin chooses which file to deposit when multiple files are associated with a publication. (Those files could come from other Activity Insight user coauthors, but they could also be replacements.) Is there a way for the admin to tell either which file was added most recently or which file (if any) was in Activity Insight at the time of the most recent import? Or conversely, to see that a file is no longer in Activity Insight (maybe from #821)?

ajkiessl commented 1 year ago

> Do we need to delete files in order to ensure that, or will something else take care of that?

I don't think there's anything in the automation/processing that could ensure a file in Reason 2 doesn't make it all the way through the workflow. That's assuming it's something that fits all the criteria to go into ScholarSphere. There's a chance the Activity Insight user deletes the file before we've downloaded and stored the PDF, in which case we would end up with a download error and the file record would not proceed in the workflow.

> Is there a way for the admin to tell either which file was added most recently or which file (if any) was in Activity Insight at the time of the most recent import? Or conversely, to see that a file is no longer in Activity Insight (maybe from https://github.com/psu-libraries/researcher-metadata/pull/821)?

If the publication has multiple ActivityInsightOAFiles, you could compare the created_at timestamps of each record in the Rails Admin interface to see which is the newest. If we wanted something automated, we could potentially take the data we are getting from #821 and send a request out to the Activity Insight API looking for that file record. This could be something we implement in the metadata detail page (#715). If it's found, we could display something on that page to indicate the file is still present in Activity Insight. If it's not found, then indicate otherwise.
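A rough sketch of what that check could look like, in plain Ruby rather than the actual RMD models. Everything named here is an assumption for illustration: `OAFile` stands in for the file record, `ai_publication_id` for the ID stored by #821, and `publication_has_file` for whatever request we would send to the Activity Insight API (it is only stubbed here, not a real client call).

```ruby
# Hypothetical stand-in for a file record that stores the Activity Insight
# publication ID (the data #821 is expected to add).
OAFile = Struct.new(:id, :ai_publication_id, keyword_init: true)

# Returns the label the metadata detail page could display.
# `publication_has_file` is an assumed callable: it takes an Activity Insight
# publication ID and returns true if that publication still has a file there.
def activity_insight_status(file, publication_has_file)
  if publication_has_file.call(file.ai_publication_id)
    'Still present in Activity Insight'
  else
    'No longer in Activity Insight'
  end
rescue StandardError
  # A failed lookup is reported as unknown, not as a deletion.
  'Unknown (lookup failed)'
end

# Stub standing in for the real API request.
STUB_AI_STATE = { 'pub-1' => true, 'pub-2' => false }.freeze
stub_lookup = ->(publication_id) { STUB_AI_STATE.fetch(publication_id) }

puts activity_insight_status(OAFile.new(id: 1, ai_publication_id: 'pub-1'), stub_lookup)
puts activity_insight_status(OAFile.new(id: 2, ai_publication_id: 'pub-2'), stub_lookup)
```

Treating a failed lookup as "unknown" keeps an API outage from looking like a user deleting their file.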

If we were to delete files in RMD during import that we have determined no longer have a file in Activity Insight, would there be instances where we wouldn't want to delete the file? Like if a file went all the way through the workflow, was deposited in ScholarSphere, then for some reason a user deletes the file in Activity Insight. Would we still want that file record in RMD? Another option could be to set an is_active (or something like that) boolean on the file record.
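To make the two options easier to compare, here is a minimal plain-Ruby sketch, not the real import code. `StoredFile`, `deposited`, `reconcile_files`, and the `keep_deposited` switch are all hypothetical names; `ids_with_files` is assumed to be the set of Activity Insight publication IDs that still had a file in the current import.

```ruby
require 'set'

# Hypothetical stand-in for a file record in RMD.
StoredFile = Struct.new(:id, :ai_publication_id, :deposited, :is_active, keyword_init: true)

# Returns the file records that remain after an import.
#   strategy: :delete      drops stale records
#   strategy: :deactivate  keeps them but sets is_active to false
#   keep_deposited: true   leaves already-deposited files untouched
def reconcile_files(files, ids_with_files, strategy: :deactivate, keep_deposited: false)
  files.each_with_object([]) do |file, remaining|
    still_in_ai = ids_with_files.include?(file.ai_publication_id)
    if still_in_ai || (keep_deposited && file.deposited)
      remaining << file
    elsif strategy == :deactivate
      file.is_active = false
      remaining << file
    end
    # With strategy :delete, a stale file is simply left out of the result.
  end
end

files = [
  StoredFile.new(id: 1, ai_publication_id: 'pub-1', deposited: false, is_active: true),
  StoredFile.new(id: 2, ai_publication_id: 'pub-2', deposited: true, is_active: true)
]
remaining = reconcile_files(files, Set['pub-1'], strategy: :delete, keep_deposited: true)
puts remaining.map(&:id).inspect
```

The `keep_deposited` switch is only there to show where the "already in ScholarSphere" question would plug in; whether we want it at all is the open question above.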

ajkiessl commented 1 year ago

Digging into #715 a bit more. Looks like we had planned to display the date of the most recent uploaded file in the metadata review list:

> show the PSU user ID of the person who added the publication to Activity Insight, the publication's title, and the date when the most recent file for the publication was imported from Activity Insight

Then, I'm assuming we'd want the most recent file to be the one that gets sent to ScholarSphere if there are multiple files ready to be sent to ScholarSphere for that publication.
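As a sketch of that selection rule (plain Ruby, with hypothetical names): `ReadyFile` stands in for the file record, and `ready` is an assumed flag meaning the file has cleared the workflow and could be sent to ScholarSphere.

```ruby
# Hypothetical stand-in for a file record; `ready` is an assumed flag.
ReadyFile = Struct.new(:id, :publication_id, :created_at, :ready, keyword_init: true)

# For each publication, pick the most recently created file among those
# that are ready. Returns a hash of publication_id => file.
def files_to_send(files)
  files.select(&:ready)
       .group_by(&:publication_id)
       .transform_values { |group| group.max_by(&:created_at) }
end

files = [
  ReadyFile.new(id: 1, publication_id: 10, created_at: Time.utc(2023, 1, 5), ready: true),
  ReadyFile.new(id: 2, publication_id: 10, created_at: Time.utc(2023, 3, 1), ready: true),
  ReadyFile.new(id: 3, publication_id: 10, created_at: Time.utc(2023, 4, 1), ready: false),
  ReadyFile.new(id: 4, publication_id: 11, created_at: Time.utc(2023, 2, 1), ready: true)
]

files_to_send(files).each do |publication_id, file|
  puts "publication #{publication_id}: send file #{file.id}"
end
```

Filtering on `ready` before taking the newest means a newer file that is still in review does not block an older one that is ready; if we would rather wait for the newest file, the order of those two steps would swap.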

anaelizabethenriquez commented 1 year ago

> If we were to delete files in RMD during import that we have determined no longer have a file in Activity Insight, would there be instances where we wouldn't want to delete the file? Like if a file went all the way through the workflow, was deposited in ScholarSphere, then for some reason a user deletes the file in Activity Insight. Would we still want that file record in RMD?

I think it would be fine if such files got deleted. ScholarSphere should be the place where we're really trying to save/preserve these.

> Another option could be to set an is_active (or something like that) boolean on the file record.

This other approach seems fine to me too.

> Then, I'm assuming we'd want the most recent file to be the one that gets sent to ScholarSphere if there are multiple files ready to be sent to ScholarSphere for that publication.

Yes, I agree. I saw your comment about this in #716.

To recap from our discussion at stand-up this morning, we'll try to implement this with the approach of deleting the file from RMD during import if the file has been removed from AI (or creating the is_active field as discussed above -- whatever you think is best).

We can think of two instances where a file that's been removed from AI could still get deposited. We might try to fix these later:

ajkiessl commented 1 year ago

TODO