Bulkrax Export & Roundtrip issues & questions

ckarpinski commented 11 months ago

ROUND TRIP/Updating did not work - I bulk exported, then reimported to update as a test. ALL of the works were duplicated.
Required and wrong data did not stop the import - It created new works even though required fields were missing. This is not supposed to happen. The only work that failed was one with the wrong word in a controlled field, BUT we were supposed to have this set up to not import anything if there were issues with any work in the entire import, and at some point I tested this and it worked.

I re-exported and removed the page works. You can see the duplicates were all created as GenericWorks. (I didn't have a work type column; nothing said this was required and I did not intend to change it. I think the issue here is that it created new works instead of updating the existing ones. I found in the documentation that for NEW works, if you don't include a work type it will use generic.) https://docs.google.com/spreadsheets/d/1MCxl6neVOfzTBr3Kqifm6-1YRJ_V9eNvqS5N473fBcg/edit?usp=sharing

Here is what I used to reimport: https://docs.google.com/spreadsheets/d/1yqsTtamYUb8CL5iJFjl5471bhCO70ZQHGKO6S9MB1rg/edit?usp=sharing

TESTING on dev and staging - I created 3 new imports with errors, and it still created the works that did not have errors. It used to, and is supposed to, fail the entire importer if there are any errors.

Questions

When you export files and metadata, is there not a field that lists the filenames in the CSV? How do you associate the rows of metadata with the actual files? What am I missing?
What is the field “Label” that gets exported? It's empty.
Source identifier - it looks like any work created through the UI, not Bulkrax, does not have a source identifier. This will only matter if someone wants to update the work and reimport, correct? And if they do this, will they need to assign the source_identifier?
Documentation says if no work type is specified it uses a default - that should only be on newly created works, right? And isn't ours supposed to fail the import instead?
If you include thumbnails in your export, it appears to only put a file name in the metadata; it does not actually export the thumbnail files? What is the purpose of this? For us, to use it as a Digital Library import, it would need to be the URL to the thumbnail or the actual file. I thought we got that set up but I guess not. The Bulkrax wiki says that the files should be in the export, but I only see full size files, not thumbnail files.
Children - what is the point of this field in the export? It's very confusing. If the work is a jpg, for example, what is the child? I would expect this to be the actual file, but the field holds a number, not the file name.
Parent - what is the number that is here? Is it possibly the id for a collection that the work is in?
If you are intending to delete a work that is a PDF - do you also have to delete every page work that got created by the UV gem?

orangewolf commented 10 months ago

We validate the format and year fields. If the field is required on a work type and it does not match the format, we throw an error. What do we do if it is not required? Do we make it blank, or do we still throw an error on that record's import?
All works should have a source_identifier. I agree that works created in the UI do not, and that is a problem. I see a few things we can do about that:
1. All works w/o a source_identifier get a random source identifier.  a) this can apply just to UI works  b) it could apply to both UI created and imported works [b is easier]
2. source_identifier becomes a required field in the UI and via Bulkrax.
I’m not sure what you mean about “on dev and staging - i created 3 new imports with errors, it still created the works that did not have errors. It used to and is suppose to fail the entire importer if there are any errors.” Unless something was done specifically for this project (and I do not see that in the code), the importer has always continued when a single record fails. It DID use to stop if one had a parse error (no source identifier, for example), but that was a bug that was fixed. No source identifier is a tricky state because we can't create an entry w/o the source identifier. So we either randomly fill it in or skip those rows, depending on settings for a given app using Bulkrax.
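
The two options for rows with no source identifier (fill one in, or skip the row) could be sketched roughly as below. This is an illustrative Python sketch, not Bulkrax code: the function name `assign_source_identifiers`, the `fill_missing` flag, and the use of a UUID as the generated value are all assumptions made for the example.

```python
import uuid

def assign_source_identifiers(rows, fill_missing=True):
    """Return the rows that can become entries.

    Hypothetical helper: rows lacking a source_identifier either get a
    generated one (fill_missing=True) or are skipped (fill_missing=False).
    """
    kept = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not mutated
        if not (row.get("source_identifier") or "").strip():
            if not fill_missing:
                continue  # no identifier, so no entry can be built for this row
            row["source_identifier"] = str(uuid.uuid4())
        kept.append(row)
    return kept

rows = [
    {"source_identifier": "etd-001", "title": "Thesis A"},
    {"source_identifier": "", "title": "Made in the UI"},
]
filled = assign_source_identifiers(rows, fill_missing=True)    # both rows kept
skipped = assign_source_identifiers(rows, fill_missing=False)  # second row dropped
```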

ckarpinski commented 10 months ago

Previously, I am pretty sure (I tested it), all of our required fields had to be present for the import to be successful AND our controlled vocab fields had to use the controlled vocab. Meaning:

institution has to be from the list and is required
Type has to be from the list and is required
Format if present has to be from the list, if it is not 4 digit year then it should fail
resource type has to be from the list and is required
for ETD, year has to be 4 number year and is required

Required fields must be there or it fails; controlled vocab must be used or it fails, for both required and optional fields.

Source id - I think if someone creates a work and gives it a source id they created, it needs to keep that source id. I think it would be confusing to have it change. OR would it be that they don't have to create one and the importer assigns one that they can then see when it's exported? I would prefer not to have to make source identifiers in the work form - for a random student adding a work, it seems like an odd thing for them to create.

LAST ONE - yes, Crystal specifically asked me about this and I swear I tested it and it worked. We want the import to fail so that they do it correctly. If only the wrong work fails, they may not realize there was an error. So the plan was: fail the import, they go see what the error is and fix it. When this was first set up I tested this and it worked. I used the same importer test recently (mentioned above) and it did not work that way any longer.
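
To make the requested behavior concrete, here is a minimal sketch of an "all or nothing" import: every row is validated first, and nothing is created if any row has an error. It is written in plain Python for illustration and does not reflect how Bulkrax is implemented; `import_all_or_nothing`, `require_title`, and the shape of the result are hypothetical.

```python
def import_all_or_nothing(rows, validate, create_work):
    """Validate every row first; create works only if all rows pass.

    `validate(row)` returns a list of error strings (empty when valid).
    `create_work(row)` performs the creation. Both are hypothetical
    callables supplied by the caller.
    """
    errors = {}
    for index, row in enumerate(rows, start=1):
        row_errors = validate(row)
        if row_errors:
            errors[index] = row_errors
    if errors:
        # At least one bad row: report every error and create nothing.
        return {"status": "failed", "created": 0, "errors": errors}
    for row in rows:
        create_work(row)
    return {"status": "complete", "created": len(rows), "errors": {}}

def require_title(row):
    return [] if row.get("title") else ["title is required"]

created = []
result = import_all_or_nothing(
    [{"title": "Good work"}, {"title": ""}], require_title, created.append
)
```

In this example the first row is valid, but it is still not created because the second row fails, which is the difference from an importer that continues past single-record failures.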

ckarpinski commented 10 months ago

NOTES from slack

Year only exists on ETD and it is required. (Others have date created and it's not required.)
Institution, type, format, resource type - all are on all work types, format is not required but the other 3 are
if format is wrong it should fail, if it is blank it should succeed
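
The rules in these notes can be written out as a small validation sketch. This is only an illustration in Python, not the project's code: the vocabulary values are placeholders (the real controlled lists are not given in this thread), and the column names and the `"Etd"` work type string are assumptions.

```python
import re

# Placeholder vocabularies; the real controlled lists are not in this thread.
CONTROLLED = {
    "institution": {"Example Seminary"},
    "type": {"Text"},
    "format": {"application/pdf"},
    "resource_type": {"Thesis"},
}
REQUIRED = {"institution", "type", "resource_type"}  # format is optional

def validate_row(row, work_type):
    """Return a list of error strings for one row (empty when valid)."""
    errors = []
    for field, allowed in CONTROLLED.items():
        value = (row.get(field) or "").strip()
        if not value:
            if field in REQUIRED:
                errors.append(f"{field} is required")
            continue  # a blank optional field (format) succeeds
        if value not in allowed:
            errors.append(f"{field} is not in the controlled list")
    if work_type == "Etd":  # year only exists on ETD and is required there
        year = (row.get("year") or "").strip()
        if not re.fullmatch(r"[0-9]{4}", year):
            errors.append("year must be a 4 digit year")
    return errors

base = {"institution": "Example Seminary", "type": "Text", "resource_type": "Thesis"}
blank_format = validate_row({**base, "format": ""}, "GenericWork")
wrong_format = validate_row({**base, "format": "pdf file"}, "GenericWork")
etd_no_year = validate_row(base, "Etd")
```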

orangewolf commented 10 months ago

source identifiers are taken from csv if present but are auto-generated if missing on import
source identifiers are automatically added to manually created works
there is a rake task to run so that source identifiers are added to all existing records
import used to look for ids and if they were present but not found make a new record. it now checks source_identifier AND id and gets any record it can find.
once this is done, round tripping seems to work much better
I believe that the validations are now working correctly BUT the importer status is still not reflecting correctly. that importer status is the last open piece of this work and I suggest we finish this ticket and make a small follow on ticket to track the importer status.
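
For anyone following along, the lookup change described above (match on source_identifier or id, and update whatever is found instead of making a new record) can be pictured with this rough Python sketch. It uses an in-memory list in place of the repository, and `find_existing` and `upsert` are hypothetical names, not Bulkrax methods.

```python
def find_existing(records, source_identifier=None, record_id=None):
    """Return the first record matching either identifier, else None."""
    for record in records:
        if source_identifier and record.get("source_identifier") == source_identifier:
            return record
        if record_id and record.get("id") == record_id:
            return record
    return None

def upsert(records, row):
    """Update the matching record if one exists; otherwise create a new one."""
    existing = find_existing(records, row.get("source_identifier"), row.get("id"))
    if existing is None:
        records.append(dict(row))
        return "created"
    existing.update(row)
    return "updated"

records = [{"id": "abc123", "source_identifier": "etd-001", "title": "Old title"}]
first = upsert(records, {"source_identifier": "etd-001", "title": "New title"})
second = upsert(records, {"id": "abc123", "title": "Newer title"})
third = upsert(records, {"source_identifier": "etd-002", "title": "Another work"})
```

Here the first two rows update the same existing record (one matched by source identifier, one by id), and only the third creates a new record.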

orangewolf commented 10 months ago

run rake cleanup:source_identifier when this code goes to production

crisr15 commented 10 months ago

Passes internal QA:
Works with incorrect validations are failing. Items with the same source identifier/ID are not duplicating when reimported.
https://crystal.atla-hyku.notch8.cloud/importers?locale=en

ckarpinski commented 10 months ago

Questions

When you export files and metadata, is there not a field that lists the filenames in the CSV? How do you associate the rows of metadata with the actual files? What am I missing?
What is the field “Label” that gets exported? It's empty.
If you include thumbnails in your export, it appears to only put a file name in the metadata; it does not actually export the thumbnail files? What is the purpose of this? For us, to use it as a Digital Library import, it would need to be the URL to the thumbnail or the actual file. I thought we got that set up but I guess not. The Bulkrax wiki says that the files should be in the export, but I only see full size files, not thumbnail files.
Children - what is the point of this field in the export? It's confusing. If the work is a jpg, for example, what is the child? I would expect this to be the actual file, but the field holds a number, not the file name.
If you are intending to delete a work that is a PDF - do you also have to delete every page work that got created by the UV gem?

ckarpinski commented 10 months ago

When updating existing works by CSV:

You have to include 3 fields - work type, source identifier and title.
Leave off work type and it creates a new work with the default work type and ?? for source id (does it make a second work with the same source id as the original, or create a new one for it?) BUT if the original works were the default work type it probably doesn't do this.
Leave off source id and it creates a new work with a new Bulkrax-created source id.
Leave off title and it fails, as it should.
The duplicate works that are created do not follow the rules for required fields or controlled vocab.

This seems problematic - Is this a known issue?

Follow up - I exported all the works created, and it created a duplicate work with the exact same source id because it did not have the model field included in the update. You can see an example highlighted here: https://docs.google.com/spreadsheets/d/18occJ3dr0VQiTzq3ifwQDm6kzP0E-HCcXbNUpFDlTqc/edit?usp=sharing

The green highlighted row was supposed to update but instead created a new work with the same source id, because it did not have the model field included in the update.

this was on staging here https://demo.atla-hyku.notch8.cloud/catalog?utf8=%E2%9C%93&search_field=all_fields&q=
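
Until this is resolved, a pre-flight check on the update CSV could catch a missing column before anything is imported. The sketch below is a suggestion only, written in Python and independent of Bulkrax; the column names `model`, `source_identifier`, and `title` are assumed from the discussion above and may differ from the actual field mappings.

```python
import csv
import io

# Assumed column names for the three fields an update needs.
REQUIRED_FOR_UPDATE = ("model", "source_identifier", "title")

def missing_update_columns(csv_text):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {name.strip().lower() for name in header}
    return [name for name in REQUIRED_FOR_UPDATE if name not in present]

sample = "source_identifier,title\netd-001,Updated title\n"
missing = missing_update_columns(sample)  # the work type column is absent here
```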

scientist-softserv / atla-hyku

Bulkrax Export & Roundtrip issues & questions #130