Closed jillpe closed 1 year ago
identifier: 10524/63469 imports successfully, but identifier: https://congressarchivesdev.lib.wvu.edu/catalog/10524%2F63469 does not
Through initial observation, this field cannot be a uri because the app currently uses it to build the uri for the work's show page. record.id is the same as record.identifier.
For example, identifier: 10524/63469 => https://congressarchivesdev.lib.wvu.edu/catalog/10524%2F63469. if the identifier were http://hdl.handle.net/10524/63469 In theory it would be a double url like https://congressarchivesdev.lib.wvu.edu/catalog/http://hdl.handle.net/10524/63469 which isn't a valid uri.
So that's something to think about. We could probably tell bulkrax to still allow for the uri import but have it programmatically strip off the http://hdl.handle.net/ bit of it. Or change the way the domains are getting formed to not use the identifier property.
Could we use a different property to store this uri instead, knowing that the identifier is special (since it's used to build the work's uri)?
Digging further into it, it's because of this line of code. When a record gets created, this method assigns the id to be the identifier. the record's id would be a uri. It cannot be a uri.
This is not a bug. The git blame points to Tracy, 23 months ago. This behavior is intentional.
Marking this ticket as blocked until we speak with Tracy.
Ref this thread for the slack convo.
Tracy requests an opinion about if they should take a different approach with using identifiers as ids.
Additionally we noticed that so far, all of the sample data's ids start with 10524. With lessons learned from GBH, this will cause speed problems with fedora down the line as the app acquires more data.
cc @jeremyf to review when he's back from connect
@kirkkwang came up with a clever solution in the meantime that will allow the user to import the identifiers as uri. We are stripping out the protocol and domains of the strings so that this app can still save the remaining bits as the record's id. This is similar to what #assign_id already does.
Whether or not using the identifier as an id is a good idea is another outstanding question.
We're also wondering if the identifier pattern could lead to fedora issues down the road, as seen with GBH. All of them will start with the same prefix 10524, based off of the provided sample file. Does this mean fedora will store all the records in the same "box"? (Which would slow down the system as it grows.)
✅ pass
Import successful with URIs.
Rob's recommendation:
Summary
We expected the identifier field to accept URI's as a string, however, WVU is seeing the opposite.
WVU uploaded a test CSV including URI handles in the identifier field and the import failed.
Screenshots
https://congressarchives.slack.com/files/U060N55N3U5/F063BL7PE65/screen_shot_2023-10-26_at_11.46.43_am.png?origin_team=T0493R3NLE9&origin_channel=D0613MNC717Then they re-uploaded the same file, only change was to remove the https://hdl.handle.net/ from the beginning of the identifier and the importer worked.
Import that failed: hcpcsamplemetadata_SEAedit410-25.csv
Accepted Criteria
Notes
https://congressarchivesdev.lib.wvu.edu/importers/16
Testing Instructions
Import SAMPLE FILE using Bulkrax -
hcpcsamplemetadata_SEAedit410-25.csv