scientist-softserv / west-virginia-university

West Virginia University
0 stars 0 forks source link

Bug: URI string causing import to fail #140

Closed jillpe closed 1 year ago

jillpe commented 1 year ago

Summary

We expected the identifier field to accept URI's as a string, however, WVU is seeing the opposite.

WVU uploaded a test CSV including URI handles in the identifier field and the import failed.

Screenshots https://congressarchives.slack.com/files/U060N55N3U5/F063BL7PE65/screen_shot_2023-10-26_at_11.46.43_am.png?origin_team=T0493R3NLE9&origin_channel=D0613MNC717

Then they re-uploaded the same file, only change was to remove the https://hdl.handle.net/ from the beginning of the identifier and the importer worked.

Import that failed: hcpcsamplemetadata_SEAedit410-25.csv

Accepted Criteria

Notes

https://congressarchivesdev.lib.wvu.edu/importers/16

Testing Instructions

Import SAMPLE FILE using Bulkrax -

hcpcsamplemetadata_SEAedit410-25.csv

ShanaLMoore commented 1 year ago

identifier: 10524/63469 imports successfully, but identifier: https://congressarchivesdev.lib.wvu.edu/catalog/10524%2F63469 does not

ShanaLMoore commented 1 year ago

Through initial observation, this field cannot be a uri because the app currently uses it to build the uri for the work's show page. record.id is the same as record.identifier.

For example, identifier: 10524/63469 => https://congressarchivesdev.lib.wvu.edu/catalog/10524%2F63469. if the identifier were http://hdl.handle.net/10524/63469 In theory it would be a double url like https://congressarchivesdev.lib.wvu.edu/catalog/http://hdl.handle.net/10524/63469 which isn't a valid uri.

So that's something to think about. We could probably tell bulkrax to still allow for the uri import but have it programmatically strip off the http://hdl.handle.net/ bit of it. Or change the way the domains are getting formed to not use the identifier property.

Could we use a different property to store this uri instead, knowing that the identifier is special (since it's used to build the work's uri)?

ShanaLMoore commented 1 year ago

Digging further into it, it's because of this line of code. When a record gets created, this method assigns the id to be the identifier. the record's id would be a uri. It cannot be a uri.

This is not a bug. The git blame points to Tracy, 23 months ago. This behavior is intentional.

ShanaLMoore commented 1 year ago

Marking this ticket as blocked until we speak with Tracy.

ShanaLMoore commented 1 year ago

Ref this thread for the slack convo.

Tracy requests an opinion about if they should take a different approach with using identifiers as ids.

Additionally we noticed that so far, all of the sample data's ids start with 10524. With lessons learned from GBH, this will cause speed problems with fedora down the line as the app acquires more data.

cc @jeremyf to review when he's back from connect

ShanaLMoore commented 1 year ago

@kirkkwang came up with a clever solution in the meantime that will allow the user to import the identifiers as uri. We are stripping out the protocol and domains of the strings so that this app can still save the remaining bits as the record's id. This is similar to what #assign_id already does.

Whether or not using the identifier as an id is a good idea is another outstanding question.

We're also wondering if the identifier pattern could lead to fedora issues down the road, as seen with GBH. All of them will start with the same prefix 10524, based off of the provided sample file. Does this mean fedora will store all the records in the same "box"? (Which would slow down the system as it grows.)

kirkkwang commented 1 year ago

✅ pass

Image

mmizalmond commented 1 year ago

Import successful with URIs.

Image

ShanaLMoore commented 12 months ago

Rob's recommendation:

image