psu-libraries / cho

Project for Penn State Library's cultural heritage object repository.
Apache License 2.0

Validate batch import using agents #736

Closed awead closed 5 years ago

awead commented 5 years ago

When batch creating works via CSV, verify that non-existent creators are flagged as invalid during the dry-run.

This is done when:

Child of #87

awead commented 5 years ago

TODO

awead commented 5 years ago

@ntallman this will need to include the role as well, I'm assuming. In the CSV, how should we include all that information? We could do semicolon-separated, ex:

John Lennon, vocals; Paul McCartney, bass; George Harrison, guitar; Ringo Starr, drums

This gets a bit more difficult if you have suffixes in the name:

Frank Sinatra, vocals; Sammy Davis, Jr., vocals

There'll also be a challenge in matching creators. We'd need to determine first and last name from the string, and this can be problematic. The dry-run will help with that, but we may need to experiment with some different strategies.

ntallman commented 5 years ago

@awead I'm consulting on the first part of this. Most best-practice guidance I can find advises against mixing information in one field, which might mean we need to do something like Creator and Creator Role, which could be repeated: Creator 1, Creator Role 1; Creator 2, Creator Role 2. But that's horribly inelegant; hoping to find a better option. We did something like this for adding files to works via CSV in Cincinnati.

Your second problem is an epic all its own and may be out of scope for this ticket. Agent disambiguation has plagued metadata folks forever. I think this is one of the biggest reasons why @ruthtillman is in favor of pulling in Questioning Authority, which I think helps, but I'm not totally sure because I've mostly heard about it being used in the context of forms, not CSV import.

awead commented 5 years ago

@ntallman I talked about this with @cam156 and she suggested using separators, which we currently are using in fields like subject. Ex:

subject1|subject2|subject3

We could employ a second separator, ex:

Doe, John||author|Doe, Jane||illustrator

The double pipe (vertical bar) separates the name from the role, and the single pipe separates each creator/role unit. As far as name disambiguation goes, we can split last name and first name on the comma and CHO is going to match against both those fields. Since we're requiring that the agent resource be present in the system, we should be able to avoid the messy name disambiguation problems from the get-go; however, the challenge is getting a good "interface" in the CSV so that folks can enter names correctly.
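The comma-split matching described above could be sketched like this in Ruby (a hypothetical helper for illustration, not CHO's actual code):

```ruby
# Split a "Last, First" string into the two name parts CHO would
# match against. Hypothetical helper, not CHO's actual implementation.
def split_name(name)
  last, first = name.split(",", 2).map(&:strip)
  { last_name: last, first_name: first }
end

split_name("Doe, John")
# => { last_name: "Doe", first_name: "John" }

# As noted above, suffixed names are the weak spot: "Davis, Jr."
# would be split into last name "Davis" and first name "Jr.".
```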

QA can be used here. It would find the best match to the name. These results can be reported to the user in the dry-run portion of the CSV upload. That way, if the user misspells the name, QA would have a chance to "autocorrect" it, in the sense that it would (ideally) find the closest match and report that result back to the user in the dry-run.

ruthtillman commented 5 years ago

I definitely support the use of a pipe, which should never occur in things like names.

ntallman commented 5 years ago

Would we use the full relator term? (case insensitive?) Or would we use just the code? Doe, John||Author|Doe, Jane||Illustrator or Doe, John||au|Doe, Jane||ill

The metadata wiki page needs to be updated with the answer.

awead commented 5 years ago

I think either ought to work. We can do our best to match and they'd see the result in the dry-run.

awead commented 5 years ago

@ntallman we won't be able to use commas to separate first and last names because commas already separate different fields. We could try using CSV files that have quotes around their fields, such as: "field one","field2"
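For what it's worth, if the batches are generated from Ruby, the standard CSV library can force quotes around every field. A quick sketch (the row contents are just made-up examples):

```ruby
require "csv"

# force_quotes: true wraps every field in double quotes, so embedded
# commas (e.g. "Doe, John") no longer collide with the field separator.
row = CSV.generate(force_quotes: true) do |csv|
  csv << ["Doe, John||author", "A title, with a comma"]
end

puts row
# prints "Doe, John||author","A title, with a comma"
```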

ntallman commented 5 years ago

@awead How about semicolons to delimit agents? That would be easier on the production side.

awead commented 5 years ago

so Doe; John||Author|Doe; Jane||Illustrator or Doe; John||au|Doe; Jane||ill ?

ntallman commented 5 years ago

I suppose this is why some repositories use tab-delimited instead of comma-separated values. For now, let's go with that. I suspect we may run into this issue again though and encounter unexpected commas in metadata.

Would there be any way to have CHO take a supplied CSV, which uses commas, and convert it on-the-fly to enclose fields with quotes before processing? Or does that have to happen in the creation software? We may have to iterate on this for the ideal solution.

awead commented 5 years ago

It would have to happen with the creation software. I don't know what the downsides are with tab-delimited, but we could explore that.

ntallman commented 5 years ago

Let's stick with CSV for now; it's easier to work with than tab-delimited data in spreadsheet form for editing. But let's keep tab-delimited up our sleeves if we hit a wall.

We aren't necessarily going to have a single way of creating batches and CSVs, so that's what worries me about requiring quotes: being able to reliably produce them. I've been thinking about tools to produce CHO batches, and that would help, but it might be out of scope for CHO itself.

awead commented 5 years ago

It seems like the comma and quote problem is unavoidable because you'll have description fields with different kinds of punctuation.

awead commented 5 years ago

OK, I've fixed the quote problem. Turns out the CSVs we were generating for testing were not escaping fields correctly. We can now use commas to separate first and last names. However, we will need to switch the pipe separators: now the double pipe || separates major fields, and the single | any subfields. Ex:

Doe; John|au||Doe; Jane|ill

The problem with splitting on single pipes first is that any doubles would be split as well. So something like Subfield A1||Subfield A2|Field B would result in: "Subfield A1", empty, "Subfield A2", "Field B". If you switch it around like Subfield A1|Subfield A2||Field B, the first pass gives you "Subfield A1|Subfield A2" and "Field B", and then you can split the first of those into its two subfields.
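The split order above can be sketched in Ruby (a toy parser for illustration, not CHO's actual implementation):

```ruby
# Split on the major separator (||) first, then on the subfield
# separator (|); doing it in the other order would shred the doubles.
def parse_creators(value)
  value.split("||").map { |unit| unit.split("|") }
end

parse_creators("Doe; John|au||Doe; Jane|ill")
# => [["Doe; John", "au"], ["Doe; Jane", "ill"]]
```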