Closed awead closed 5 years ago
TODO
@ntallman this will also need to include the role, I'm assuming. In the CSV, how should we include all that information? We could do semicolon-separated, ex:
John Lennon, vocals; Paul McCartney, bass; George Harrison, guitar; Ringo Starr, drums
This gets a bit more difficult if you have suffixes in the name:
Frank Sinatra, vocals; Sammy Davis, Jr., vocals
There'll also be a challenge in matching creators. We'd need to determine first and last name from the string, and this can be problematic. The dry-run will help with that, but we may need to experiment with some different strategies.
@awead I'm consulting on the first part of this. Most best practices I can find advise against mixing information in one field, which might mean we need to do something like Creator and Creator Role, which could be repeated: Creator 1, Creator Role 1; Creator 2, Creator Role 2. But that's horribly inelegant; I'm hoping to find a better option. We did something like this for adding files to works via CSV in Cincinnati.
Your second problem is an epic all its own and may be out of scope for this ticket. Agent disambiguation has plagued metadata folks forever. I think this is one of the biggest reasons why @ruthtillman is in favor of pulling in Questioning Authority, which I think helps, but I'm not totally sure because I've mostly heard about it being used in the context of forms, not CSV import.
@ntallman I talked about this with @cam156 and she suggested using separators, which we currently are using in fields like subject. Ex:
subject1|subject2|subject3
We could employ a second separator, ex:
Doe, John||author|Doe, Jane||illustrator
The double pipe (vertical bar) separates the name from the role, and the single pipe separates each creator/role unit. As far as name disambiguation goes, we can split last name and first name on the comma and CHO is going to match against both those fields. Since we're requiring that the agent resource be present in the system, we should be able to avoid the messy name disambiguation problems from the get-go; however, the challenge is getting a good "interface" in the CSV so that folks can enter names correctly.
QA can be used here. It would find the best match to the name. These results can be reported to the user in the dry-run portion of the CSV upload. That way, if the user misspells the name, QA would have a chance to "autocorrect" it, in the sense that it would (ideally) find the closest match and report that result back to the user in the dry-run.
I definitely support the use of a pipe, which should never occur in things like names.
Would we use the full relator term? (case insensitive?) Or would we use just the code?
Doe, John||Author|Doe, Jane||Illustrator
or Doe, John||au|Doe, Jane||ill
The metadata wiki page needs to be updated with the answer.
I think either ought to work. We can do our best to match and they'd see the result in the dry-run.
@ntallman we won't be able to use commas to separate first and last names because the commas already separate different fields. We could try using csv files that have quotes around their fields such as:
"field one","field2"
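For what it's worth, any standard CSV library already handles embedded commas via quoting, so the creation software shouldn't need anything special. A minimal sketch with Python's csv module (purely illustrative; the field values are hypothetical):

```python
import csv
import io

# Write a row whose first field contains a comma; QUOTE_MINIMAL
# quotes only the fields that need it.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["Doe, John", "author"])

print(buf.getvalue().strip())  # "Doe, John",author

# Reading it back recovers the original fields, comma intact.
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)  # ['Doe, John', 'author']
```

The catch, as noted above, is that this only helps if the tool that generates the CSV quotes correctly in the first place.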
@awead How about semicolons to delimit agents? That would be easier on the production side.
so Doe; John||Author|Doe; Jane||Illustrator or Doe; John||au|Doe; Jane||ill?
I suppose this is why some repositories use tab-delimited instead of comma-separated values. For now, let's go with semicolons. I suspect we may run into this issue again, though, and encounter unexpected commas in metadata.
Would there be any way to have CHO take a supplied CSV, which uses commas, and convert it on-the-fly to enclose fields with quotes before processing? Or does that have to happen in the creation software? We may have to iterate on this for the ideal solution.
It would have to happen with the creation software. I don't know what the downsides are with tab-delimited, but we could explore that.
Let's stick with CSV for now; it's easier to work with than tab-delimited data in spreadsheet form for editing. But let's keep tab-delimited up our sleeves if we hit a wall.
We aren't necessarily going to have a single way of creating batches and CSVs, so that's what worries me about requiring quotes: being able to reliably produce them. I've been thinking about tools to produce CHO batches, and that would help, but it might be out of scope for CHO itself.
It seems like the comma and quote problem is unavoidable because you'll have description fields with different kinds of punctuation.
OK, I've fixed the quote problem. Turns out the CSVs we were generating for testing were not escaping fields correctly. We can now use commas to separate first and last names. However, we will need to switch the pipe separators: now the double pipe || separates major fields, and the single pipe | separates any subfields. Ex:
Doe; John|au||Doe; Jane|ill
The problem with splitting on single pipes first is that any doubles would be separated as well. So something like Subfield A1||Subfield A2|Field B would result in: "Subfield A1", empty, "Subfield A2", "Field B". If you switch it around like Subfield A1|Subfield A2||Field B, the first pass gives you "Subfield A1|Subfield A2" and "Field B", then you can split the first one into its two subfields.
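The two-pass split above can be sketched roughly like this (Python here just for illustration; the function name is hypothetical):

```python
def parse_creators(field: str) -> list[list[str]]:
    """Split on the major separator (||) first, then split each
    resulting unit on the minor separator (|) into its subfields."""
    return [unit.split("|") for unit in field.split("||")]

# Example from the thread: two creators, each with a name and a role code.
print(parse_creators("Doe; John|au||Doe; Jane|ill"))
# [['Doe; John', 'au'], ['Doe; Jane', 'ill']]

# Splitting on the single pipe first would tear the double pipes apart:
print("Subfield A1||Subfield A2|Field B".split("|"))
# ['Subfield A1', '', 'Subfield A2', 'Field B']
```

Splitting on the longer delimiter first is what keeps the single-pipe subfield boundaries intact inside each unit.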
When batch creating works via CSV, verify that non-existent creators are invalidated at the dry-run.
This is done when:
Child of #87