ucsdlib / damspas-rd

A Digital Collections application based on Hyrax
MIT License

Create initial delimited import template #46

Closed mcritchlow closed 6 years ago

mcritchlow commented 7 years ago

Descriptive summary

Create an initial csv import template to be used for initial tool R&D.

Focusing on:

  1. hierarchical components
  2. linked records
  3. CSV vs TSV vs Excel - is there a preferred format that will make life easier for DOMM and Dev?

Deliver:

Dependency

111

Related work

Ticket migrated from https://github.com/ucsdlib/dams5-cc-pilot/issues/39

ghost commented 7 years ago

@mcritchlow funny you say that, we were just talking about using exactMatch when we are very confident.

For targeting ORCID vs other properties in the display, can we use the base URL, e.g., http://orcid.org ?

lsitu commented 7 years ago

@mcritchlow Yes, I am concerned about it too. I think it would be nice if the ORCID had its own predicate so that we can grab it for special handling.

remerjohnson commented 7 years ago

@lsitu @mcritchlow the makers of fabio have an ontology called "scoro": http://www.sparontologies.net/ontologies/scoro

One of the predicates is scoro:hasORCID

We could put our property within Agent as hasORCID or identifier:orcid and point to that predicate?

ghost commented 7 years ago

@lsitu I have some details in the new template to work out, but a draft version for discussion is available at: https://docs.google.com/a/ucsd.edu/spreadsheets/d/1EhMWaDFsBZDhp-BU-5LtOI0tECDx2EtAZ28iZk-Ih58/edit?usp=sharing

Cells in yellow are to be determined, so you can ignore them for now. The header list is vertical so I can show you the definitions and sample values. The final result will have horizontal headers like the old ingest spreadsheet.

The tab "CV values" contains values allowed in controlled value columns. As in the old ingest spreadsheet, the values are labels that should be converted upon ingest. For example:

CC-BY = http://creativecommons.org/licenses/by/4.0/

I have added the matching values for convenience, but you can also find them on the "dams4.2 data dictionary" Google sheet.
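As a rough illustration of the label-to-URI conversion described above, here is a minimal Python sketch. The table name and function name are hypothetical, and only the CC-BY pair comes from this thread; the real table would be built from the "CV values" tab.

```python
# Hypothetical label-to-URI table. Only the CC-BY entry is taken from the
# thread; every other entry would come from the "CV values" tab.
CV_LABEL_TO_URI = {
    "CC-BY": "http://creativecommons.org/licenses/by/4.0/",
}


def resolve_cv_label(label, table=CV_LABEL_TO_URI):
    """Return the URI for a controlled-value label, or raise if it is unknown."""
    key = label.strip()
    try:
        return table[key]
    except KeyError:
        # Surface unknown labels so they can be fixed before ingest.
        raise ValueError(f"Unknown controlled value label: {key!r}") from None
```

Raising on an unknown label, rather than passing the string through, is one possible design choice; it would let a pre-ingest validation report list every cell that needs attention.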

The tab "Formatted key-value rules" lists the values allowed within { } for single-cell multi-property strings. I don't think we need individual validation for each { } column (date, relatedResource). I expect that the keys will be applicable any time we use the { } format.

There will be more agent:role columns added. I have to get these from our other documentation.

Let me know if you have questions. If we have agreed on most things, I can turn this into a .xlsx template for your development testing.

lsitu commented 7 years ago

@GregReser The ingest template looks nice. Thank you very much.

ghost commented 7 years ago

@lsitu

lsitu commented 7 years ago

@GregReser :

lsitu commented 7 years ago

@GregReser @remerjohnson For key/value inside {}, maybe we can just define a separator char, something like a comma or semicolon, that is used less often. If we need to use it inside a "description" text, we can escape the char with \ just like what we are currently doing with pipe ( | ). Once we can identify the key/value pairs, we can just split them into key and value. What do you think?
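The escaping idea in this comment can be sketched as a small Python helper. The function name is hypothetical, and the choice of separator is left to the caller because it had not been decided at this point in the thread.

```python
def split_unescaped(text, sep, escape="\\"):
    """Split text on sep, except where sep is preceded by the escape character."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text.startswith(escape + sep, i):
            # Escaped separator: keep the separator literally, drop the escape.
            current.append(sep)
            i += len(escape) + len(sep)
        elif text.startswith(sep, i):
            # Real separator: close the current part and start a new one.
            parts.append("".join(current))
            current = []
            i += len(sep)
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts
```

For example, `split_unescaped("a;b\\;c;d", ";")` treats the escaped semicolon as part of the second value.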

ghost commented 7 years ago

@lsitu

lsitu commented 7 years ago

@GregReser Just a couple of little questions regarding the two issues above: For CVs, maybe we just need the instructions to build the values for each field and ingest them into Fedora. For example, with "eng", what would you like the URI to be?

For key/value inside {} , don't you expect our program to report any wrong keys? I think we can just make it work anyway. But with the simple key/value cases that you have now, if we can say white space is the separator for multiple key/value pairs, then it'll become much more general, thoughts?

ghost commented 7 years ago

@lsitu if we need a reliable separator that could be used for future cases, then I want to use " | ". Whitespace is too generic and common in values. I know we use " | " to indicate multiple values in other cells, but I think the use inside {} is similar: we are separating multiple key:value pairs.

We could use another delimiter, e.g., "; " within {}, but anything we think of might be contained in another string somewhere in the spreadsheet. At least with " | " we are aware of its use as a separator.

ghost commented 7 years ago

@lsitu For CV values build values

Are there any others that need clarification?

ghost commented 7 years ago

@lsitu For linking of the range ucsd:Agent - lookup label and alternateLabel

Your questions are very good and difficult to answer. Here are my first thoughts:

lsitu commented 7 years ago

@GregReser Thanks for clarifying all these.

For linking of the range ucsd:Agent - lookup label and alternateLabel:

If we look up records by label and alternateLabel, then we need both the label and alternateLabel in the spreadsheet, and we'll allow different combinations of label and alternateLabel for ucsd:Agent records to be created. This means the label won't be unique, which conflicts with your assumption for the last question below.

lsitu commented 7 years ago

@GregReser For key/value pairs in {}, I think we cannot use pipe ( | ) as the separator since we've used it to delimit multiple values, unless we specify that either key/value pairs or multiple values will be allowed in a cell, but not both. It seems like "; " is a better fit than the space. If anyone doesn't like it, let's move forward with the original plan and we'll make it work for now. Thanks.

ghost commented 7 years ago

@lsitu For key/value pairs in {}, you may be right, but if we use "; " as a separator within {} won't that cause a problem when we have "; " as part of a string in other cells? In other words, we would treat "; " as a separator only sometimes.

lsitu commented 7 years ago

@GregReser No, the separator for key/value pairs in {} only affects the values inside the {}. It'll be safe to use it anywhere else in the spreadsheet. For example, if we choose to use ";" as the key/value pairs separator, we can still use it intact outside of the {} without escaping in the current cell or any other cells.

lsitu commented 7 years ago

For example, if we choose "," as the separator, the following cell values are valid:

Jan. 1st, 1925
Jan. 1st, 1925@{begin:1925-01-01, end:1925-08-03}
Jan. 1st, 1925@{begin:1925-01-01, end:1925-08-03} | Jan. 1st, 1925

ghost commented 7 years ago

@lsitu I assume you are thinking that a multi-property cell could contain more than one value? You want to be able to split the multiples first, then parse the key:values inside {}.

Example date:issued
April 8, 1908 @{begin:1908; end:1908} | June 30, 1935 @{begin:1935; end:1935}

OK, let's use "; ". Since we are changing this, how about using key=value for readability? : and ; look very similar. I'm taking inspiration from the Dublin Core formatted text string spec (http://dublincore.org/documents/dcmi-dcsv/), though our use is a little different.

Example: April 8, 1908 @{begin=1908; end=1908}

ghost commented 7 years ago

@lsitu I missed your example above showing " | " when I was writing my comment. I had not planned on allowing multiple values in a {} cell, which is why I didn't see a problem. I guess we should allow multiple values, since that is our general rule. So yes, let's use "; " inside {}

lsitu commented 7 years ago

@GregReser Since we are not following the JSON format for key/value pairs in {}, it sounds good to use ";" as the key/value pair separator and "=" as the key/value delimiter. I think your example looks good: April 8, 1908 @{begin=1908; end=1908}
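The syntax agreed on here (" | " between multiple values, ";" between pairs inside @{}, "=" between key and value) can be sketched in Python as follows. This is a minimal illustration, not the project's actual parser: the function name and return shape are assumptions, and it does not handle escaped separators or validate keys against the "Formatted key-value rules" tab.

```python
import re

# "display text @{key=value; key=value}"; the @{...} block is optional.
CELL_VALUE = re.compile(
    r"^(?P<label>.*?)\s*(?:@\{(?P<props>[^{}]*)\})?\s*$", re.DOTALL
)


def parse_cell(cell):
    """Split a cell on ' | ', then parse each value's optional @{...} block.

    Returns a list of (label, properties) tuples.
    """
    results = []
    for raw in cell.split(" | "):
        match = CELL_VALUE.match(raw.strip())
        label = match.group("label")
        props = {}
        if match.group("props"):
            for pair in match.group("props").split(";"):
                if not pair.strip():
                    continue  # tolerate a trailing ";"
                key, _, value = pair.partition("=")
                props[key.strip()] = value.strip()
        results.append((label, props))
    return results
```

Because the ";" split happens only on the text captured inside @{}, a semicolon or comma in the display text outside the braces is left intact, which matches the point made earlier in the thread.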

ghost commented 7 years ago

@lsitu Great, I'll update the template.

ghost commented 7 years ago

@lsitu Matching records by string - Right, we don't want duplicate ARKs for the same Agent, Place, or Concept. It seems that we want to look up all labels and alternateLabels to make sure there is not a match or possible match. If we only search labels, we might miss a match. The possible matches will require a person to make the final decision.

In the ingest spreadsheet, I don't think we can include both label and alternateLabel, partly because we wouldn't know what the alternates are - we only know what the data provider gave us. If they gave us "Reser, Greg" and there is an ARK with alternateLabel "Reser, Greg" and label "Reser, Gregory" we want the report to say: possible match found with label "Reser, Gregory". These would be flagged for human review. Would it be possible to make something like this work?
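The matching behaviour asked about here could look roughly like the Python sketch below. The `Authority` class, its field names, and the placeholder identifier are all hypothetical; real records would come from the repository, and a real implementation might also want fuzzy matching, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class Authority:
    ark: str  # placeholder identifier, not a real ARK
    label: str
    alternate_labels: list = field(default_factory=list)


def find_matches(name, records):
    """Return (exact, possible) matches for a provider-supplied name.

    exact:    records whose label equals the name.
    possible: records that only match through an alternateLabel and
              should be flagged for human review.
    """
    exact = [r for r in records if r.label == name]
    possible = [
        r for r in records if r.label != name and name in r.alternate_labels
    ]
    return exact, possible


def report_line(name, record):
    """Format the 'possible match' message described in the comment above."""
    return f'{name}: possible match found with label "{record.label}"'
```

With a record labelled "Reser, Gregory" that has alternateLabel "Reser, Greg", looking up "Reser, Greg" yields no exact match and one possible match for review.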

lsitu commented 7 years ago

@GregReser Maybe we can do something with non-work records pre-validation in "Support batch ingest of non-Work records" to identify all the records with the same label? Once an authority record with a label and an alternateLabel is identified to be used, then we have to include both the label and alternateLabel in the ingest spreadsheet. If we only include the label in the ingest spreadsheet, we'll never know which record is used for ingest if several authority records with the same label exist.

ghost commented 7 years ago

@lsitu Perhaps handle all the records with the same label or alternateLabel in the non-work (authority) records pre-validation and only match label in the object ingest. This puts the burden on us to do the proper pre-ingest validation and conflict resolution, but it makes the final ingest easier. Adding alternateLabels to the final ingest spreadsheet would be messy, so it would be better to avoid it.

On final ingest, we should get a report of all new ARKs generated (as we do now) and we should add the Class and label that was created. These could be reviewed manually if anything looks wrong.

lsitu commented 7 years ago

@GregReser Yes, I agree. So in the final ingest, only the label will be used in the spreadsheet, and we will look up and use any authority record that matches the label in the spreadsheet. I think we can extend it to use the @{} syntax to include the known alternateLabel later when it's necessary. Does that sound good?

ghost commented 7 years ago

@lsitu Sounds good!

mcritchlow commented 7 years ago

@GregReser @lsitu - I know we've shifted focus to other things lately, but I wanted to circle back to this effort and see if it would be possible to articulate what remaining gaps we have for this ticket, whether that work can be broken down into individual tickets, etc. Curious what you both think, thanks.

ghost commented 7 years ago

@mcritchlow I think these are the remaining gaps:

@lsitu In the object_import_template, I've been highlighting changes in yellow to make sure you see them. I suggest you remove the yellow when you have implemented the change. This way we will all know what remains to be updated.

Anything else?

lsitu commented 7 years ago

@GregReser It looks like there could be more issues. See my comments in the attached Excel spreadsheet in https://github.com/ucsdlib/horton/issues/69#issuecomment-291249356. And also, we may need instructions for assembling the title column fields into a big title string, right?

ghost commented 7 years ago

@lsitu good question about title. Are you thinking of how we currently assemble title, part name, part number, subtitle in mads:authoritativeLabel? I don't think we are doing this in D5.

lsitu commented 7 years ago

Yes, that's what I thought when I first saw that bunch of title fields. On reflection, I think it's just a UI display issue that we can talk about later. Thanks.

ghost commented 7 years ago

You're right, it will be a messy display issue :)

lsitu commented 7 years ago

@GregReser It looks like we can't use the property name "visibility" (row #134), which is reserved as a keyword in Hyrax. The following error is thrown from Hyrax:

ArgumentError: visibility is a keyword and not an acceptable property name.
/Users/lsitu/.rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/active-fedora-11.0.0.rc7/lib/active_fedora/attributes.rb:127:in `define_active_triple_accessor'
/Users/lsitu/.rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/active-fedora-11.0.0.rc7/lib/active_fedora/attributes.rb:118:in `property'
...

Do you have any thoughts to avoid it?

ghost commented 7 years ago

@remerjohnson @arwenhutt Should we change visibility and visibility expiration to:

rights override
rights override expiration

ghost commented 7 years ago

@lsitu Let's go with:

rights override
rights override expiration
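For spreadsheets that still carry the old column names, the rename agreed on here could be applied with a small mapping. This is only a sketch in Python: the constant and function names are hypothetical, and the two renames are the only ones taken from this thread.

```python
# Renames agreed in this thread. "visibility" is rejected as a property
# name by the stack, so both related headers were renamed.
HEADER_RENAMES = {
    "visibility": "rights override",
    "visibility expiration": "rights override expiration",
}


def rename_headers(headers):
    """Return the header row with old column names replaced by the new ones."""
    return [HEADER_RENAMES.get(h.strip(), h.strip()) for h in headers]
```

Headers that are not in the mapping pass through unchanged, apart from stripping surrounding whitespace.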

lsitu commented 7 years ago

Sure and it's done. Thanks Greg.

gamontoya commented 7 years ago

Note: Waiting on FileUse modeling

gamontoya commented 6 years ago

Defer for Batch/Import/Export Working Group.