ucsdlib / damspas-rd

A Digital Collections application based on Hyrax

Create initial delimited import template #46

Closed mcritchlow closed 6 years ago

mcritchlow commented 7 years ago

Descriptive summary

Create an initial CSV import template to be used for initial tool R&D.

Focusing on:

  1. hierarchical components
  2. linked records
  3. CSV vs TSV vs Excel - is there a preferred format that will make life easier for DOMM and Dev?

Deliver:

Dependency

#111

Related work

Ticket migrated from https://github.com/ucsdlib/dams5-cc-pilot/issues/39

ghost commented 7 years ago

@mcritchlow for "2. linked records", do you mean including URIs for concepts, like names and subjects?

mcritchlow commented 7 years ago

@GregReser yeah, exactly.

ghost commented 7 years ago

Would URIs be required? We would need a pre-ingest tool to generate ARKs for all concepts already in the DAMS; then we would find URIs for the rest?

ghost commented 7 years ago

Maybe the fastest way to complete this initial import tool would be to duplicate the functions of the current tool, with a few improvements. For URIs, we could do something similar to our pre-OLR ingest subject import, but expand it to agents.

mcritchlow commented 7 years ago

So that's a good question, and I think it maybe gets at how the batch import tool fits into the overall DOMM workflow, where you're presumably using something like OpenRefine first. I'm not sure we need a separate tool. I think, similar to what we do now (though @lsitu would know better), we could do something like:

if cellValue is a URI
  try to look up/match an existing URI in the system
  if it exists, link to the existing URI
  if it doesn't exist, create a new local URI
if cellValue is a String
  try to find a URI using a CV list (LCSH, Getty, etc.)
  if one is found, use it
  if not, create a new URI

Another variant on the String path is to just directly create a URI instead of doing a remote lookup. This would be faster, of course. It would largely depend on whether you all had done this ahead of time in OpenRefine or a similar tool. If you had, having our ingest tool do the lookup would be redundant and would unnecessarily slow things down.
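
A rough Ruby sketch of that branching, purely for illustration; the helper methods (find_local_uri, remote_vocab_lookup, create_local_uri) are hypothetical names, not existing ingest code:

# Hypothetical sketch of the cell-value resolution logic above.
def resolve(cell_value)
  if cell_value.start_with?('http://', 'https://')
    # URI path: link to an existing record, or mint a new local URI
    find_local_uri(cell_value) || create_local_uri(cell_value)
  else
    # String path: try controlled vocabularies (LCSH, Getty, etc.) first
    remote_vocab_lookup(cell_value) || create_local_uri(cell_value)
  end
end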

mcritchlow commented 7 years ago

@GregReser - I think that's a good practical approach.

Just noting here that if we do choose to use Excel as the file format instead of CSV/TSV: I checked with Alex, and for a different tool he worked on, the following gems were useful:

lsitu commented 7 years ago

Thanks @mcritchlow for the information. @GregReser I think starting from something similar to our current support sounds good. Just FYI, here is a simple CSV example with just a few fields that I set up for testing the potential CSV batch ingest support in DAMS5: https://github.com/ucsdlib/horton/blob/develop/spec/fixtures/template_csv_import.csv

remerjohnson commented 7 years ago

My current per-collection workflow for OpenRefine reconciliation is: if I have more than about 30 subjects, it is faster to just reconcile those in OpenRefine (otherwise there are manual lookups, recording the URIs, putting them in our Google Sheet post-ingest, etc.). And of course, the more subjects there are, the more dramatically the time savings grow.

arwenhutt commented 7 years ago

Just another vote for sticking to current functionality (by and large) and leaving external reconciliation tools for later.

One of the things my review of the subject/name files is reminding me of is how unreliable automated reconciliation can be. Even when there's an exact or strong string match, it can be an incorrect match, especially for names. I think for now erring on the side of more manual and less automated is prudent.

mcritchlow commented 7 years ago

I agree w/ @arwenhutt, and it sounds like we all do as well.

Given the conversations this morning on Slack, do we want to, for now, also say we'll be sticking with Excel .xlsx as the only supported file type for ingest? At least for the MVP release of the batch import tool?

It sounds like the most directly useful file format to support initially, given the current DOMM workflow.

ghost commented 7 years ago

Yes, for now we should stick with the current functionality for ingest: string match against existing DAMS ARKs. I would advocate for updating the subject import tool soon though so we can increase the speed and accuracy of that process. We will also want to expand it to import Agents.

I'm OK with sticking with .xlsx for now.

mcritchlow commented 7 years ago

@GregReser - updating the existing subject import tool in DAMS4 or are you referring to the need for one in the new system? I'm trying to sort out what to do with that re: new ticket(s)

ghost commented 7 years ago

@mcritchlow Yes, in the new system. I'd like to suggest a couple of improvements that might not require a lot of dev time.

mcritchlow commented 7 years ago

@GregReser OK, so that's probably a separate ticket in this repo then, for re-creating that feature. I'll get that set up.

In the context of this ticket, it sounds like we can proceed assuming the tool will be available to fill the role it currently does in the ingest process?

remerjohnson commented 7 years ago

I find supporting .xlsx only suboptimal, but it seems I'm being outvoted, so we can proceed on that.

ghost commented 7 years ago

@mcritchlow Yes, I can proceed with this ticket.

ghost commented 7 years ago

For Dates, the range is edm:TimeSpan, and we are using three properties: label, begin, end. I think we want to have these correctly associated with each other if there are multiple date predicates.

We could have three columns for each date type: Date:created, Date:createdBegin, Date:createdEnd.

I'm thinking of minimizing the number of columns by placing the three property values in a single cell: label (begin | end). Example: Date:created = "between September 21, 2014 and April 5, 2017 (2014-09-21 | 2017-04-04)"

Any preference from the DEV or DOMM workflow perspective?

lsitu commented 7 years ago

@GregReser Are we still going to use the pipe ( | ) for delimiting multiple values? If we are going to follow the current convention in dams4: we've done something similar for Related Resource, using @ as the in-cell delimiter for multiple properties (description @ uri).

ghost commented 7 years ago

@lsitu I need to confer with others before I say how we will delimit multiple properties in the template. I will get you an answer as soon as possible.

I will say that, because the new template will be created and reviewed by people before ingest, we will want it to be human-readable, like the current ingest template. This means some cell values will be user-friendly labels that are mapped to a URI on ingest or that trigger the creation of additional properties. Some of the columns that will function this way are: File Use, Type of Resource, Language.

There will be new columns with this function, but I need a little time to write those out.

remerjohnson commented 7 years ago

I think the issue is that pipe | is a "global" trigger to make things multiple values in the sheet.

Therefore, we could do your thing above but instead use an @:

between September 21, 2014 and April 5, 2017 @ 2014-09-21 2017-04-04

But of course you could adjust the syntax as needed for convenience/clarity

ghost commented 7 years ago

@lsitu We will place multiple properties in a single cell as formatted text. For now, only a couple of columns will use this, but we should plan for the future in case we use it in more columns. It would be best to use the same cell format so that metadata preparation, review, and code development are easier.

To make the cell text visually easier to understand, it would be good to clearly separate the label from the other properties. It might also help to use name:value pairs for the properties. This might help parsing on your end (you don't have to rely on the position of the value in an array; you can use the key:value pair). One concern is that a label might contain the character(s) used to indicate properties. Here is my proposal:

label @{name:value | name:value}

date:creation between January 1 - August 3, 1925 @{begin:1925-01-01 | end:1925-08-03}

related resource:depiction 1925 Trip - Album @{uri:http://library.ucsd.edu/dc/object/bb35511535}

Questions:

  1. Is "@{" a good way to uniquely indicate that the text afterwards are key value pairs?
  2. I used " | " to separate name:value pairs hoping it will be unique (won't appear in the label)

Do you see any problems from the coding point of view? @arwenhutt @remerjohnson any thoughts on our workflow?

remerjohnson commented 7 years ago

@GregReser I do think, as Longshou and I noted above, that we should keep the pipe as a "global" multi-value indicator, while @ is more of a per-cell rule. Therefore, we'd have to find some other way to delimit the begin/end date values within the value. I'm sure @lsitu might have some ideas on that.

ghost commented 7 years ago

I was thinking the name:value pairs were repeated within the { }, so pipe made sense. That's a pretty loose definition though :) A comma is too common to work. I don't think we want to place the value in quotes, e.g., begin:"1925-01-01", because we don't use quotes anywhere else and it might cause confusion ("When do I have to use quotes?").

Is " , " safe?
Other ideas?

remerjohnson commented 7 years ago

@lsitu can correct me if I'm wrong, but since we are calling out key-value pairs, begin: and end: seem to be fine to indicate a value is bound to those keys. But again I may be wrong there. So:

date:creation
between January 1 - August 3, 1925 @{begin:1925-01-01 end:1925-08-03}

lsitu commented 7 years ago

@remerjohnson @GregReser It does sound good to me for supporting a simple case like that. But if we are going to give the name/value pairs in cell properties after @ more general support, maybe we should weigh a machine-readable format like JSON ({key1:"value1", key2:"value2", ...}), a Ruby Hash ({:key1=>'value1', :key2=>'value2', ...}), etc.

ghost commented 7 years ago

@lsitu I'm concerned that placing the value in quotes might cause confusion because we don't use quotes around values anywhere else. The question would always arise: "When do I have to use quotes?"

remerjohnson commented 7 years ago

I think that the context of it being within the curly brackets {} would allow that to work (i.e., those brackets basically set up JSON syntax rules).

But I agree: if that isn't the case, I wouldn't want to always have to escape double quotes like \"\".

lsitu commented 7 years ago

Yes, I agree. It's just a matter of how we interpret the values within the curly brackets {}. It would be nice to use a pattern that is both human-readable and machine-readable.

ghost commented 7 years ago

Yes, it's inspired by JSON, but made more human-readable and following some of the conventions of our template.

I would like to avoid using quotes because it adds another rule and some clutter for humans. @lsitu would it be possible to parse this, as Ryan suggested? @{begin:1925-01-01 end:1925-08-03}

In this case, the keys are the separators

lsitu commented 7 years ago

@GregReser Yes, it'll work for a case like @{begin:1925-01-01 end:1925-08-03}. But we may need to know the keys in advance if "the keys are the separators". Do you have other use cases for using key/value pairs in the curly brackets {}? And do you have any thoughts about extending this key/value encoding in the future?

ghost commented 7 years ago

I think we should apply this to related resource to be consistent. That will probably be all for right now.

Locus 920, Area F, Area Stratum BR, BEDROCK CELL 8 @{uri:http://library.ucsd.edu/ark:/20775/bb7987903k}

We should have a controlled set of keys, and these should be part of the pre-ingest validation. If an unrecognized key within {} is detected, an error should be reported. I will supply this list of keys with the new template.
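
A minimal Ruby sketch of parsing and validating such a cell, assuming the controlled key list contains the keys used in this thread and that values never contain whitespace (true of all the examples above); parse_cell and KNOWN_KEYS are hypothetical names:

# Parse "label @{key:value key:value}" into [label, {key => value}],
# raising on any key outside the controlled list.
KNOWN_KEYS = %w[begin end uri fileUse].freeze

def parse_cell(cell)
  label, props = cell.split('@{', 2)
  return [cell.strip, {}] unless props            # plain cell, no properties

  pairs = props.chomp('}').split(/\s+/).map do |token|
    key, value = token.split(':', 2)              # split on the first colon only,
                                                  # so URI values keep theirs
    raise ArgumentError, "Unrecognized key: #{key}" unless KNOWN_KEYS.include?(key)
    [key, value]
  end
  [label.strip, pairs.to_h]
end

parse_cell('between January 1 - August 3, 1925 @{begin:1925-01-01 end:1925-08-03}')
# => ["between January 1 - August 3, 1925", {"begin"=>"1925-01-01", "end"=>"1925-08-03"}]

Note this sketch splits pairs on whitespace rather than scanning for the keys themselves; the two are equivalent for the examples here, since none of the values contain spaces.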

mcritchlow commented 7 years ago

Feel free to add that requirement (controlled key list) in the description of this ticket as you see fit @GregReser, I think it would be good to have detailed there.

ghost commented 7 years ago

Last (I hope) question before I pass the first draft of the new template for dev...

Do we have a need for more than 2 files per object/component/sub-component? Should we fold file and file use into a single column so we can repeat it as many times as necessary?

myFile.tif @{fileUse:image-source}

lsitu commented 7 years ago

@GregReser It sounds good to me. What are the rules for identifying all the keys within {}? Moving forward with the fileUse properties, it looks like we may need to map dams4 fileUse properties to PCDM file use (http://pcdm.org/2015/05/12/use) and add additional file use properties for DAMS5. @mcritchlow Do you have any suggestions on this?

mcritchlow commented 7 years ago

No specific suggestions, other than agreeing that if we don't already have a mapping for those properties, we should create one. Probably in the data dictionary? I think @remerjohnson may have started looking into this?

arwenhutt commented 7 years ago

@GregReser Do we have a need for more than two files per component?

My inclination is to leave values in separate columns as much as possible, unless there's a compelling need for a more complex encoding.

remerjohnson commented 7 years ago

@mcritchlow I have, but a complete mapping is not possible now, since we don't just have the idea of "use" in relation to objects; we also have specific rules about what happens to certain things like documents/audio/video. So if a rough mapping of, say, document-service -> pcdmuse:ServiceFile and data-service -> pcdmuse:ServiceFile is fine (note the many cases of overlap and lossiness), I could do that mapping.

A missing piece here is that PCDM treats things we conflate as separate things, and the PCDM group was going to work on "Object types and Profiles", which apparently is still a work in progress.
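
A sketch of what that rough mapping might look like in Ruby; only the two pairs named above come from this thread (pcdmuse is the http://pcdm.org/use# namespace), and the overlap/lossiness caveat applies:

# Rough dams4 fileUse -> PCDM Use mapping; incomplete by design.
FILE_USE_TO_PCDM = {
  'document-service' => 'http://pcdm.org/use#ServiceFile',
  'data-service'     => 'http://pcdm.org/use#ServiceFile'
  # remaining dams4 fileUse values still need mapping decisions
}.freeze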

ghost commented 7 years ago

I've had one collection that wanted more than two files per component, but we ended up zipping the files into one, so it was not a problem.

Agreed - we'll stick with the current separate columns for file and file use.

ghost commented 7 years ago

@lsitu In the dams4.2 data dictionary Google sheet, we had specified a way to format identifiers. Do you recall if we are going to use that, or were there discussions about modifying it? Example: urn:x-public-id:doi:10.1000/182

For the ingest spreadsheet, there will be a column for each identifier type, e.g., identifier:doi and the cell value will be just the identifier value, e.g., 10.1000/182. On ingest, the complete formatted string will be constructed. So my question is about how to specify what the final ingested value should be.

https://docs.google.com/spreadsheets/d/1OjTm1Kuzo-An-THdpUkjwa5puul-b4Lx3SK0lB6rvl8/edit?usp=sharing
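
For illustration, a minimal Ruby sketch of that construction, assuming the urn:x-public-id prefix from the example above (the method name is hypothetical):

# Build the final ingested identifier from the column name and cell value.
def formatted_identifier(column_name, cell_value)
  type = column_name.split(':', 2).last    # "identifier:doi" -> "doi"
  "urn:x-public-id:#{type}:#{cell_value}"
end

formatted_identifier('identifier:doi', '10.1000/182')
# => "urn:x-public-id:doi:10.1000/182"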

lsitu commented 7 years ago

@GregReser I don't recall any further plans to discuss or modify it, so I assume that we are going to use it. And as your example shows, the values in the "URN Example" column should be the format of the final ingested value of the identifiers. @mcritchlow Could you correct me if that's wrong?

mcritchlow commented 7 years ago

@lsitu Yep, I think that's correct as far as the current state of that discussion.

I admittedly wonder if it's ultimately easier to just have separate properties for each identifier type, for dealing with things like indexing, display order, access control, etc. But I can also see the value in just moving forward with what's already documented.

remerjohnson commented 7 years ago

@mcritchlow Yeah, we had a little back-and-forth on that. Listing bare ORCIDs in their own field seems... disjointed unless those could get labels pulled.

mcritchlow commented 7 years ago

@remerjohnson What do you mean by labels pulled? The current spreadsheet seems to be using a static/fixed label for each identifier type. So every ORCID will be displayed as:

ORCID http://orcid.org/0000-0003-3270-1306

Or something. The ORCID label could be whatever we wanted, but it would be the same display label for every ORCID instance, right?

remerjohnson commented 7 years ago

@mcritchlow Yeah, going to think about this more. ORCIDs should obviously be bound to Agents/authorities, so I'm going to think about how that would happen in the template.

mcritchlow commented 7 years ago

So @remerjohnson & I had a quick Slack convo about this. I think ORCIDs are a bit of a special case here compared to other identifier types, and I wonder if they should be bracketed off.

Primarily because ORCIDs are not associated with Objects/Works; they are associated with People/Organizations (Agents). So I think Ryan's point that it's disjointed is correct. It seems odd to model an Object as having three creators and then three ORCIDs. More accurate would seem to be an object that links to three researchers (Agents) who themselves have ORCIDs.

ghost commented 7 years ago

It does seem like we should place ORCID in the Agent record. Can we create a way to display this in the new UI? It could display beside the agent name, as a hover, or as an "ORCID" button. It could also be part of a more comprehensive "info" pop-up that displays other things, like a Wikipedia snippet.

If we could add this to the new UI from the beginning, it would solve a couple of problems:

  1. Clearly associate ORCID with Agent label
  2. Provide the visible benefit of linked data we've been asking for. We would be a shining example of linked data in real-world use.

mcritchlow commented 7 years ago

@GregReser Yeah, I mean I defer to @lsitu, but I think we're talking about indexing and display rules at that point. Similar to how we display components in the object view.

lsitu commented 7 years ago

@GregReser I agree with Matt about taking up the UI issues later, when we deal with indexing and display rules. But are we going to add the ORCID to the Object or the Agent? If the Agent, which field will hold the ORCID?

ghost commented 7 years ago

I just bring up implementing the ORCID display in the UI sooner rather than later so it's not hidden metadata. Kind of planting the idea as a priority.

ORCID will go in the Agent authority record as closeMatch.
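
A hypothetical sketch of that assertion using the rdf and rdf-vocab gems; the ARK and ORCID below are placeholder values borrowed from examples elsewhere in this thread:

require 'rdf'
require 'rdf/vocab'

# Assert skos:closeMatch from an Agent record to an ORCID URI.
agent = RDF::URI.new('http://library.ucsd.edu/ark:/20775/bb7987903k')
orcid = RDF::URI.new('http://orcid.org/0000-0003-3270-1306')
graph = RDF::Graph.new
graph << RDF::Statement.new(agent, RDF::Vocab::SKOS.closeMatch, orcid)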

mcritchlow commented 7 years ago

Is there a predicate/namespace for ORCID we could use instead of closeMatch? If something like that already exists, it might be worth using and would be more explicit. It might also make targeting ORCID vs. other properties in the Agent class a bit easier for specific display/indexing rules. But @lsitu would know better whether that's an actual concern or not.