ucsdlib / dams5-cc-pilot

A repository for doing shared R&D on CurationConcerns for the Development team.
MIT License

File format recommendation for ingest - Excel vs. CSV vs. Tab delimited vs. Google Sheets #34

Closed arwenhutt closed 8 years ago

arwenhutt commented 8 years ago

@ucsdlib/domm will make a recommendation on what format would be best to use for ingest.

There are some initial notes on LiSN https://lisn.ucsd.edu/display/DOMM/DAMS+5+-+Deep+Thoughts+-+Standard+Input+Streams

arwenhutt commented 8 years ago

I'll schedule a meeting for DOMM to get the ball rolling on this.

arwenhutt commented 8 years ago

@lsitu if you have any thoughts on the relative merits of the different approaches (especially Excel vs. tab delimited, I think), that would be very helpful. Thanks!

mcritchlow commented 8 years ago

I'll only note that if we can keep it to a non-proprietary format (CSV/TSV) that would be the most reusable for other interested folks in the community.

lsitu commented 8 years ago

@arwenhutt I think the CSV format looks good; it has more general support for diacritics and special characters. @mcritchlow and I talked about it last week, and we are doing some research on supporting it in CC now. We could support other formats like Excel (Google Sheets?) too if you want; they handle special characters as well, but can easily end up with corrupted diacritics. The tab-delimited format has limitations around special characters like carriage returns (newlines), tabs, etc.
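
For illustration (not from the thread): a minimal Python sketch of that tab-delimited limitation. A naive tab-join breaks as soon as a value contains a tab, while a CSV writer quotes such values; the sample values are made up.

```python
import csv, io

row = ["title with\ttab inside", "note with\nnewline inside"]

# Naive TSV: the embedded tab silently becomes an extra column
naive = "\t".join(row)
print(naive.split("\t"))  # 3 fields instead of 2

# CSV with quoting round-trips the same values intact
buf = io.StringIO()
csv.writer(buf).writerow(row)
buf.seek(0)
print(next(csv.reader(buf)) == row)  # True
```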

remerjohnson commented 8 years ago

I should probably mention that I've gotten good results by converting finished Excel metadata sheets into CSV via csvkit, which converts Excel to CSV with one command (I'd imagine the process would be similar going from Google Sheets).

The only things that didn't seem to come over cleanly were the leading zeros in Unique Object IDs. Surprisingly, newlines come over clean, but as @lsitu says, we'd probably have a lot of things like that to look out for in the new tool.
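
For anyone curious what that kind of conversion does, here is a minimal sketch of the same idea in Python, assuming openpyxl is installed and using a made-up filename; csvkit's actual in2csv implementation differs.

```python
import csv
from openpyxl import load_workbook

def xlsx_to_csv(xlsx_path, csv_path):
    # Read the first worksheet; data_only gives computed values for
    # formula cells, values_only yields plain values rather than Cell objects
    ws = load_workbook(xlsx_path, read_only=True, data_only=True).active
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in ws.iter_rows(values_only=True):
            writer.writerow(["" if v is None else v for v in row])

xlsx_to_csv("metadata.xlsx", "metadata.csv")
```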

arwenhutt commented 8 years ago

How does CSV handle punctuation in values? Do conversion processes from Excel or Sheets automatically escape it?

remerjohnson commented 8 years ago

@arwenhutt It depends on the conversion. In general, I believe it puts quotes around cell values so that a comma or newline inside a value isn't misread as a delimiter. And in csvkit you can specify the delimiter (comma is the default, so for TSV you just specify it with a -t flag).

Sadly, since there is no single CSV standard, I don't really know how it handles punctuation natively. I just know never to use Excel for CSV files: not only does it parse them incorrectly, it actively changes the file when it is opened/saved. LibreOffice (Calc) is the best bet for opening and editing CSV as far as I know.

I should look more into the Google Sheets conversion process to csv; I had only converted it to Excel before.

lsitu commented 8 years ago

@arwenhutt In CSV, a double quote inside a value is escaped by doubling it (""), and any value that includes special characters like newlines or commas is wrapped in double quotes. Yes, conversion processes from Excel or Sheets escape special characters automatically. I have only occasionally seen corrupted diacritics, when they weren't handled correctly during typing or parsing in Excel; writing/encoding diacritics to Excel in different environments can also be an issue.
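
A quick Python demonstration of those two rules, with made-up values:

```python
import csv, io

row = ['Said "hello", then left', "line one\nline two"]

buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())
# "Said ""hello"", then left","line one
# line two"

# Reading it back recovers the original values exactly
buf.seek(0)
assert next(csv.reader(buf)) == row
```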

arwenhutt commented 8 years ago

I think the escaping would need to be automatic and fairly reliable, because it's not possible for us to hand-edit all instances of punctuation in metadata files, or to expect data providers to do so.

It sounds like another consideration is which tools it would allow or disallow us to use, and the implications for workflow. If I'm understanding correctly, @remerjohnson, you're saying that csvkit works well to convert an Excel file to comma-separated, but there are problems going from CSV to Excel (does csvkit work in that direction too?).

I haven't used LibreOffice, but I don't think we can afford not to have Excel as an option, especially for working with complex, messy data sets. But if we can use Excel for creating the master copy, and then convert to CSV for ingest, that seems like it could work. It could make small edits based on validation feedback kind of a pain (edit Excel, convert, re-validate; if there are still problems, edit Excel again, convert again, re-validate, etc.), but that seems a reasonable cost for some of the other advantages.

ghost commented 8 years ago

I've found that going from CSV or TSV to a spreadsheet (Excel, OpenOffice, LibreOffice, Google) can result in character encoding problems. You have to be very careful to set the correct encoding in the text import wizard.

I see the merit of moving to a generic file format like CSV or TSV, but Excel files are just as ubiquitous. OpenOffice, LibreOffice, and Google all have the ability to save to .xlsx. Since we typically wrangle data in Excel, leaving it in that format for ingest would be the path of least resistance (and mean fewer unintended changes). Once converted to CSV or TSV, the data is very hard to proofread; you have to have faith that the parsing is clean.

remerjohnson commented 8 years ago

Well, let me clarify. Getting Excel into CSV is pretty lossless, as is CSV to Excel (there are many ways to accomplish this, but my main method is to clean/wrangle the metadata we get as CSV in OpenRefine first, and it can output to Excel easily to get started on the template wrangling).

@GregReser you have these problems because of how Windows/Excel treats the CSV, not any inherent weakness in CSV itself. Windows/Excel wants to make it a "Microsoft Excel Comma Separated Values File". LibreOffice would present no issues.

I don't believe we'd be able to contribute anything to the community if we stuck with a tool that could only handle Excel. I'll investigate more of the Google Sheets angle for now.

arwenhutt commented 8 years ago

Yeah, it sounds like CSV/TSV to Excel isn't a good option.

Excel to CSV (with a conversion tool) sounds like it might still be possible, but we would want to be pretty confident about character encoding in whatever path we're using.

From the metadata side, we would probably still want to primarily "deal in" Excel with data providers (or Sheets when that works for the project) and only convert to CSV as an ingest copy.

@GregReser 's question is a good one to consider though: what are the gains of using CSV instead of Excel in this scenario? It seems like the main one is using a non-proprietary format, but perhaps there are others. I think it might be worth documenting these issues in a slightly more structured way, so I'm going to see about creating a quick pro/con table in the wiki here (unless someone wants it elsewhere) and seed it with what I can find from our various conversations.

remerjohnson commented 8 years ago

@arwenhutt CSV to Excel can happen. I think we're focusing a bit too much on using only one tool here (Excel), and that isn't necessary. I currently use CSV at different points along the process.

What I mean is I don't think this is a zero sum, all-or-nothing thing where we tally the most "pros".

What csv deposit would get us is the potential to gemify this tool to share with Hydra. That seems like a worthy goal to me but I admit I don't know if that is a dev goal or not.

arwenhutt commented 8 years ago

Agreed. That's what I was trying to get at a couple of comments ago: we could continue to use Excel (or others) as a format in which we work, but then convert to CSV for ingest.

My only concern is that we be able to be fairly confident about the reliability of the conversion from our main working tools to the ingest format, so that we aren't losing character encodings, punctuation, etc. But it sounds like that is something we can probably be confident about.

arwenhutt commented 8 years ago

Okay, not sure if it's useful but I started a stub pro and con list: https://github.com/ucsdlib/dams5-cc-pilot/wiki/Pros-cons-of-ingest-formats

Add if you want/think it would be useful, though honestly it seems from the course of the conversation that we might be moving towards a decision.

Trying to synthesize some of this, let me know if any of these statements aren't true:

  1. It would be valuable to use a non-proprietary format for ingest to improve community wide collaboration, use, etc.
  2. Of the non-proprietary formats, CSV is the best able to support a wide variety of character encodings.
  3. Excel or other formats can still be used for description and wrangling.
  4. Excel can be converted to CSV without data loss* (see the sketch below)

* I don't mean weird excel specific functions, but data - values with their punctuation and character encodings
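
As a spot-check on point 4, one option (not from the thread) is to compare every cell in the workbook against the parsed CSV after converting. A minimal sketch, assuming openpyxl and made-up filenames:

```python
import csv
from openpyxl import load_workbook

def roundtrip_matches(xlsx_path, csv_path):
    """Return True if every cell survived the Excel-to-CSV conversion."""
    ws = load_workbook(xlsx_path, read_only=True, data_only=True).active
    xlsx_rows = list(ws.iter_rows(values_only=True))
    with open(csv_path, newline="", encoding="utf-8") as f:
        csv_rows = list(csv.reader(f))
    if len(xlsx_rows) != len(csv_rows):
        return False
    for xlsx_row, csv_row in zip(xlsx_rows, csv_rows):
        # Compare as strings; empty cells come back from openpyxl as None.
        # Numeric cells may need normalizing first (e.g. 7 vs "7.0").
        if ["" if v is None else str(v) for v in xlsx_row] != csv_row:
            return False
    return True
```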

ghost commented 8 years ago

I'm fine with all of that. Moving to CSV only adds one new step to the ingest process, and if it makes the tool more usable to the Hydra community, then it makes sense.

remerjohnson commented 8 years ago

I tested using my Google Drive template with specific test data, and the resulting CSV export from Sheets was clean. (Again, CSV will only output the "current sheet", and leading zeroes in unique object IDs were gone.)

So Google Sheets could still be a route -- the obvious gains there being easy collaboration and automatic versioning, while retaining all the Excel-like validation we have (or via scripting if we're feeling adventurous). You would then not need csvkit to transform it, just an export from Sheets via File -> Download as -> csv.
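
Since the leading zeroes keep coming up, here is one hypothetical post-export fix-up, assuming the IDs live in a known column and are meant to be a fixed width; the column index, width, and filenames are all made up:

```python
import csv

ID_COLUMN = 0   # hypothetical: position of the Unique Object ID column
ID_WIDTH = 10   # hypothetical: intended fixed width of the IDs

with open("sheets_export.csv", newline="", encoding="utf-8") as src, \
     open("fixed.csv", "w", newline="", encoding="utf-8") as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    writer.writerow(next(reader))  # header row passes through unchanged
    for row in reader:
        # zfill restores zeros stripped when the value was read as a number
        row[ID_COLUMN] = row[ID_COLUMN].zfill(ID_WIDTH)
        writer.writerow(row)
```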

arwenhutt commented 8 years ago

@remerjohnson what do you think? do you feel pretty comfortable that those four points above are true (especially #4)?

remerjohnson commented 8 years ago

@arwenhutt Yes, although point 2 (and especially point 4) comes with a very big "yes, BUT": you have to be careful how you're generating that CSV. If you do all your wrangling in Excel and then at the end simply save as CSV, you're going to have problems because of how Excel encodes that CSV, which I believe is ISO-8859-1, not UTF-8.

Some accented or special characters can then be garbled, since our tool is most likely expecting UTF-8. Also, in general, Excel just doesn't have good support for CSV, and it will behave weirdly. It is not a good situation.

Safe ways of going from Excel to CSV so far are csvkit's in2csv command, or the File -> Download as -> csv option from Google Sheets, both of which seem to force UTF-8 output. And of course OpenRefine.
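
And if you're stuck with a CSV that Excel has already saved, it can be transcoded after the fact. A minimal sketch, assuming the source really is ISO-8859-1 and using made-up filenames:

```python
# Transcode a CSV assumed to be ISO-8859-1 (Excel's output) to UTF-8
with open("excel_export.csv", encoding="iso-8859-1", newline="") as src:
    text = src.read()
with open("utf8_export.csv", "w", encoding="utf-8", newline="") as dst:
    dst.write(text)
```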

arwenhutt commented 8 years ago

@remerjohnson I think it's reasonable to have a specific path or process to generate CSV. But I do think it needs to integrate relatively well into our current work environments. It sounds like Google Sheets and OpenRefine are good there, with the conversion built in.

For Excel, can you look into the implementation requirements for csvkit?

Thanks!

remerjohnson commented 8 years ago

@arwenhutt It's also an option to open your completed Excel sheet in LibreOffice, then "Save as" CSV, setting the correct options like UTF-8 and "all fields quoted". Does that work better within the Excel environment?

(screenshot: amazonia_csv_export_save_as)

arwenhutt commented 8 years ago

I don't know, I haven't used either tool ;)

There just needs to be an option that produces reliable CSV output from Excel. It sounds like we have two options, so that's good.

To clarify a little further: what if your data doesn't contain accents, non-Roman characters, etc.? Is the direct Excel "Save As" option reliable in that case, or are there still issues with punctuation?

remerjohnson commented 8 years ago

> To clarify a little further: what if your data doesn't contain accents, non-Roman characters, etc.? Is the direct Excel "Save As" option reliable in that case, or are there still issues with punctuation?

Right, from a personal data quality / confidence perspective, I would not want to feed the tool data in an encoding it's not expecting, so I would always use something to safely convert it, not Excel.

Might be a little paranoid, but I don't know all the little differences in encodings, nor the code that will ultimately process the data.

ghost commented 8 years ago

Just for the hell of it, I have to ask: have we ever had an encoding problem using Excel files in the current ingest stream? I know we had a problem with date formats and we had to be careful setting those cells to text (and fixing the mess when doing so). How about character encoding? Have we seen any mangled encoding on ingest - the text looked correct in Excel (we got it in there somehow and were happy with it) but then it got messed up on ingest?

remerjohnson commented 8 years ago

@GregReser No, not many, but the problem with filenames containing accented characters did crop up, which made files outright fail to ingest. And to be honest, none of my collections have had particularly exotic characters, so it is hard to know for sure.

The ingest tool was written around Excel, so it's not surprising that Excel input works well. The problems we have with encoding now are usually not the tool's fault: it's the source data we get, which can be hard to catch without some analysis.

ghost commented 8 years ago

It seems like we are looking for a solution to a problem that we don't have - yet.

Some new points to see if we agree on:

  1. Raw CSV data is hard to read and we will want to view it in some other format to analyze and edit it.
  2. The most likely format we will view/edit CSV data in is a table.
  3. OpenRefine, Excel, OpenOffice, and LibreOffice all display tables. OpenRefine is doing a whole lot more with facets and all, but after your edits are done, you see a table view (not 100% confident I know what I'm talking about here).
  4. When you finish editing your table, you feel confident the values are correct, and you expect the same values to be ingested into the DAMS.

remerjohnson commented 8 years ago

@GregReser Not sure what you mean by "a solution to a problem we don't have". The tool needs to be built. Do you mean you're agreeing that you're fine with this tool being made to ingest CSV, not Excel (and that Excel can still be used along the way if we want)? Because that I agree with.

ghost commented 8 years ago

@remerjohnson I mean we've successfully figured out how to ingest .xlsx files but we're putting a lot of effort into figuring out how to ingest .csv files and talking about adding more steps to the workflow.

Honestly, I think .csv files are fine. We'll adapt our workflow to whatever is best for the long term and for the larger community. I was just having a moment where I had to wonder why we are changing something that works. Probably because we have to rebuild it for Hydra, so why not change it for the better. I guess .csv is better, but it isn't easier.

arwenhutt commented 8 years ago

Thanks for all the great discussion, guys.

I think I have enough information to make a DOMM recommendation for CSV as an ingest format.

As far as the Excel to CSV conversion goes, it sounds like there are a number of possible paths. I'll create a ticket to investigate the options, make recommendations for conversion tools/paths, and create documentation.