I'll schedule a meeting for DOMM to get the ball rolling on this.
@lsitu if you have any thoughts on the relative merits of the different approaches (especially excel vs. tab delimited I think), that would be very helpful. Thanks!
I'll only note that if we can keep it to a non-proprietary format (CSV/TSV) that would be the most reusable for other interested folks in the community.
@arwenhutt I think the CSV format looks good; it has more general support for diacritics and special characters. @mcritchlow and I talked about it last week and we are doing some research on supporting it over CC now. We could support other formats like Excel (Google Sheets?) too if you want; those work for special characters as well but can easily end up with corrupted diacritics. The tab-delimited format has limitations around special characters like carriage returns (newlines), tabs, etc.
I should probably mention that I've gotten good results by converting finished Excel metadata sheets into csv via csvkit, which with one command converts Excel to csv (I'd imagine the process would be similar going from Google Sheets).
The only thing that didn't seem to come over cleanly was the leading zeros in Unique Object IDs. Surprisingly, newlines come over clean, but as @lsitu mentioned, we'd probably have a lot of things like that to look out for in the new tool.
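In case it's useful, here is a minimal sketch of how those leading zeros could be padded back after conversion, assuming the IDs are a fixed width; the column name, width, and file names below are placeholders, not anything we've agreed on:

```python
# Restore leading zeros that were stripped when Excel treated IDs as numbers.
# Assumes a fixed-width ID column; the names and width here are hypothetical.
import csv

ID_COLUMN = "Unique Object ID"   # hypothetical header
ID_WIDTH = 10                    # hypothetical fixed width

with open("converted.csv", newline="", encoding="utf-8") as src, \
     open("fixed.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # zfill pads the value back out to the expected width with zeros
        row[ID_COLUMN] = row[ID_COLUMN].zfill(ID_WIDTH)
        writer.writerow(row)
```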
How does csv handle punctuation in values? Do conversion processes from excel or sheets automatically escape them?
@arwenhutt Depends on the conversion. In general I believe it puts quotes around all cell values so that a comma or newline inside a value isn't mistaken for a delimiter. And in csvkit you can specify the delimiter (comma is the default, so for tsv you just specify it with a -t flag).
Sadly since there is no csv standard, I don't really know how it handles punctuation natively. I just know to never use Excel for csv files, as not only does it parse it incorrectly, it actively changes the file when it is opened/saved. LibreOffice (Calc) is the best bet for opening and editing csv as far as I know.
I should look more into the Google Sheets conversion process to csv; I had only converted it to Excel before.
@arwenhutt In CSV, a double quote inside a value is escaped by doubling it, and any value that includes special characters like a newline or comma is wrapped in double quotes. Yes, conversion processes from Excel or Sheets automatically escape special characters. The only problem I've seen is occasionally corrupted diacritics when they weren't handled correctly during typing or parsing with Excel, and it can be an issue when writing/encoding diacritics to Excel in different environments.
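For anyone who wants to see those quoting rules in action, here is a small illustration using Python's csv module; it isn't part of any proposed workflow, just a demonstration of the escaping described above:

```python
# Standard csv quoting: values containing a comma or newline are wrapped in
# double quotes, and an embedded double quote is escaped by doubling it.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default dialect quotes only when necessary
writer.writerow(["Title, with comma", 'He said "hello"', "line one\nline two"])
print(buf.getvalue())
# "Title, with comma","He said ""hello""","line one
# line two"
```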
I think the escaping would need to be automatic and fairly reliable because it's not possible for us to hand edit all instances of punctuation in metadata files, or expect data providers to do so.
It sounds like another consideration is which tools it would allow/disallow us to use, and the implications for workflow. If I'm understanding correctly, @remerjohnson, you're saying that csvkit works well to convert an Excel file to comma separated values, but there are problems going from csv to Excel (does csvkit work that direction too?).
I haven't used LibreOffice, but I don't think we can afford to not have Excel as an option, especially for working with complex, messy data sets. But if we can use Excel for creating the master copy, and then convert to csv for ingest, that seems like it could work. It can make small edits based on validation kind of a pain (edit Excel, convert, re-validate; if there are still problems, edit Excel again, convert again, re-validate, etc.), but it seems a reasonable cost for some of the other advantages.
I've found that going from CSV or TSV to a spreadsheet (Excel, OpenOffice, LibreOffice, Google) can result in character encoding problems. You have to be very careful to set the correct encoding in the text import wizard.
I see the merit of moving to a generic file format, like CSV or TSV, but Excel files are just as ubiquitous. OpenOffice, LibreOffice, and Google have the ability to save to .xlsx. Since we typically wrangle data in Excel, leaving it in that format for ingest would be the path of least resistance (and fewer unintended changes). Once converted to CSV or TSV, the data is very hard to proofread. You have to have faith that the parsing is clean.
Well, let me clarify. Getting Excel into csv is pretty lossless, as is csv to Excel (there's many ways to accomplish this, but my main method is to clean/wrangle the metadata we get in a csv in OpenRefine first, and it can output to Excel easily to get started on the template wrangling).
@GregReser you have these problems because of how Windows/Excel is treating the csv, not any inherent weakness in csv itself. Windows/Excel wants to make it a "Microsoft Excel Comma Separated Values File". LibreOffice would present no issues.
I don't believe we'd be able to contribute anything to the community if we stuck with a tool that could only handle Excel. I'll investigate more of the Google Sheets angle for now.
Yeah, it sounds like CSV/TSV to Excel isn't a good option.
Excel to CSV (with a conversion tool) sounds like it might still be possible, but we would want to be pretty confident about character encoding in whatever path we're using.
From the metadata side, we would probably still want to primarily "deal in" Excel with data providers (or Sheets when that works for the project) and only convert to csv as an ingest copy.
@GregReser's question is a good one to consider though: what are the gains of using csv instead of Excel in this scenario? It seems like the main one is using a non-proprietary format, but perhaps there are others. I think it might be worth documenting these issues in a slightly more structured way, so I'm going to see about creating a quick pro/con table in the wiki here (unless someone wants it elsewhere) and seed it with what I can find from our various conversations.
@arwenhutt csv to Excel can happen. I think we're focusing a bit too much on using only one tool here (Excel), and that isn't necessary. I use csv along the process at different points currently.
What I mean is I don't think this is a zero sum, all-or-nothing thing where we tally the most "pros".
What csv deposit would get us is the potential to gemify this tool to share with Hydra. That seems like a worthy goal to me but I admit I don't know if that is a dev goal or not.
Agreed. That's what I was trying to get at a couple of comments ago, that we could continue to use Excel (or others) as a format in which we work, but then we convert to csv for ingest.
My only concern is that we be able to be fairly confident about the reliability of the conversion from our main working tools to the ingest format so that we aren't losing character encodings, punctuation, etc. But it sounds like that is something we can probably be confident about.
Okay, not sure if it's useful but I started a stub pro and con list: https://github.com/ucsdlib/dams5-cc-pilot/wiki/Pros-cons-of-ingest-formats
Add if you want/think it would be useful, though honestly it seems from the course of the conversation that we might be moving towards a decision.
Trying to synthesize some of this, let me know if any of these statements aren't true:
* I don't mean weird Excel-specific functions, but data - values with their punctuation and character encodings
I'm fine with all of that. Moving to CSV only adds one new step to the ingest process, and if it makes the tool more usable to the Hydra community, then it makes sense.
I tested using my Google Drive template with specific test data, and the resulting csv export from Sheets was clean. (Again, csv will only output the "current sheet", and leading zeroes in unique object IDs were gone).
So Google Sheets could still be a route -- the obvious gains there being easy collaboration and automatic versioning, while retaining all the Excel-like validation we have (or more via scripting if we're feeling adventurous). You would then not need csvkit to transform it, just an export from Sheets via File -> Download as -> csv.
@remerjohnson what do you think? do you feel pretty comfortable that those four points above are true (especially #4)?
@arwenhutt Yes, although point 2 (and especially 4) has a very big "yes, BUT" in that you have to be careful how you're generating that csv. If you do all your wrangling in Excel, then at the end simply save as csv, you're going to have problems because of how Excel encodes that csv, which I believe is ISO-8859-1, not UTF-8.
Some accented or special characters, then, can be garbled since our tool most likely is expecting UTF-8. Also, in general, Excel just doesn't have good support for csv, and it will behave weirdly. It is not a good situation.
Safe ways of going from Excel to csv so far are csvkit's in2csv command, or the File -> Download as -> csv option from Google Sheets, both of which seem to force UTF-8 output. And of course OpenRefine.
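If that conversion ever needs to be scripted (for example as part of a validation pipeline), here is a rough sketch of one way to do it in Python with openpyxl rather than the tools above; this is only an illustration, and the library choice and file names are assumptions, not anything in our current workflow:

```python
# Minimal sketch: convert an .xlsx sheet to a UTF-8 csv.
# openpyxl and the file names are assumptions for illustration only.
import csv
from openpyxl import load_workbook

wb = load_workbook("metadata.xlsx", read_only=True, data_only=True)
ws = wb.active  # like the Sheets export, this only handles the current sheet

with open("metadata.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)  # quote every field
    for row in ws.iter_rows(values_only=True):
        writer.writerow(["" if cell is None else cell for cell in row])
```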
@remerjohnson I think it's reasonable to have a specific path or process to generate csv. But I do think it needs to integrate relatively well into our current work environments. It sounds like Google Sheets and OpenRefine are good on that front, since the conversion is built into the tool.
For Excel, can you look into the implementation requirements for csvkit?
Thanks!
@arwenhutt It's also an option to open your completed Excel sheet in LibreOffice, then "Save as" csv, setting the correct options like UTF-8 and "all fields quoted". Does that work better within the Excel environment?
I don't know, I haven't used either tool ;)
There just needs to be an option which produces reliable csv output from Excel; it sounds like we have two options, so that's good.
To clarify a little further, what if your data doesn't contain accents, non-roman characters, etc.? Is the direct Excel "save as" option reliable in that case, or are there still issues with punctuation?
Right, from a personal data quality / confidence perspective, I would not want to feed the tool data in an encoding it's not expecting, so I would always use something to safely convert it, not Excel.
Might be a little paranoid but I don't know all the little differences in encodings, nor the code that will ultimately process the data.
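A quick check before ingest could catch most of that; here is a minimal sketch (the file name is a placeholder, and this isn't an agreed workflow step) that reads the export as UTF-8 and flags any non-ASCII values for a human to eyeball:

```python
# Sanity check on a generated csv: reading it as UTF-8 will raise a
# UnicodeDecodeError if the file isn't valid UTF-8, and any cells with
# non-ASCII characters are listed so they can get a second look.
import csv

with open("metadata.csv", newline="", encoding="utf-8") as f:
    for row_no, row in enumerate(csv.reader(f), start=1):
        for value in row:
            if any(ord(ch) > 127 for ch in value):
                print(f"row {row_no}: non-ASCII value {value!r}")
```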
Just for the hell of it, I have to ask: have we ever had an encoding problem using Excel files in the current ingest stream? I know we had a problem with date formats and we had to be careful setting those cells to text (and fixing the mess when doing so). How about character encoding? Have we seen any mangled encoding on ingest - the text looked correct in Excel (we got it in there somehow and were happy with it) but then it got messed up on ingest?
@GregReser No, not many, but the problem with filenames containing accented characters did crop up, which made files outright fail to ingest. And to be honest, none of my collections have had particularly exotic characters, so it is hard to know for sure.
The ingest tool was written around Excel, so it's not surprising that Excel input works well. The problems we have with encoding now are usually not the tool's fault: it's the source data we get, which can be hard to catch without some analysis.
It seems like we are looking for a solution to a problem that we don't have - yet.
Some new points to see if we agree on:
@GregReser Not sure what you mean by "solution to problem we don't have". The tool needs to be built. Do you mean you're agreeing that you're fine with this tool being made to ingest csv, not Excel (and that Excel can still be used along the process if we want)? Because that I agree with.
@remerjohnson I mean we've successfully figured out how to ingest .xlsx files but we're putting a lot of effort into figuring out how to ingest .csv files and talking about adding more steps to the workflow.
Honestly, I think .csv files are fine. We'll adapt our workflow to whatever is best for the long term and for the larger community. I was just having a moment where I had to wonder why we are changing something that works. Probably because we have to rebuild it for Hydra, so why not change it for the better. I guess .csv is better, but it isn't easier.
Thanks for all the great discussion guys.
I think I have enough information to make a DOMM recommendation for CSV as an ingest format.
As far as the Excel to CSV conversion, it sounds like there are a number of possible paths. I'll be creating a ticket to look at investigating the options, making recommendations for conversion tools/paths, and creating documentation.
@ucsdlib/domm will make a recommendation on what format would be best to use for ingest.
There are some initial notes on LiSN https://lisn.ucsd.edu/display/DOMM/DAMS+5+-+Deep+Thoughts+-+Standard+Input+Streams