hurlbertlab / dietdatabase

Creative Commons Zero v1.0 Universal

GloBI data review of Avian Diet Database: unable to parse diet data records #141

Closed jhpoelen closed 4 years ago

jhpoelen commented 4 years ago

I just noticed in an automated GloBI data review of the Avian Diet Database that GloBI was unable to parse the diet data on commit https://github.com/hurlbertlab/dietdatabase/commit/ff0ebeaf44cc9af9e9c657013ec3cfc4bb840d98 .

Please let me know if anything on the GloBI side needs to be updated to resolve this integration issue.

ahhurlbert commented 4 years ago

I’ve tried checking over the file and recent edits but don’t see what’s causing this error. Can you provide any guidance?


jhpoelen commented 4 years ago

I see some big changes:

  1. the line endings changed from \n (unix) to \r\n (DOS)
  2. all cell values were double quoted, including the headers (e.g., Scientific_Name ->"Scientific_Name")
  3. content no longer UTF-8 encoded

The second one appears to have broken the integration: in tab-separated value files, double quotes are not treated as quoting characters, so they are read as part of the cell value. Because of this, GloBI no longer recognizes the column headers.

I suspect that someone edited the file in Excel, then exported it without selecting UTF-8 encoding, no string quoting, and unix line endings.
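To illustrate the header problem, here is a minimal Python sketch. It is not GloBI's actual parser; it only assumes a reader that splits on tabs and gives double quotes no special meaning. The sample rows and the `Example_Column` header are made up for the example.

```python
import csv
import io


def read_headers(text: str) -> list[str]:
    """Return the first row of a TSV, treating double quotes as ordinary characters."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # no special handling of quote characters
    )
    return next(reader)


# Hypothetical sample data: unquoted with unix line endings vs. quoted with DOS line endings.
clean = "Scientific_Name\tExample_Column\nAccipiter cooperii\tx\n"
quoted = '"Scientific_Name"\t"Example_Column"\r\n"Accipiter cooperii"\t"x"\r\n'

print(read_headers(clean))
print(read_headers(quoted))

# A lookup by the expected header name only succeeds for the unquoted file.
print("Scientific_Name" in read_headers(clean))
print("Scientific_Name" in read_headers(quoted))
```

In the quoted file the header is the literal string `"Scientific_Name"`, quotes included, so a lookup for `Scientific_Name` fails.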

Note that keeping the tsv file formatting consistent would make it easier to see specific changes made using the tools provided by GitHub. When the formatting changes, GitHub's diff shows all lines in the file as changed.

Curious to hear your thoughts on how to prevent this from happening in the future - these text formatting issues can be tricky for humans to detect, but automated review scripts (like GloBI!) can be created to verify the desired encoding and formatting.
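As one possible shape for such a check, here is a small Python sketch that flags the three changes listed above. It is only an assumption of what a pre-commit or CI check could look like, not part of GloBI; the function name `check_tsv_bytes` is hypothetical.

```python
def check_tsv_bytes(data: bytes) -> list[str]:
    """Return a list of formatting problems found in the raw bytes of a TSV file."""
    problems = []

    # 1. Encoding: the content should decode cleanly as UTF-8.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        problems.append(f"not valid UTF-8 (first bad byte at offset {err.start})")
        # Fall back to a lossless single-byte decoding so the other checks can still run.
        text = data.decode("latin-1")

    # 2. Line endings: unix (\n) only, so no carriage returns anywhere.
    if "\r" in text:
        problems.append("contains carriage returns; expected unix (\\n) line endings")

    # 3. Quoting: header cells should not be wrapped in double quotes.
    header = text.split("\n", 1)[0]
    if '"' in header:
        problems.append("header row contains double quotes")

    return problems
```

It could be run on the file's raw bytes before committing, for example `check_tsv_bytes(open(path, "rb").read())` with `path` pointing at the tsv file; an empty list means none of the three problems were found.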

Hope this helps.

jhpoelen commented 4 years ago

Closing issue, fixed via https://github.com/hurlbertlab/dietdatabase/compare/a08300a8780e...e8b8267d69ed .

ahhurlbert commented 4 years ago

Noting for the sake of this thread your comment elsewhere that you meant to say the line endings should actually be \n and not \r\n.

All of our data entry is actually done in Excel; this problem arose when I wrote the file out in R during some ad hoc cleaning, without realizing what some of the write.table() defaults were for line endings and quoting.

In general I find there are some annoying sources of friction when less experienced users work with tsv files in Excel. I don't generally expect this sort of error from working with tab-delimited .txt files in Excel, but I admit that it makes it harder to see what's wrong when someone makes the mistake above!

jhpoelen commented 4 years ago

Thanks for providing context and for sharing the end of line note.

I am glad we have an automated peer-review process in place, so that we can address the issue as it happens. If you have some suggestions on improving the GloBI review, please let me know.