Closed jhpoelen closed 4 years ago
I’ve tried checking over the file and recent edits but don’t see what’s causing this error. Can you provide any guidance?
From: Jorrit Poelen <notifications@github.com>
Sent: Monday, August 31, 2020 11:36 AM
To: hurlbertlab/dietdatabase <dietdatabase@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Subject: [hurlbertlab/dietdatabase] GloBI data review of Avian Diet Database: unable to parse diet data records (#141)
I just noticed in an automated GloBI data review of Avian Diet Database that GloBI was unable to parse the Avian data diet data on commit ff0ebea (https://github.com/hurlbertlab/dietdatabase/commit/ff0ebeaf44cc9af9e9c657013ec3cfc4bb840d98).
Please let me know if anything on the GloBI side needs to be updated to resolve this integration issue.
I see some big changes:
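1. Line endings changed from \n (unix) to \r\n (DOS).
2. Column headers are now wrapped in double quotes (e.g., Scientific_Name -> "Scientific_Name").

The second one appears to have broken the integration: in tab-separated value files, double quotes are not escape quotes, so a quoted header is read literally, quotes included. Because of this, GloBI does not recognize the column headers anymore.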
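To make the failure mode concrete, here is a minimal sketch in Python. It is not GloBI's actual parser; it only assumes a reader that splits on tabs and treats double quotes as ordinary characters. The two-line sample data and the helper names `header_fields` and `has_column` are hypothetical.

```python
# Hypothetical two-column samples; the real file has many more columns.
unquoted = "Common_Name\tScientific_Name\nBald Eagle\tHaliaeetus leucocephalus\n"
quoted = '"Common_Name"\t"Scientific_Name"\r\n"Bald Eagle"\t"Haliaeetus leucocephalus"\r\n'


def header_fields(text):
    """Split the first line on tabs, keeping double quotes as literal characters."""
    first_line = text.split("\n", 1)[0]
    return first_line.split("\t")


def has_column(text, name):
    """Return True if `name` matches a header field exactly."""
    return name in header_fields(text)


print(has_column(unquoted, "Scientific_Name"))
print(has_column(quoted, "Scientific_Name"))
print(header_fields(quoted))
```

In the quoted sample, the header field keeps its surrounding quotes, and the last field also keeps a trailing carriage return from the DOS line ending, so an exact match on Scientific_Name fails.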
I suspect that someone edited the file in Excel, then exported it without making sure to select "UTF-8" encoding, no string quoting, and unix line endings.
Note that keeping the tsv file formatting consistent would make it easier to see specific changes made using the tools provided by GitHub. When the formatting changes, GitHub's diff shows all lines in the file as changed.
Curious to hear your thoughts on how to prevent this from happening in the future - these text formatting issues can be tricky for humans to detect, but automated review scripts (like GloBI!) can be created to verify the desired encodings.
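As one possible shape for such a check, here is a rough sketch in Python. It is not part of GloBI or of this repository; the function name `check_tsv_bytes` and the wording of the messages are assumptions made for illustration. It looks for the three things mentioned above: UTF-8 encoding, unix line endings, and unquoted column headers.

```python
def check_tsv_bytes(data):
    """Return a list of formatting problems found in raw TSV bytes (empty if none)."""
    problems = []

    # 1. Encoding: the file should decode cleanly as UTF-8.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]

    # 2. Line endings: any carriage return suggests DOS (\r\n) line endings.
    if "\r" in text:
        problems.append("file contains carriage returns (expected unix line endings)")

    # 3. Quoting: header fields wrapped in double quotes will not match by name.
    header = text.split("\n", 1)[0].rstrip("\r")
    quoted = [
        field for field in header.split("\t")
        if len(field) >= 2 and field.startswith('"') and field.endswith('"')
    ]
    if quoted:
        problems.append("quoted column headers: " + ", ".join(quoted))

    return problems
```

A script like this could be run on the data file before each commit, for example by passing it the result of `pathlib.Path(...).read_bytes()` for the tsv file and refusing to proceed when the returned list is not empty.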
Hope this helps.
Closing issue, fixed via https://github.com/hurlbertlab/dietdatabase/compare/a08300a8780e...e8b8267d69ed .
Noting for the sake of this thread your comment elsewhere that you meant to say the line endings should actually be \n and not \r\n.
All of our data entry is done in Excel; this problem actually arose from my writing the file out in R during some ad hoc cleaning, without realizing what some of the defaults for line endings and quotes are in my write.table() statement.
In general I find there are some annoying sources of friction with using tsv files in Excel by less experienced users. I don't generally expect this sort of error due to working with Tab-delimited .txt files in Excel, but admit that it makes it harder to see what's wrong when someone makes the mistake above!
Thanks for providing context and for sharing the end of line note.
I am glad we have an automated peer-review process in place, so that we can address the issue as it happens. If you have some suggestions on improving the GloBI review, please let me know.