p2 / quicklook-csv

A QuickLook plugin for CSV files
Other
289 stars 22 forks source link

QuickLookCSV renders wrong amount of columns under certain circumstances #11

Closed p2 closed 11 years ago

p2 commented 11 years ago

Original author: p....@gmx.at (September 05, 2012 13:54:11)

APPLICATION / ENVIRONMENT: I download connection-log-files from my mobile network operator. Previously the QuickLook preview for that files worked fine.

Recently they changed the format. Previously there was a header of 7 rows before the real data begins. In the most recent version there are 8 preceding lines.

BUG DESCRIPTION: Now the preview of that files renders the wrong amount of columns (2 instead of 13).

WORKAROUND: Delete the 1 additional line in a text editor (line 8), then it is shown correct. But I'd like to keep that line for completeness (the due date of my bill).

FURTHER PERFORMED TESTS & LOGICAL CONCLUSION ATTEMPTS: It you remove the linebreak between line 7 and 8 the bug still occurs. But if you remove a certain amount of characters in that line, it suddenly renders the correct amount of columns! No explanation comes to my mind for that behavior (as I don't know the code behind).

SUBMITTED MATERIAL: I opened the original file in an hex editor and redacted private info and kept only 3 sample lines, as this is enough for the bug demonstration. I created a duplicate of this, and modified it until the preview was correct. Attached are all file states and screenshots of them.

VERSION: I am using Mac OS X 10.6.8 but I am not sure which version of QuickLookCSV I am using, as the Finder info and the web site info contradict each other.

Website: 1.1.1 ( http://code.google.com/p/quicklook-csv/downloads/detail?name=QuickLookCSV.dmg&can=2&q= ) Finder > file info for .qlgenerator bundle > version: 1. DMG SHA1: efef48e1bfba02d780a57fa67b230fae53ae6163 (remote and local copy match) DMG file size: 59.806 bytes .qlgenerator bundle size: 145.101 bytes

Original issue: http://code.google.com/p/quicklook-csv/issues/detail?id=11

p2 commented 11 years ago

From p....@gmx.at on September 05, 2012 14:01:38 I used a hex editor rather than a text editor to ensure that the issue is not related to text/linebreak encoding.

p2 commented 11 years ago

From p....@gmx.at on October 18, 2012 16:11:38 Please react to my bug report.

p2 commented 11 years ago

From p2 on October 23, 2012 23:09:07 Sorry, I don't seem to be receiving emails for bug reports any more!

The problem you describe has to do with how the plugin detects which fields to use to separate the cells. It does so using only the first 200 characters, which in your case is not sufficient. I'll up that limit to make it work for you.

p2 commented 11 years ago

From p2 on October 24, 2012 00:43:34 Fixed in rev 24

p2 commented 11 years ago

From p....@gmx.at on October 24, 2012 08:23:34 AUTODETECT_NUM_FIRST_CHARS = 1000 That are 10 lines with 99 chars. If someone has a longer commentary introduction before the CSV data starts, it will fail again.

Can we increase AUTODETECT_NUM_FIRST_CHARS a bit more, to an amount which is grateful towards introduction text while still remaining performant? If someone quick-looks through multiple files, the delay inbetween should not be too long. What number would that be? 10000? If you assume a low read rate of 3 MB/sec (i.e. USB flash drive), then this would be read in 3 milliseconds (10000 bytes / 3000000 bytes/sec = 0.003 secs), even 30ms (0.3MB/sec read) would be acceptable I think.

p2 commented 11 years ago

From p2 on October 24, 2012 12:42:13 The autodetect is not implemented in a very smart way. What it does is split the first 1000 characters on comma, tab, semicolon and pipe and then choses the separator that gives the most columns. If you have a "real" CSV separated by comma and the values contain semicolons or pipes, then this algorithm will wrongly not use the comma but the other separator. That's the main reason I've limited it to the first 1000 (and 200 before) characters, though this might be a futile attempt.

p2 commented 11 years ago

From p....@gmx.at on October 24, 2012 15:59:50 Another algorithm idea:

Preparation: Read in the first AUTODETECT_NUM_FIRST_CHARS Then split it across the lines. Get a statistic of comma, tab, semicolon, pipe per line.

Analysis: Check if one of these amounts is the same per each line. If 0 candidates match that condition, then choose the candidate which occurs at most within all lines. Not necessarily correct, but likely true. If 1 candidate matches that condition, this must be the separator. Only very few exceptions imaginable. If 1+ candidates match that condition, take the one which occurs first in the first line. Not necessarily true, but again likely, as the first cell's probability to contain a separator-class-character as a literal is not that high.

p2 commented 11 years ago

From p....@gmx.at on November 06, 2012 15:37:05 What do you think about my alternative detection proposal?

p2 commented 11 years ago

From p2 on November 09, 2012 01:23:31 Because this is merely a sideproject I'm not going to invest more time any time soon. Your approach has one problem, which is the "split by lines" part. CSV can have newline data, protected by double quotes. Thus it would need to be correctly parsed, once for every possible separator, before this can be decided. That's why I limit it to a few chars at the start; it could definitely be better, but it's good enough.

p2 commented 11 years ago

From p....@gmx.at on November 09, 2012 07:37:51 Ok. Maybe sometime in the future, if you are in the mood for it. Thanks for what you offered so far!