saulpw / visidata

A terminal spreadsheet multitool for discovering and arranging data
http://visidata.org
GNU General Public License v3.0

TSV is not just a tab-delimited CSV #2038

Closed · reagle closed this issue 1 year ago

reagle commented 1 year ago

I'm working with four large TSVs in VisiData and two of them end up causing parse errors in pandas/polars.

I first saw evidence of this by simply loading the files into VisiData and then immediately saving them: for two of the files the md5sum results match, and for the other two they differ.

I isolated the issue in the attached 'anon.tsv', wherein there's an odd LF/CR. VisiData reads the file as having a header and 4 data rows, pandas/polars/excel see a header and 3 data rows. Once VisiData writes its understanding of the file, nothing else can read it.

anon.csv (This is named '.csv' because GitHub doesn't permit .tsv uploads; more confusion!)
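The disagreement over the row count can be reproduced with a small stand-in for the attached file (a hypothetical reconstruction, not the actual anon.tsv contents): a quoted field containing an embedded newline looks like one record under CSV rules but two lines under plain line-splitting.

```python
import csv
import io

# Hypothetical reconstruction of the anon.tsv situation: one field on the
# last data row contains a quoted, embedded newline.
data = 'a\tb\n1\t2\n3\t4\n5\t"first part\nsecond part"\n'

# A plain line-splitter (strict TSV semantics) sees a header plus 4 data rows.
print(len(data.splitlines()) - 1)  # → 4

# A CSV reader with a tab delimiter honors the quoting and sees 3 data rows.
rows = list(csv.reader(io.StringIO(data), delimiter='\t'))
print(len(rows) - 1)  # → 3
```

That mismatch (4 rows vs. 3) is exactly what VisiData vs. pandas/polars/Excel report for this file.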

On Discord, @ajkerrigan suggested that I load the file with the -f pandas option and save it "safely," and that does indeed seem to work: vd -f pandas --safety-first anon.tsv. They wrote that "the safety_first option should replace embedded newlines/tabs (with characters specified by the tsv_safe_newline and tsv_safe_tab options) during a save."

  1. Is this a bug that can be fixed in VisiData?
  2. If I were to load the file in and save it (without safety or pandas), I assume the brokenness is baked in and can't be recovered from?
  3. Is this something VisiData could warn the user about?
saulpw commented 1 year ago

The TSV format is really simple and doesn't allow quoting, so newlines and tabs aren't allowed inside field values. I'm guessing your files are actually CSV files with TAB as the field delimiter, so try loading them with:

vd -f csv --csv-delimiter $'\t' <filename>

If you then save as a .csv file, then I'd think the resulting file should be equivalent.

So, 1) I don't think this is a VisiData bug, but a misunderstanding of the file format you have (which is fairly common in TSV/CSV land), and 3) VisiData can't really provide a warning, without doing a lot more (slow) checking of every value (which maybe we could add to the --safety-first option--but that's a fair amount of work). Hopefully the above helps you avoid the brokenness you're asking about with 2).
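The suggested load-as-CSV-then-save-as-CSV round trip can be sketched with Python's csv module (a toy illustration of the idea, assuming the file really is a tab-delimited CSV):

```python
import csv
import io

# Stand-in for a tab-delimited CSV with a quoted, embedded newline.
tab_csv = 'name\tnote\nalice\t"line one\nline two"\n'

# Load with CSV semantics and a tab delimiter (the vd -f csv equivalent).
with io.StringIO(tab_csv) as f:
    rows = list(csv.reader(f, delimiter='\t'))

# Save as a regular comma-separated CSV; quoting re-protects the newline.
out = io.StringIO()
csv.writer(out, lineterminator='\n').writerows(rows)
print(out.getvalue())  # → name,note\nalice,"line one\nline two"\n
```

The embedded newline survives the round trip because CSV quoting carries it through, which is what makes the resulting file "equivalent" in a way a strict TSV save cannot be.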

reagle commented 1 year ago

Testing -f pandas --safety-first on the full 1.8GB TSV file, I do see that writing is slower, and that the result still cannot be parsed by pandas.

native (1.8GB TSV file):
    load: 17s (first rows immediate)
    save: 95s
-f pandas:
    load: 18s (first rows immediate)
    save: 96s
-f pandas --safety-first:
    load: 19s (first rows immediate)
    save: 140s
File ~/.pyenv/versions/3.11.3/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py:234, in CParserWrapper.read(self, nrows)
    232 try:
    233     if self.low_memory:
--> 234         chunks = self._reader.read_low_memory(nrows)
    235         # destructive to chunks
    236         data = _concatenate_chunks(chunks)

File parsers.pyx:843, in pandas._libs.parsers.TextReader.read_low_memory()
File parsers.pyx:904, in pandas._libs.parsers.TextReader._read_rows()
File parsers.pyx:879, in pandas._libs.parsers.TextReader._tokenize_rows()
File parsers.pyx:890, in pandas._libs.parsers.TextReader._check_tokenize_status()
File parsers.pyx:2058, in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 10 fields in line 5690653, saw 16
reagle commented 1 year ago

@saulpw, thanks for the response! There does seem to be only a single errant newline (in a 1.8GB file), so it's probably meant to be a TSV, but is not because of that single line. And because of that error, I suppose it is more of a CSV file with a tab delimiter -- after all, I can load it with pandas via pd.read_csv(filename, sep='\t').

That leads me to wonder, then, if you have a TSV parser, why? Why not treat it as a tab-delimited CSV file -- which I think is what pandas is doing? (And how does a dummy like me best avoid stepping in this cow pie?)

saulpw commented 1 year ago

That leads me to wonder, then, if you have a TSV parser, why?

TSV is not a tab-delimited CSV file. CSV is ambiguously specified, with all the options for quoting and escaping and whatnot. TSV (or any ASV/USV) is super-simple: fields are delimited with one ASCII/Unicode character, and rows are delimited with another. The only constraint is that the fields themselves can't contain either of the delimiters. Otherwise, any value is possible and will be parsed verbatim. This means that loading/parsing can be really fast, since you can just scan for the delimiters and copy the results as a pure memory chunk, without having to interpret anything.
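The scan-for-delimiters idea described above can be sketched in a few lines (a toy illustration, not VisiData's actual loader): no quoting, no escaping, every byte of a field taken verbatim.

```python
# Minimal TSV parsing in the spirit of any ASV/USV format: split rows on
# one delimiter, fields on another, and interpret nothing else.
def parse_tsv(text, field_sep='\t', row_sep='\n'):
    return [line.split(field_sep)
            for line in text.rstrip(row_sep).split(row_sep)]

print(parse_tsv('a\tb\n1\t2\n'))  # → [['a', 'b'], ['1', '2']]
```

Because there is no quoting state to track, a parser like this never has to look back or reinterpret anything, which is where the speed comes from.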

If you want to treat it as a tab-delimited CSV, you can. The Pandas loader may be making a more intelligent guess at the format.

At the beginning of Richard Feynman's Lectures on Physics, week 17 on Spacetime (around 3 minutes in), he says:

I wish to apologize for something that is not my responsibility, but is the result of physicists all over the world, of scientists (so-called) [who] have been measuring things in different units and cause an enormous amount of complexity. So as a matter of fact nearly a third of what you have to learn consists of different ways of measuring the same thing and I apologize for it.

I've been thinking about this for years, and I have to say, the same thing is true for data: nearly a third of what you have to deal with in data is formats, parsing and saving and converting and guessing dialects and handling errors and so-forth, and I apologize for it.

reagle commented 1 year ago

Well said! I nominate this for the best comment of the year -- to whom, and for what prize, I do not know. 🙂

Returning to the pragmatic, how can a user check if a TSV file is really a TSV file? Or, is there even a way to write a valid TSV file from pandas? I'm thinking not, even though all the documentation says to write a tab-delimited CSV.

saulpw commented 1 year ago

So, the trick with writing a valid TSV file is that "the fields themselves can't contain either of the delimiters". If you can guarantee that no field contains a delimiter, then writing out a TSV file is trivial and quite fast. If you can't guarantee it, then you have to scan each and every field for the delimiters and replace them with a substitute string--this is what the safety_first option in VisiData does, and it's dog slow. And when the resulting file is loaded back in, the data won't be exactly the same, because those substitutions won't be parsed back into the delimiters.
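The scan-and-substitute idea can be sketched like so (a minimal illustration of the approach; the substitute strings here are placeholders, not VisiData's actual tsv_safe_newline/tsv_safe_tab defaults):

```python
# Replace embedded delimiters in each field before writing; this is the
# safety-first idea in miniature. Substitute strings are illustrative only.
def tsv_safe(value, safe_newline='\\n', safe_tab='\\t'):
    return str(value).replace('\n', safe_newline).replace('\t', safe_tab)

def write_tsv(rows):
    return ''.join('\t'.join(tsv_safe(v) for v in row) + '\n' for row in rows)

print(write_tsv([['a', 'b'], ['has\ttab', 'has\nnewline']]))
# Every embedded delimiter is now a two-character escape, so the file is a
# valid TSV -- but reloading it yields the escapes, not the original bytes.
```

Note the per-field string replacement on every value, which is exactly the extra pass that makes a safe save so much slower than a plain one.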

That's why Pandas says to write a tab-delimited CSV, so that it has the ability to quote the separators; although if you're writing a CSV anyway, I don't see any benefit to it being tab-delimited, as actual comma-separated CSVs (for all their warts) are better supported by the most common applications like Excel, and having a tab-delimited CSV would be confusing (as you are seeing).
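The quoting behavior described here is visible in a small pandas example (a sketch; it assumes pandas' default minimal-quoting behavior when saving):

```python
import io
import pandas as pd

# A field with an embedded newline, written as a "tab-delimited CSV".
df = pd.DataFrame({'a': ['one\ntwo'], 'b': ['plain']})
out = io.StringIO()
df.to_csv(out, sep='\t', index=False)
print(out.getvalue())
# The one\ntwo field comes back quoted ("one\ntwo"), per CSV rules --
# so the output is a tab-delimited CSV, not a strict TSV.
```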

As for how to check whether it's a true TSV versus a tab-delimited CSV, I would look for "\t (quote-tab) strings; if fields are quoted, then it's almost certainly a CSV with alternate delimiters. In particular, newlines are much more common than tab characters to be embedded within fields, so a tell-tale sign that you have a tab-delimited CSV is a line that matches the regex ^[^"]+"\t--that is, at the beginning of the line, a sequence of non-quote characters, followed by a quote and then a tab delimiter. This would mean that a newline occurred within a field and had to be quoted. (Alternatively, if it's a proper TSV file, this would be a row in which the first column ends in a double-quote character but has no other double-quotes within it, which seems much less likely.)
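The quote-then-tab heuristic above translates directly into a regex check (a sketch of the detection idea, using made-up sample lines):

```python
import re

# A line that starts with non-quote characters, then a double quote
# immediately followed by a tab, suggests a quoted field that was split
# across lines -- i.e. a tab-delimited CSV, not a true TSV.
QUOTED_FIELD_END = re.compile(r'^[^"]+"\t')

print(bool(QUOTED_FIELD_END.search('second half of a field"\tnext')))  # → True
print(bool(QUOTED_FIELD_END.search('plain\tfields\there')))            # → False
```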

reagle commented 1 year ago

Thanks @saulpw. I find myself wishing there were something like chardet (for encodings) or a TSV linter. There are CSV linters, but those don't help here, from what I can see.
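A rough TSV "linter" of the kind wished for above could be as simple as a field-count check (a hypothetical sketch, not an existing tool), since a mismatched count is precisely the symptom an embedded newline produces:

```python
# Flag rows whose tab-separated field count differs from the header's.
def lint_tsv(lines):
    problems = []
    expected = None
    for lineno, line in enumerate(lines, 1):
        n = len(line.rstrip('\n').split('\t'))
        if expected is None:
            expected = n          # take the header's count as the standard
        elif n != expected:
            problems.append((lineno, n, expected))
    return problems

print(lint_tsv(['a\tb\tc\n', '1\t2\t3\n', '4\t5\n']))  # → [(3, 2, 3)]
```

On the 1.8GB file above, a check like this would have pointed straight at line 5690653, the row where pandas reported 16 fields instead of 10.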

aborruso commented 1 year ago

I isolated the issue in the attached 'anon.tsv', wherein there's an odd LF/CR.

Hi @reagle. Where is the odd LF/CR in this file?

[screenshot of the anon file's rows]

reagle commented 1 year ago

@aborruso line 4 is actually the last portion of line 3.

reagle commented 1 year ago

pandas and polars somehow know there are only 3 lines of data; VisiData acts like a text editor and shows an extra line.