The TSV format is really simple and doesn't allow quoting, so newlines and tabs aren't allowed inside field values. I'm guessing your files are actually CSV files with TAB as the field delimiter, so try loading them with:
```
vd -f csv --csv-delimiter $'\t' <filename>
```
If you then save it as a .csv file, I'd think the resulting file should be equivalent.
So, 1) I don't think this is a VisiData bug, but a misunderstanding of the file format you have (which is fairly common in TSV/CSV land), and 3) VisiData can't really provide a warning without doing a lot more (slow) checking of every value (which maybe we could add to the `--safety-first` option, but that's a fair amount of work). Hopefully the above helps you avoid the brokenness you're asking about with 2).
Testing `-f pandas --safety-first` on the full 1.8GB TSV file, I do see that it's slower at writing and that the result still cannot be parsed by pandas.
```
native (1.8GB TSV file):
  load: 17s (first rows immediate)
  save: 95s

-f pandas:
  load: 18s (first rows immediate)
  save: 96s

-f pandas --safety-first:
  load: 19s (first rows immediate)
  save: 140s
```
```
File ~/.pyenv/versions/3.11.3/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py:234, in CParserWrapper.read(self, nrows)
    232 try:
    233     if self.low_memory:
--> 234         chunks = self._reader.read_low_memory(nrows)
    235         # destructive to chunks
    236         data = _concatenate_chunks(chunks)

File parsers.pyx:843, in pandas._libs.parsers.TextReader.read_low_memory()
File parsers.pyx:904, in pandas._libs.parsers.TextReader._read_rows()
File parsers.pyx:879, in pandas._libs.parsers.TextReader._tokenize_rows()
File parsers.pyx:890, in pandas._libs.parsers.TextReader._check_tokenize_status()
File parsers.pyx:2058, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 10 fields in line 5690653, saw 16
```
@saulpw, thanks for the response! There does seem to be only a single errant newline (in a 1.8GB file), so it's probably meant to be a TSV, but isn't one because of that single line. And because of that error, I suppose it is more of a CSV file with a tab delimiter -- after all, I can load it with pandas via `pd.read_csv(filename, sep='\t')`.
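For what it's worth, the distinction is visible directly in pandas: the reader honors CSV-style quoting by default, and turning quoting off gives the strict one-physical-line-per-record behavior. A small sketch, with a hypothetical filename:

```python
import csv

import pandas as pd

# Default dialect: CSV-style quoting is honored, so a quoted field with an
# embedded newline is folded back into a single logical row.
df = pd.read_csv("data.tsv", sep="\t")

# quoting=csv.QUOTE_NONE treats quote characters as ordinary data -- the
# strict one-line-per-record TSV model -- so the same file will mis-parse
# (or raise a ParserError) at the errant line.
df_strict = pd.read_csv("data.tsv", sep="\t", quoting=csv.QUOTE_NONE)
```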
That leads me to wonder, then, if you have a TSV parser, why? Why not treat it as a tab-delimited CSV file -- which I think is what pandas is doing? (And how does a dummy like me best avoid stepping in this cow pie?)
> That leads me to wonder, then, if you have a TSV parser, why?
TSV is not a tab-delimited CSV file. CSV is ambiguously specified, with all the options for quoting and escaping and whatnot. TSV (or any ASV/USV) is super-simple: fields are delimited with one ASCII/Unicode character, and rows are delimited with another. The only constraint is that the fields themselves can't contain either of the delimiters. Otherwise, any value is possible and will be parsed verbatim. This means that loading/parsing can be really fast, since you can just scan for the delimiters and copy the results as a pure memory chunk, without having to interpret anything.
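To make the speed argument concrete, here's a minimal sketch of that strict model (not VisiData's actual loader): parsing is nothing more than splitting on the two delimiters.

```python
# Strict TSV: one physical line per record, one tab per field boundary.
# Field values are taken verbatim; there is nothing to unquote or unescape.
def read_tsv(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split("\t")
```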
If you want to treat it as a tab-delimited CSV, you can. The Pandas loader may be making a more intelligent guess at the format.
At the beginning of Richard Feynman's Lectures on Physics, week 17 on Spacetime (around 3 minutes in), he says:
> I wish to apologize for something that is not my responsibility, but is the result of physicists all over the world, of scientists (so-called) [who] have been measuring things in different units and cause an enormous amount of complexity. So as a matter of fact nearly a third of what you have to learn consists of different ways of measuring the same thing, and I apologize for it.
I've been thinking about this for years, and I have to say, the same thing is true for data: nearly a third of what you have to deal with in data is formats, parsing and saving and converting and guessing dialects and handling errors and so-forth, and I apologize for it.
Well said! I nominate this for the best comment of the year -- to whom, and for what prize, I do not know. 🙂
Returning to the pragmatic: how can a user check if a TSV file is really a TSV file? Or, is there even a way to write a valid TSV file from pandas? I'm thinking not, even though all the documentation says to write a tab-delimited CSV.
So, the trick with writing a valid TSV file is that "the fields themselves can't contain either of the delimiters". If you can guarantee this, then writing out a TSV file is trivial and quite fast. If you can't guarantee it, then you have to scan each and every field for the delimiters and replace them with a substitute string -- this is what the `safety_first` option in VisiData does, and it's dog slow. And when the resulting file is loaded back in, the data won't be exactly the same, because those substitutions won't be parsed back into the delimiters.
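A rough sketch of that substitution idea (modeled on the description above, not VisiData's actual code; the substitute strings here are stand-ins for the real `tsv_safe_tab`/`tsv_safe_newline` values):

```python
def write_safe_tsv(rows, path, safe_tab="\\t", safe_newline="\\n"):
    # Scanning every field for the delimiters is what makes this slow;
    # substituting them is what makes the round-trip lossy.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            fields = (str(v).replace("\t", safe_tab).replace("\n", safe_newline)
                      for v in row)
            f.write("\t".join(fields) + "\n")
```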
That's why Pandas says to write a tab-delimited CSV, so that it has the ability to quote the separators; although if you're writing a CSV anyway, I don't see any benefit to it being tab-delimited, as actual comma-separated CSVs (for all their warts) are better supported by the most common applications like Excel, and having a tab-delimited CSV would be confusing (as you are seeing).
As for how to check whether it's a true TSV versus a tab-delimited CSV, I would look for `"\t` (quote-tab) strings; if fields are quoted, then it's almost certainly a CSV with alternate delimiters. In particular, newlines are much more common than tabs as embedded characters within fields, so a tell-tale sign that you have a tab-delimited CSV is a line that matches the regex `^[^"]+"\t` -- that is, at the beginning of the line, a sequence of non-quote characters, followed by a quote and then a tab delimiter. This would mean that a newline occurred within a field and had to be quoted. (Alternatively, if it's a proper TSV file, this would be a row in which the first column ends in a double-quote character but has no other double-quotes within it, which seems much more unlikely.)
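That heuristic is easy to automate; a quick sketch (the function name is my own):

```python
import re

# The tell-tale pattern described above: non-quote characters from the start
# of the line, then a double-quote immediately followed by a tab. On a true
# TSV this is unlikely; on a tab-delimited CSV it usually marks a quoted
# field that began with an embedded newline on the previous physical line.
QUOTE_TAB = re.compile(r'^[^"]+"\t')

def first_suspicious_line(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if QUOTE_TAB.match(line):
                return lineno  # first physical line that looks CSV-quoted
    return None
```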
Thanks @saulpw. I find myself wishing there were something like chardet (for encodings) or a TSV linter. There are CSV linters, but those don't help, from what I can see.
I isolated the issue in the attached 'anon.tsv', wherein there's an odd LF/CR.
Hi @reagle. Where is the odd LF/CR in this file?
@aborruso line 4 is actually the last portion of line 3.
pandas and polars somehow know there are only 3 lines of data; visidata acts like a text editor and shows an extra line.
I'm working with four large TSVs in VisiData and two of them end up causing parse errors in pandas/polars.
I could first see evidence of this by simply loading the files into VisiData and then immediately saving. For two of the files, the `md5sum` results are the same; for the other two, they differ. I isolated the issue in the attached 'anon.tsv', wherein there's an odd LF/CR. VisiData reads the file as having a header and 4 data rows; pandas/polars/Excel see a header and 3 data rows. Once VisiData writes its understanding of the file, nothing else can read it.
anon.csv (This is named '.csv' because GitHub doesn't permit .tsv uploads -- more confusion!)
On Discord, @ajkerrigan suggested that I load the file with the `-f pandas` option and save it "safely," and that does indeed seem to work: `vd -f pandas --safety-first anon.tsv`. They wrote that "the safety_first option should replace embedded newlines/tabs (with characters specified by the tsv_safe_newline and tsv_safe_tab options) during a save."
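Circling back to the earlier question about writing a valid TSV from pandas: one approach (a sketch, not an official recipe) is to do the safety_first-style substitution yourself and then disable quoting, so pandas emits every field verbatim -- a strict TSV rather than a tab-delimited CSV:

```python
import csv

import pandas as pd

df = pd.read_csv("anon.tsv", sep="\t")

# Substitute embedded delimiters with visible escape sequences, mirroring
# the safety_first idea. Only string-like values are affected.
clean = df.replace({"\t": r"\\t", "\n": r"\\n"}, regex=True)

# QUOTE_NONE disables quoting entirely; the csv writer would raise an error
# if a delimiter slipped through, but the substitution above removed them.
clean.to_csv("out.tsv", sep="\t", index=False, quoting=csv.QUOTE_NONE)
```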