nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 128 forks source link

Pick a meaning of TSV and use it consistently #1566

Open tsibley opened 1 month ago

tsibley commented 1 month ago

We should pick between either IANA TSVs or RFC 4180 CSV-like TSVs and then be consistent about that choice everywhere. Problems arise when mixing the two, e.g. expecting one but getting the other.

IANA TSVs prohibit tabs and newlines in field values. Parsing is much simpler because of this, but generators need to ensure tabs and newlines are stripped/replaced or cause an error, lest they risk generating corrupt output. The programs in tsv-utils work with IANA TSVs.

CSV-like TSVs can represent all values without limitation. Parsing is more complicated because of quoting and escaping (but libraries exist). The programs in csvtk work with CSV-like TSVs. The Python csv module works with CSV-like TSVs by default, but can be configured to work with IANA-like TSVs instead. (IANA-like, because it will error on some outputs despite them having valid IANA TSVs forms. Some inputs may also misbehave, despite being valid.)

Augur mostly uses CSV-like TSV with the csv module configured in QUOTE_MINIMAL quoting mode (the default). The exception is augur curate commands which use IANA-like TSV with the csv module configured in QUOTE_NONE mode, meaning it will error if field values contain a tab or newline.

Related to:

genehack commented 1 month ago

Making sure I understand: IANA TSVs can be used as if they're CSV-like TSVs — in other words, anything that accepts CSV-like TSVs will accept IANA TSVs without complaint.

The reverse is not true: something expecting an IANA TSV will not accept a CSV-like TSV that contains a field with an embedded tab or newline.

If both of those statements are accurate, settling on CSV-like TSVs feels like the most compatible option — why would we not want to do that?

tsibley commented 1 month ago

~Correct. And I agree.~ Actually, not correct.

While this issue is about consistency within Augur itself, it relates to consistency within the broader Nextstrain ecosystem when using TSVs. We use tsv-utils (IANA TSVs) a lot (e.g. in our workflows, in ad-hoc data processing) but don't habitually pass inputs thru csv2tsv --csv-delim $'\t' to convert from CSV-like TSVs to IANA TSVs first. We might need to start doing more of that if we keep Augur producing CSV-like TSVs.

corneliusroemer commented 1 month ago

Thanks @tsibley for raising this issue (and teaching me something new)! I was not fully conscious of there being 2 incompatible ways libraries/tools parse and generate TSVs.

Technically, RFC 4180 does not mention TSVs or any separators other than comma at all. So if we wanted to do what the most authoritative spec says, it would clearly be IANA TSV we should be using. However, it is true that many tools seem to treat TSVs as RFC 4180 CSVs with the only difference being the separator used.

Making sure I understand: IANA TSVs can be used as if they're CSV-like TSVs — in other words, anything that accepts CSV-like TSVs will accept IANA TSVs without complaint.

That's not correct. For each type, there are files that are valid in exactly one of them: a valid single column IANA TSV line like " is not valid per RFC CSV specs, so there's undefined behaviour, it might be read correctly, but it might also error. Also roundtripping through RFC-CSV might not be idempotent. E.g. Implementing tools one single quote is perfectly valdouble quotes ("""") will be wrongly (and silently wrongly) interpreted by a library that expects CSV-TSVs as only containing a single double quote ".

Example: Python's csv package if configured with delimiter="\t" misinterprets the valid IANA-TSV file A\n" (one column A, two lines, non-header line contains ") as having empty string on the second line. Round tripping causes a change, the output has 2 double quotes, instead of 1.

>>> import csv; d=list(csv.reader(['A','"'],delimiter="\t")); print(d);
[['A'], ['']]
>>> import io; output=io.StringIO(); csv.writer(output, delimiter="\t").writerows(d); print(output.getvalue())
A
""

The nice thing about the Python csv package is that it can be configured in both ways: compliant with IANA or compliant with RFC-CSV-TSV. So we can implement both in Augur with minimal (though breaking) changes. Whatever we decide, we break some augur command's handling of TSVs.

IANA TSVs are a good fit if your input fields never contain tabs nor newlines. Arguably that's the case in bioinformatics - i'm not sure I've ever encountered a tab or newline character in any metadata field from GISAID or NCBI. In fact, I would speculate that the reason TSVs are so commonly used in bioinformatics is precisely because one doesn't need to escape commas and double quotes. What's the point of using TSVs if you end up having to escape common characters (double quotes)? I'm pretty sure that double quotes are more commonly encountered in common augur input data than tabs or newlines.

Let me list all the options we have here, I think there's actually more than picking one or the other:

If we were to support both, one could also add an "auto" input mode, that dynamically decides the mode when encountering the first line that positively identifies the file as being according to IANA or RFC spec (i.e. when it encounters a line that's invalid as IANA but valid as RFC, use RFC mode and vice versa, sometimes both might be valid but have semantic differences, e.g. a "" field - in that case auto mode could either fail or use a default mode). Possibly, input "auto" mode could also be used to set the output mode automatically.

I think it would really help us if we analyzed input TSV files in our repos and see whether they are a) valid for both, or b) valid only for IANA, c) valid only for RFC. With some data, it's easier to make a good decision here.

Sorry for the essay here - this is quite a fascinating topic :)

We use tsv-utils (IANA TSVs) a lot (e.g. in our workflows, in ad-hoc data processing) but don't habitually pass inputs thru csv2tsv --csv-delim $'\t' to convert from CSV-like TSVs to IANA TSVs first. We might need to start doing more of that if we keep Augur producing CSV-like TSVs.

It's good to know that some RFC-TSVs can be converted to IANA as long as only " but not \t or \n are used in fields. But we should only do this if all of the downstream usage assumes IANA tsvs, otherwise, one would have to unnecessarily re-escape ".

I wonder if it might make sense to start using non-ambiguous extensions, e.g. .itsv (IANA) and .ctsv (RFC-CSV), to be explicit about which spec the file follows.

tsibley commented 4 weeks ago

Making sure I understand: IANA TSVs can be used as if they're CSV-like TSVs — in other words, anything that accepts CSV-like TSVs will accept IANA TSVs without complaint.

That's not correct. For each type, there are files that are valid in exactly one of them: a valid single column IANA TSV line like " is not valid per RFC CSV specs, so there's undefined behaviour, it might be read correctly, but it might also error.

Ah, yes, you're right. Thanks for this catch!

Example: Python's csv package if configured with delimiter="\t" misinterprets the valid IANA-TSV file A\n" (one column A, two lines, non-header line contains ") as having empty string on the second line. Round tripping causes a change, the output has 2 double quotes, instead of 1.

>>> import csv; d=list(csv.reader(['A','"'],delimiter="\t")); print(d);
[['A'], ['']]
>>> import io; output=io.StringIO(); csv.writer(output, delimiter="\t").writerows(d); print(output.getvalue())
A
""

The nice thing about the Python csv package is that it can be configured in both ways: compliant with IANA or compliant with RFC-CSV-TSV

In that example, the single double quote, ", isn't being doubled per se. It's being parsed to an empty string and then when writing that empty string, it's represented in quotes per RFC standards.

You can get the output you expect when writing a value of " if you do more than set delimiter, for example:

>>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['"']]);
A
"

but that's abusing quotechar and will break for values with a literal tab in them (which is ~expected, but it breaks for the wrong reason (no escape char instead of no quote char)).

>>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['\t']]);
A
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: need to escape, but no escapechar set

But even the dialect options I set above don't handle this case below of a single empty field:

>>> csv.writer(sys.stdout, delimiter="\t", quoting=csv.QUOTE_NONE, doublequote=False, quotechar="\t", escapechar=None).writerows([['A'],['']]);
A
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: single empty field record must be quoted

despite that having a valid IANA representation.

The csv module is really meant for RFC CSVs and related formats and it's non-trivial to make it produce true IANA TSVs. Maybe even impossible.

IANA TSVs are a good fit if your input fields never contain tabs nor newlines. Arguably that's the case in bioinformatics - i'm not sure I've ever encountered a tab or newline character in any metadata field from GISAID or NCBI.

I've seen plenty of tabs and newlines and other stuff in metadata before. If we're going to emit IANA TSVs for data we read from other formats, we need to be consistent about removing/replacing those chars before serialization. Using RFC TSVs means we don't have to worry about it: the csv library will take care of it for us.

Let me list all the options we have here, I think there's actually more than picking one or the other:

There are other options, like the ones you've listed, yes, but I think it would be much much simpler to reason about data and files and write documentation, help folks out, etc. if we could pick one format for Augur (and really, the broader Nextstrain ecosystem more generally).

tsibley commented 4 weeks ago

(I've updated the issue description to differentiate between IANA TSVs and IANA-like TSVs, i.e. what we can do with Python's csv.)

tsibley commented 1 week ago

I think it would be much much simpler to reason about data and files and write documentation, help folks out, etc. if we could pick one format for Augur (and really, the broader Nextstrain ecosystem more generally).

This is my main point with this issue. Currently to properly handle various Augur outputs requires knowing a priori which particular variant of TSVs it produces. It is not straightforward or clear, much less documented. We should pick one and stick to it and document it (and patterns for working with them).

From my viewpoint, since Python's csv module is so handy and used broadly in Augur but (AFAICT) can't actually produce IANA TSVs, the easiest option is to standardize our TSVs as RFC 4180 CSV-like TSVs.

genehack commented 1 week ago

From my viewpoint, since Python's csv module is so handy and used broadly in Augur but (AFAICT) can't actually produce IANA TSVs, the easiest option is to standardize our TSVs as RFC 4180 CSV-like TSVs.

+1

joverlee521 commented 4 hours ago

Discussion in https://github.com/nextstrain/measles/pull/52#discussion_r1755151375, prompted me to look into how Nextclade writes TSVs. I assume it's producing CSV-like TSVs since it's using the csv::WriteBuilder and only specifies the delimiter without changing any of the other defaults such as quoting.