psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Why are there forbidden characters in sequence names? #231

Closed (laserson closed 6 years ago)

laserson commented 7 years ago

Illumina sequence ids are chock-full-o-colons. Is this a hard requirement?

psathyrella commented 7 years ago

So both the python and c++ components make extensive use of hash maps for storing precomputed likelihoods and naive sequences (which are computationally prohibitive to recalculate), and also communicate with each other using csv files. In both of these, we need a way to refer to a cluster of sequences (i.e. to point to the likelihood that those sequences are clonal), and the way that's implemented now is as a colon-separated string, e.g. the cluster of sequences a, b, and c is a:b:c. The other forbidden characters, semicolon and comma, are I think only used in the latter (csv file) use case.
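To make the separator problem concrete, here is a minimal Python sketch of a colon-joined cluster key used as a hash map key. It is an illustration only, not partis's actual code: `cluster_key`, `split_key`, `cached_naive_seqs`, and the Illumina-style id are all hypothetical names and values, and the cached sequence is a placeholder.

```python
# Hypothetical sketch, not partis code: a cluster is referred to by its
# member ids joined with ':' and that string is used as a dictionary key.
SEP = ":"

def cluster_key(uids):
    """Join sequence ids into a single string key for a cluster."""
    return SEP.join(uids)

def split_key(key):
    """Recover the member ids from a cluster key."""
    return key.split(SEP)

# Cache keyed by cluster string (the value is a placeholder sequence).
cached_naive_seqs = {cluster_key(["a", "b", "c"]): "ACGGTGCGCGA"}

# A made-up id in the Illumina style, which contains six colons.
illumina_id = "M00123:45:000000000-ABCDE:1:1101:15589:1332"
key = cluster_key([illumina_id, "b"])

# The round trip no longer recovers two ids: the colons inside the id are
# indistinguishable from the separator, so the key splits into 8 pieces.
pieces = split_key(key)
```

With plain ids the round trip is exact (`split_key("a:b:c")` gives back `a`, `b`, `c`), which is why ids containing the separator have to be translated before they are joined.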

So short answer, I don't think there's a reasonable way around having some separator. But I don't think this should cause undue problems even if you have them in your sequence ids? It should print a warning that it's performing this mapping, and then proceed. If you have tons of forbidden characters in your ids, and if you also have the single-character replacements in your original ids, it's true you could run into collisions translating back from the output file, but that would be easy to fix by translating : to, say, __c__ instead of just c. I think I had it as something longer like __c__ originally, even.
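The collision described here can be sketched in a few lines of Python. This is a hedged illustration under assumptions, not the actual partis implementation: only the `:` to `c` and `:` to `__c__` replacements come from the comment above, while the replacements for `;` and `,` and the function names `translate` and `find_collisions` are made up for the example.

```python
# Hypothetical sketch of translating forbidden characters in sequence ids.
# Only ':' -> 'c' and ':' -> '__c__' come from the thread; the mappings for
# ';' and ',' are invented for illustration.
short_tokens = {":": "c", ";": "s", ",": "m"}
long_tokens = {":": "__c__", ";": "__s__", ",": "__m__"}

def translate(uid, replacements):
    """Replace each forbidden character in uid with its replacement token."""
    return "".join(replacements.get(ch, ch) for ch in uid)

def find_collisions(uids, replacements):
    """Return (first_uid, second_uid, translated) for ids that translate alike."""
    seen = {}
    collisions = []
    for uid in uids:
        new = translate(uid, replacements)
        if new in seen and seen[new] != uid:
            collisions.append((seen[new], uid, new))
        else:
            seen[new] = uid
    return collisions

uids = ["a:b", "acb"]
short_result = find_collisions(uids, short_tokens)  # "a:b" and "acb" both become "acb"
long_result = find_collisions(uids, long_tokens)    # "a__c__b" vs "acb": no clash
```

A longer token only makes collisions unlikely, not impossible (an input id could literally contain `__c__`), so a uniqueness check like `find_collisions` would still be worth running.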

That said, I'm definitely open to other suggestions -- I've only had one other report of input files with these characters, and I've never run into them myself (except in testing the current fix).

laserson commented 7 years ago

Is the colon-concatenated string only used internally? Could you instead use a tab or a non-printable character? Alternatively, is it too expensive to keep track of the original sequence identifiers and write them out later on?

The illumina fastq spec definitely has many colons: http://support.illumina.com/content/dam/illumina-support/help/BaseSpaceHelp_v2/Content/Vault/Informatics/Sequencing_Analysis/BS/swSEQ_mBS_FASTQFiles.htm

psathyrella commented 7 years ago

No, it's also in the output formats for annotation and partitioning. I like the output csvs being easily readable, both by humans and in less, so I'm not wild about using a non-printable character.

Not super expensive, but if I was gonna do something like that, it'd be very easy, whenever it detects forbidden characters, to write out an extra file with translations between the input names and the translated names. Then you can read the translation file and the output file at the same time, and go back to the original names in memory after you've parsed the csv output format.

For concreteness, if you have a: and :b as input sequence ids, say they get translated to a__c__ and __c__b (with checks to make sure the translations remain unique). A few output columns might then look like

unique_ids,seqs
a__c__:__c__b,ACGGTGCGCGA:CCGTGTGTGTGCGC

and the translation file would be

input_uid,translation
a:,a__c__
:b,__c__b
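
Reading the two files together and going back to the original names in memory could look roughly like the Python sketch below. It assumes the column names shown above (`unique_ids`, `seqs`, `input_uid`, `translation`) and uses in-memory strings in place of real files; the function names `load_translations` and `restore_ids` are hypothetical, and the translation file itself is only a proposal in this thread.

```python
import csv
import io

# The example output and translation files from above, as in-memory text.
output_csv = "unique_ids,seqs\na__c__:__c__b,ACGGTGCGCGA:CCGTGTGTGTGCGC\n"
translation_csv = "input_uid,translation\na:,a__c__\n:b,__c__b\n"

def load_translations(handle):
    """Map each translated name back to the original input name."""
    return {row["translation"]: row["input_uid"] for row in csv.DictReader(handle)}

def restore_ids(handle, translations):
    """Parse the output rows, splitting on ':' and restoring the original ids."""
    rows = []
    for row in csv.DictReader(handle):
        translated = row["unique_ids"].split(":")
        # Ids that were never translated are passed through unchanged.
        row["unique_ids"] = [translations.get(uid, uid) for uid in translated]
        row["seqs"] = row["seqs"].split(":")
        rows.append(row)
    return rows

translations = load_translations(io.StringIO(translation_csv))
rows = restore_ids(io.StringIO(output_csv), translations)
```

The restored ids are kept as a list rather than re-joined with colons, since the original names contain the separator and a joined string would be ambiguous again.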