possible for .uc files to be relabel_sha1-aware?

gregcaporaso commented 9 years ago

I'm using vsearch in dereplication mode. In my output fasta I would like to have the sequences relabeled based on their sha1, and I'd like to have a .uc file generated as well. This works with the following command:

vsearch --derep_fulllength seqs.fna --output vsearch-derep.fna --uc vsearch-derep.uc --relabel_sha1

However the issue I'm running into is that the sequence identifiers in the .uc file are the original identifiers, not the relabeled (sha1) identifiers. Would it be possible for vsearch to write the relabeled identifiers instead of the original identifiers to the .uc file? (Or is it possible, and I'm just missing that option?)

Thanks!

torognes commented 9 years ago

This is not possible yet, but could easily be added. Will look into it soon.

gregcaporaso commented 9 years ago

Great, this would be really helpful.

torognes commented 9 years ago

Do you want only the header of the representative sequences to be replaced by their sha1 hash in the uc file? If the identifiers of all sequences in the uc file are replaced by their sha1 hash, all sequences in the same cluster will get identical identifiers since their sequences are identical, making them somewhat meaningless.

gregcaporaso commented 9 years ago

Ah, good point, hadn't thought of that...

My use case is that I want to dereplicate sequences that were demultiplexed with QIIME, and build a BIOM table. For that purpose, I need to define OTU/observation ids, and the sha1 of the sequences is perfect for that as it would allow merging of tables that are created in different runs of vsearch. But, I'm realizing there is a problem with this: I need the sample identifiers (which are part of the input sequence identifiers in the QIIME demultiplexed sequence files) when I build the BIOM table so relabeling these in the uc file wouldn't work. The input file I'm working with looks like:

>69WD3WM3D1EYU_0 M00365:23:000000000-A5A85:1:1101:17187:1602 1:N:0:0 orig_bc=ATCTGCCTGGAA new_bc=ATCTGCCTGGAA bc_diffs=0
TACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTTAAGTCCGCCGTCAAATCCCAGGGCTCAACCCCGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCGTAGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAATG
>3N5W8XDEVDFMR_1 M00365:23:000000000-A5A85:1:1101:17158:1603 1:N:0:0 orig_bc=TCGAGACGCTTA new_bc=TCGAGACGCTTA bc_diffs=0
TACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTTGTAAGACAGTGGTGAAATCCCCGGGCTCAACCTGGGAACTGCCATTGTGACTGCAAGGCTAGGGTGCGGCAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGGCCTGCACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA

Where, for example, 69WD3WM3D1EYU is my sample identifier, and 0 is an arbitrary sequence identifier.

Could the original sequence identifiers be included in the fasta comment field when passing --relabel_sha1? That fasta would then go from looking like this:

>431c7dc4fd5f8e584d5fb9c6b8357b9b7a261737 
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATAGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA
>0c7f96a224e198b80cf0c29213c7109ec720cdf6 
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCCATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATGGCTTGAATCTTGGAGAGGCGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGACTCGCTGGACAAGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA

to this:

>431c7dc4fd5f8e584d5fb9c6b8357b9b7a261737 69WD3WM3D1EYU_0
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATAGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA
>0c7f96a224e198b80cf0c29213c7109ec720cdf6 3N5W8XDEVDFMR_1
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCCATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATGGCTTGAATCTTGGAGAGGCGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGACTCGCTGGACAAGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA

(not sure if those sha1 values match up to those exact sequences, just created the example).

Then I can parse the fasta file to build a map of the old to new identifiers from the fasta file to use when creating my BIOM table.

colinbrislawn commented 9 years ago

Hello @gregcaporaso

I want to dereplicate sequences that were demultiplexed with QIIME, and build a BIOM table. ... Could the original sequence identifiers be included in the fasta comment.

I would describe that as 'adding a label.' To me, 're-labeling' implies changing the names of the reads because each read represents something different.

Your original fasta entries represent single observations from samples. Because of this, each read is labeled with its sample name, like 69WD3WM3D1EYU
After dereplication, these fasta entries now represent every single observation in the full study. Because of this, reach read is each labeled with its hash, like 0c7f96a224e198df6

These methods all end with a fasta file. In order to build an abundance table / BIOM file, remap your original reads to this fasta file. For example:

[map_test]$ ls
seqs.fna

[map_test]$ vsearch --derep_fulllength seqs.fna --output seqs.derep.fna --relabel_sha1
[map_test]$ vsearch -usearch_global seqs.fna -db seqs.derep.fna -strand plus -id 1.0 -uc map_100.uc -threads 16
[map_test]$ ls
map_100.uc  seqs.derep.fna  seqs.fna

[map_test]$ head map_100.uc 
H   40668   252 100.0   +   0   0   =   LBNL.A.8_0  41394784bd8a763fdb9f0655ae6ada4ba9028476
H   90135   253 100.0   +   0   0   =   UNC.A.2_6   20123f913c018156a801783b741f7cd6c0912971
H   60  253 100.0   +   0   0   =   LBNL.B.2_8  0d7e725551fc8d38230695b78431cb7e19edd59e
H   2840    252 100.0   +   0   0   =   UNC.A.8_11  3fa57742efd2a46660fe2340dc0582794301e8f9
H   16444   253 100.0   +   0   0   =   UNC.A.8_4   47abd53ffaad23920ce1a384fab31ee377e77356
H   17  253 100.0   +   0   0   =   LBNL.B.3_1  adb7f636c106b2cb763f53382c091821ded73e51
H   60  253 100.0   +   0   0   =   LBNL.B.2_7  0d7e725551fc8d38230695b78431cb7e19edd59e
H   1317    252 100.0   +   0   0   =   LBNL.A.8_10 607f3623014807e3555ed3726fff61a519a905a3
H   41006   252 100.0   +   0   0   =   LBNL.A.8_12 e784317def5ab28d30cb577627dab4262fdfe0b3
H   159 253 100.0   +   0   0   =   LBNL.A.8_13 bd5e6dc8b19bdd592ccfb5f1c59041f90f31435c

[map_test]$ uc2otutab_mod.py map_100.uc > biom_table.txt

[map_test]$ biom convert --table-type="OTU table" -i biom_table.txt -o biom_table.txt.biom --to-json

Does this do what you want it to do?

Edit: remove example of OTUs.

gregcaporaso commented 9 years ago

No, I'm not looking to do OTU clustering. Instead, I want the dereplicated sequences to represent my OTUs. I know how to do this, but including the hashs with the vsearch output that I'm already generating would speed up the process.

colinbrislawn commented 9 years ago

I want the dereplicated sequences to represent my OTUs.

Ah ok. The example code I provided does this. (I should not have mentioned OTU clustering in my example.)

gregcaporaso commented 9 years ago

Yes, but it's requiring two runs of vsearch where the second one is going to be relatively very slow. I have all of the information that I need from the vsearch --derep_fulllength command that I posted above. I'm just looking to get a mapping of the original sequence id to the hash so I don't have to recompute the hashes to use them as OTU ids.

colinbrislawn commented 9 years ago

Point taken. Remapping with global alignments is very slow. Remapping with --search_exact would be fast, but would still involve another step.

What if the --uc output of --derep_fulllength was different. Current version:

[data]$ head -n 5 seqs.derep.uc 
S   0   252 *   *   *   *   *   UNC.A.7_72  *
H   0   252 100.0   *   0   0   *   LBNL.B.7_178    UNC.A.7_72
H   0   252 100.0   *   0   0   *   LBNL.Ctl_184    UNC.A.7_72
H   0   252 100.0   *   0   0   *   LBNL.A.7_331    UNC.A.7_72
H   0   252 100.0   *   0   0   *   LBNL.A.7_335    UNC.A.7_72

What if the new sha1 label was listed in an additional column? Like this:

[data]$ head -n 5 rdp_derep.uc 
S   0   252 *   *   *   *   *   UNC.A.7_72  *   abc4a8123e72881889eee5728ea
H   0   252 100.0   *   0   0   *   LBNL.B.7_178    UNC.A.7_72  abc4a8123e72881889eee5728ea
H   0   252 100.0   *   0   0   *   LBNL.Ctl_184    UNC.A.7_72  abc4a8123e72881889eee5728ea
H   0   252 100.0   *   0   0   *   LBNL.A.7_331    UNC.A.7_72  abc4a8123e72881889eee5728ea
H   0   252 100.0   *   0   0   *   LBNL.A.7_335    UNC.A.7_72  abc4a8123e72881889eee5728ea
(made up, example hashes)

You can then pull those hashes from the .uc output or parse that into a .biom table.

Edit: I'm not sure I like this option because it breaks compatibility with USEARCH. There has got to be a better option, and I think it may be remapping with --search_exact.

torognes commented 9 years ago

@colinbrislawn I'll try to add the search_exact command soon. Thanks for the suggestion.

torognes commented 9 years ago

What if we always include the full previous FASTA header after a space after the new identifier (sha1/md5/prefix+number) when relabelling?

Example:

Before relabelling:

>69WD3WM3D1EYU_0 M00365:23:000000000-A5A85:1:1101:17187:1602 1:N:0:0 orig_bc=ATCTGCCTGGAA new_bc=ATCTGCCTGGAA bc_diffs=0
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATAGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA

After relabelling with sha1:

>431c7dc4fd5f8e584d5fb9c6b8357b9b7a261737 69WD3WM3D1EYU_0 M00365:23:000000000-A5A85:1:1101:17187:1602 1:N:0:0 orig_bc=ATCTGCCTGGAA new_bc=ATCTGCCTGGAA bc_diffs=0
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATAGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA

I think will be quite general and should break few scripts. We could introduce another option, e.g. --relabel_keep (or similar) to include the old header.

gregcaporaso commented 9 years ago

@torognes, that solution works. And the resulting file is still valid fasta, so no (correct) fasta parsers should break. Thanks for the quick responses on this. vsearch is awesome!

@colinbrislawn, your solution would result in the file not being .uc anymore like you mention. While the content would be what we need, there's a lot of benefit to sticking with a file format that other tools are already using. (As a side note, my initial thought was that an adapted uc file could just add the seed's hash in the last existing column for the S, H and C lines, and therefore not need a new column. I don't think this is a good solution either though, for the same reason - vsearch should continue to output standard uc files. The last thing bioinformatics needs is yet another file format.)

colinbrislawn commented 9 years ago

@torognes I really like your idea of having --relabel replace current labels by default, and the optional setting --relabel_keep to preserve labels as Greg suggested.

@gregcaporaso I agree that preserving file formats is important, and really like Torbjørn's suggestion because it yields valid .fasta and .uc files. I also like the idea of partitioning functionality between functions and files types. Parsing a .fasta file into an abundance table seems very strange to me, while parsing a .uc file of hits seems natural. I guess this is why I like the idea of --derep_fulllength followed by --search_exact. I'm struggling to describe this distinction... (I appreciate this discussion and look forward to increased integration of qiime and vsearch!)

torognes commented 8 years ago

The --relabel_keep option is added in vsearch 1.7.0. The old identifier can be found after the new label and a space in the new headers.

Remember that there is also an --notrunclabels option to keep the entire header as the identifier, not just the part before the space.

gregcaporaso commented 8 years ago

Excellent, thanks so much!

torognes / vsearch

possible for .uc files to be relabel_sha1-aware? #129