Closed gregcaporaso closed 8 years ago
This is not possible yet, but could easily be added. Will look into it soon.
Great, this would be really helpful.
Do you want only the header of the representative sequences to be replaced by their sha1 hash in the uc file? If the identifiers of all sequences in the uc file are replaced by their sha1 hash, all sequences in the same cluster will get identical identifiers since their sequences are identical, making them somewhat meaningless.
Ah, good point, hadn't thought of that...
My use case is that I want to dereplicate sequences that were demultiplexed with QIIME, and build a BIOM table. For that purpose, I need to define OTU/observation ids, and the sha1 of the sequences is perfect for that as it would allow merging of tables that are created in different runs of vsearch. But, I'm realizing there is a problem with this: I need the sample identifiers (which are part of the input sequence identifiers in the QIIME demultiplexed sequence files) when I build the BIOM table so relabeling these in the uc file wouldn't work. The input file I'm working with looks like:
>69WD3WM3D1EYU_0 M00365:23:000000000-A5A85:1:1101:17187:1602 1:N:0:0 orig_bc=ATCTGCCTGGAA new_bc=ATCTGCCTGGAA bc_diffs=0
TACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTTAAGTCCGCCGTCAAATCCCAGGGCTCAACCCCGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCGTAGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAATG
>3N5W8XDEVDFMR_1 M00365:23:000000000-A5A85:1:1101:17158:1603 1:N:0:0 orig_bc=TCGAGACGCTTA new_bc=TCGAGACGCTTA bc_diffs=0
TACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTTGTAAGACAGTGGTGAAATCCCCGGGCTCAACCTGGGAACTGCCATTGTGACTGCAAGGCTAGGGTGCGGCAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGGCCTGCACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA
Where, for example, 69WD3WM3D1EYU
is my sample identifier, and 0
is an arbitrary sequence identifier.
Could the original sequence identifiers be included in the fasta comment field when passing --relabel_sha1
? That fasta would then go from looking like this:
>431c7dc4fd5f8e584d5fb9c6b8357b9b7a261737
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATAGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA
>0c7f96a224e198b80cf0c29213c7109ec720cdf6
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCCATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATGGCTTGAATCTTGGAGAGGCGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGACTCGCTGGACAAGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA
to this:
>431c7dc4fd5f8e584d5fb9c6b8357b9b7a261737 69WD3WM3D1EYU_0
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATAGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA
>0c7f96a224e198b80cf0c29213c7109ec720cdf6 3N5W8XDEVDFMR_1
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCCATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATGGCTTGAATCTTGGAGAGGCGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGACTCGCTGGACAAGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA
(not sure if those sha1 values match up to those exact sequences, just created the example).
Then I can parse the fasta file to build a map of the old to new identifiers from the fasta file to use when creating my BIOM table.
Hello @gregcaporaso
I want to dereplicate sequences that were demultiplexed with QIIME, and build a BIOM table. ... Could the original sequence identifiers be included in the fasta comment.
I would describe that as 'adding a label.' To me, 're-labeling' implies changing the names of the reads because each read represents something different.
69WD3WM3D1EYU
0c7f96a224e198df6
These methods all end with a fasta file. In order to build an abundance table / BIOM file, remap your original reads to this fasta file. For example:
[map_test]$ ls
seqs.fna
[map_test]$ vsearch --derep_fulllength seqs.fna --output seqs.derep.fna --relabel_sha1
[map_test]$ vsearch -usearch_global seqs.fna -db seqs.derep.fna -strand plus -id 1.0 -uc map_100.uc -threads 16
[map_test]$ ls
map_100.uc seqs.derep.fna seqs.fna
[map_test]$ head map_100.uc
H 40668 252 100.0 + 0 0 = LBNL.A.8_0 41394784bd8a763fdb9f0655ae6ada4ba9028476
H 90135 253 100.0 + 0 0 = UNC.A.2_6 20123f913c018156a801783b741f7cd6c0912971
H 60 253 100.0 + 0 0 = LBNL.B.2_8 0d7e725551fc8d38230695b78431cb7e19edd59e
H 2840 252 100.0 + 0 0 = UNC.A.8_11 3fa57742efd2a46660fe2340dc0582794301e8f9
H 16444 253 100.0 + 0 0 = UNC.A.8_4 47abd53ffaad23920ce1a384fab31ee377e77356
H 17 253 100.0 + 0 0 = LBNL.B.3_1 adb7f636c106b2cb763f53382c091821ded73e51
H 60 253 100.0 + 0 0 = LBNL.B.2_7 0d7e725551fc8d38230695b78431cb7e19edd59e
H 1317 252 100.0 + 0 0 = LBNL.A.8_10 607f3623014807e3555ed3726fff61a519a905a3
H 41006 252 100.0 + 0 0 = LBNL.A.8_12 e784317def5ab28d30cb577627dab4262fdfe0b3
H 159 253 100.0 + 0 0 = LBNL.A.8_13 bd5e6dc8b19bdd592ccfb5f1c59041f90f31435c
[map_test]$ uc2otutab_mod.py map_100.uc > biom_table.txt
[map_test]$ biom convert --table-type="OTU table" -i biom_table.txt -o biom_table.txt.biom --to-json
Does this do what you want it to do?
Edit: remove example of OTUs.
No, I'm not looking to do OTU clustering. Instead, I want the dereplicated sequences to represent my OTUs. I know how to do this, but including the hashs with the vsearch output that I'm already generating would speed up the process.
I want the dereplicated sequences to represent my OTUs.
Ah ok. The example code I provided does this. (I should not have mentioned OTU clustering in my example.)
Yes, but it's requiring two runs of vsearch where the second one is going to be relatively very slow. I have all of the information that I need from the vsearch --derep_fulllength
command that I posted above. I'm just looking to get a mapping of the original sequence id to the hash so I don't have to recompute the hashes to use them as OTU ids.
Point taken. Remapping with global alignments is very slow. Remapping with --search_exact
would be fast, but would still involve another step.
What if the --uc
output of --derep_fulllength
was different. Current version:
[data]$ head -n 5 seqs.derep.uc
S 0 252 * * * * * UNC.A.7_72 *
H 0 252 100.0 * 0 0 * LBNL.B.7_178 UNC.A.7_72
H 0 252 100.0 * 0 0 * LBNL.Ctl_184 UNC.A.7_72
H 0 252 100.0 * 0 0 * LBNL.A.7_331 UNC.A.7_72
H 0 252 100.0 * 0 0 * LBNL.A.7_335 UNC.A.7_72
What if the new sha1 label was listed in an additional column? Like this:
[data]$ head -n 5 rdp_derep.uc
S 0 252 * * * * * UNC.A.7_72 * abc4a8123e72881889eee5728ea
H 0 252 100.0 * 0 0 * LBNL.B.7_178 UNC.A.7_72 abc4a8123e72881889eee5728ea
H 0 252 100.0 * 0 0 * LBNL.Ctl_184 UNC.A.7_72 abc4a8123e72881889eee5728ea
H 0 252 100.0 * 0 0 * LBNL.A.7_331 UNC.A.7_72 abc4a8123e72881889eee5728ea
H 0 252 100.0 * 0 0 * LBNL.A.7_335 UNC.A.7_72 abc4a8123e72881889eee5728ea
(made up, example hashes)
You can then pull those hashes from the .uc
output or parse that into a .biom table.
Edit: I'm not sure I like this option because it breaks compatibility with USEARCH. There has got to be a better option, and I think it may be remapping with --search_exact
.
@colinbrislawn I'll try to add the search_exact command soon. Thanks for the suggestion.
What if we always include the full previous FASTA header after a space after the new identifier (sha1/md5/prefix+number) when relabelling?
Example:
Before relabelling:
>69WD3WM3D1EYU_0 M00365:23:000000000-A5A85:1:1101:17187:1602 1:N:0:0 orig_bc=ATCTGCCTGGAA new_bc=ATCTGCCTGGAA bc_diffs=0
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATAGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA
After relabelling with sha1:
>431c7dc4fd5f8e584d5fb9c6b8357b9b7a261737 69WD3WM3D1EYU_0 M00365:23:000000000-A5A85:1:1101:17187:1602 1:N:0:0 orig_bc=ATCTGCCTGGAA new_bc=ATCTGCCTGGAA bc_diffs=0
TACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTATTCAAGTCAGAGGTGAAAGCCC
GGGGCTCAACCCCGGAACTGCCTTTGAAACTAGATAGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTG
AAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTG
GGGAGCAAACA
I think will be quite general and should break few scripts. We could introduce another option, e.g. --relabel_keep
(or similar) to include the old header.
@torognes, that solution works. And the resulting file is still valid fasta, so no (correct) fasta parsers should break. Thanks for the quick responses on this. vsearch is awesome!
@colinbrislawn, your solution would result in the file not being .uc anymore like you mention. While the content would be what we need, there's a lot of benefit to sticking with a file format that other tools are already using. (As a side note, my initial thought was that an adapted uc file could just add the seed's hash in the last existing column for the S
, H
and C
lines, and therefore not need a new column. I don't think this is a good solution either though, for the same reason - vsearch should continue to output standard uc files. The last thing bioinformatics needs is yet another file format.)
@torognes I really like your idea of having --relabel
replace current labels by default, and the optional setting --relabel_keep
to preserve labels as Greg suggested.
@gregcaporaso I agree that preserving file formats is important, and really like Torbjørn's suggestion because it yields valid .fasta and .uc files.
I also like the idea of partitioning functionality between functions and files types. Parsing a .fasta file into an abundance table seems very strange to me, while parsing a .uc file of hits seems natural. I guess this is why I like the idea of --derep_fulllength
followed by --search_exact
. I'm struggling to describe this distinction...
(I appreciate this discussion and look forward to increased integration of qiime and vsearch!)
The --relabel_keep
option is added in vsearch 1.7.0. The old identifier can be found after the new label and a space in the new headers.
Remember that there is also an --notrunclabels
option to keep the entire header as the identifier, not just the part before the space.
Excellent, thanks so much!
I'm using vsearch in dereplication mode. In my output fasta I would like to have the sequences relabeled based on their sha1, and I'd like to have a
.uc
file generated as well. This works with the following command:However the issue I'm running into is that the sequence identifiers in the
.uc
file are the original identifiers, not the relabeled (sha1) identifiers. Would it be possible for vsearch to write the relabeled identifiers instead of the original identifiers to the.uc
file? (Or is it possible, and I'm just missing that option?)Thanks!