Open nbokulich opened 6 years ago
Should we request this as a feature of vsearch? Vsearch currently supports:
--relabel string
Relabel sequences using the prefix string and a ticker
--relabel_md5
--relabel_sha1
Colin
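For reference, here is a minimal Python sketch of what hash-based relabeling amounts to: the label becomes a hex digest of the sequence itself. The function name `hashed_label` is hypothetical, and hashing the upper-cased sequence text is an assumption of this sketch, not a confirmed description of vsearch's internals.

```python
import hashlib


def hashed_label(sequence: str, algorithm: str = "sha1") -> str:
    """Return a hex digest of the sequence, usable as a FASTA label.

    Hypothetical helper for illustration only.
    """
    if algorithm not in ("md5", "sha1"):
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    # Assumption: the digest is computed over the upper-cased sequence text.
    data = sequence.upper().encode("ascii")
    return hashlib.new(algorithm, data).hexdigest()


if __name__ == "__main__":
    seq = "ACTTTTTTG"
    print(">" + hashed_label(seq, "md5"))   # 32 hex characters
    print(">" + hashed_label(seq, "sha1"))  # 40 hex characters
```

Because the label depends only on the sequence, the same sequence gets the same ID in every run, which is what makes hashed IDs comparable across datasets.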
Wait... there are several, nested feature requests here!
Is this what we want?
>ACTTTTTTG
ACTTTTTTG
Having a sequence with identical ID and sequence seems a little silly to me, but if both dada2 and deblur implement this natively, then I'm comfortable requesting it for vsearch. However, if this is an option within the q2 plugins, maybe we should implement this within Q2-vsearch.
Colin
Hello @torognes, what do you think about a --relabel_self option in vsearch that relabels fasta headers so they are identical to their sequences? Like this:
>GCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGT
GCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGT
>CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG
CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG
>CCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGC
CCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGC
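The requested behavior can be sketched in a few lines of plain Python. This is only an illustration of the output format shown above, not vsearch's implementation; the function name `relabel_self` is borrowed from the proposed flag, and joining wrapped sequence lines is an assumption of this sketch.

```python
def relabel_self(fasta_text: str) -> str:
    """Rewrite FASTA text so that each header is the record's own sequence."""
    sequences = []
    current = None  # sequence lines of the record being read, if any
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if current is not None:
                sequences.append("".join(current))
            current = []
        elif current is not None:
            # Assumption: wrapped sequence lines are joined into one line.
            current.append(line)
    if current is not None:
        sequences.append("".join(current))
    # The original labels are discarded; the sequence serves as its own ID.
    return "".join(f">{seq}\n{seq}\n" for seq in sequences)


if __name__ == "__main__":
    print(relabel_self(">seq1\nACTTTTTTG\n>seq2\nGGCC\nAATT\n"), end="")
```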
Hi @colinbrislawn, yes, that's a feature that should be easy to add to vsearch. I'll add it to issues for vsearch and implement it soon.
Thanks @torognes!
@nbokulich I'll add --p-hashed-feature-ids / --p-no-hashed-feature-ids to match dada2 and deblur.
As far as I can see, the reads will be hashed with sha1, which conflicts with the md5 of dada2... Should we make an option for other values or keep vsearch consistent with dada2 and deblur?
Thanks @colinbrislawn !
Looks like VSEARCH has both --relabel_md5 and --relabel_sha1 options. So in q2-vsearch instead of a boolean option hashed_feature_ids you could make this a multi-choice string. Something like: hashed_feature_ids = Str % Choices(['md5', 'sha1', 'unhashed'])
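One way the multi-choice idea could be wired up, as a rough sketch: map each choice to the matching vsearch relabel flag. The names `RELABEL_FLAGS` and `relabel_args` are hypothetical and not part of q2-vsearch; mapping 'unhashed' to --relabel_self and defaulting to 'sha1' are assumptions based on this thread.

```python
# Hypothetical mapping from the proposed choices to vsearch relabel flags.
RELABEL_FLAGS = {
    "md5": "--relabel_md5",
    "sha1": "--relabel_sha1",
    # Assumption: 'unhashed' means each sequence is used as its own ID.
    "unhashed": "--relabel_self",
}


def relabel_args(hashed_feature_ids: str = "sha1") -> list:
    """Return the vsearch command-line arguments for the chosen ID style."""
    try:
        return [RELABEL_FLAGS[hashed_feature_ids]]
    except KeyError:
        choices = ", ".join(sorted(RELABEL_FLAGS))
        raise ValueError(
            f"hashed_feature_ids must be one of: {choices}"
        ) from None


if __name__ == "__main__":
    for choice in ("md5", "sha1", "unhashed"):
        print(choice, relabel_args(choice))
```

A single string parameter keeps the md5 vs. sha1 question open to the user instead of baking one hash into a boolean flag.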
So --relabel_self is now in vsearch v2.14.0 and up. All our options are on the table.
Looks like both this issue and #48 can't be closed until the vsearch version is bumped. While we wait for the bump, I'll try to get this PR submitted before the October 18th deadline.
It looks like removing the hashes breaks this section:
id_map = {e.metadata['description']: e.metadata['id']
for e in skbio.io.read(str(dereplicated_sequences),
With just a sample ID, instead of hash + sample ID, this section breaks.
What's the recommended way to build this id_map without hashes?
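To make the failure mode concrete, here is a plain-Python sketch of how a FASTA header splits into an ID and a description at the first whitespace, which is how scikit-bio populates `metadata['id']` and `metadata['description']`. With only a sample ID in the header, the description is empty, so every key in the map would be the same empty string. The helpers `split_header` and `build_id_map`, and the fallback of keying on the ID itself, are hypothetical and only one possible approach, not a recommendation from the q2-vsearch maintainers.

```python
def split_header(header: str):
    """Split a FASTA header into (id, description) at the first whitespace."""
    parts = header.lstrip(">").strip().split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


def build_id_map(headers):
    """Map each original read label to the feature ID it was assigned."""
    id_map = {}
    for header in headers:
        seq_id, description = split_header(header)
        # Hypothetical fallback: when there is no description (the header
        # was not relabeled), the read label maps to itself.
        key = description if description else seq_id
        id_map[key] = seq_id
    return id_map


if __name__ == "__main__":
    # Relabeled header: hash first, original label kept as the description.
    print(build_id_map([">abc123 sample1_0"]))
    # Header without relabeling: only the original label is present.
    print(build_id_map([">sample1_0"]))
```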
Improvement Description
Similar to q2-dada2 and q2-deblur, there should be an option to use the unhashed sequences as their own IDs instead of using a hash ID in dereplicate-sequences.
Current Behavior
Seq hashes are used by default.
Proposed Behavior
Expose a --p-hashed-feature-ids parameter to choose how sequence IDs get handled.
References