qiime2 / q2-vsearch

vsearch plugin for QIIME 2
BSD 3-Clause "New" or "Revised" License

ENH: `dereplicate-sequences` expose parameter to disable sequence hash IDs #55

Open nbokulich opened 6 years ago

nbokulich commented 6 years ago

Improvement Description

Similar to q2-dada2 and q2-deblur, there should be an option to use the unhashed sequences as their own IDs instead of using a hash ID in dereplicate-sequences.

Current Behavior

Seq hashes are used by default.

Proposed Behavior

Expose a --p-hashed-feature-ids parameter to choose how sequence IDs get handled.

References

  1. forum xref

colinbrislawn commented 6 years ago

Should we request this as a feature of vsearch? Vsearch currently supports:

--relabel string
  Relabel sequences using the prefix string and a ticker
--relabel_md5
--relabel_sha1
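
For readers unfamiliar with hash-based labels, here is a minimal sketch of what an MD5 or SHA-1 label for a sequence looks like. It uses only Python's standard `hashlib`; the helper name `hashed_label` is hypothetical, and the sketch assumes the hash is taken over the sequence string exactly as given. Whether vsearch normalizes the sequence (for example its case) before hashing is not covered here.

```python
import hashlib


def hashed_label(sequence: str, algorithm: str = "sha1") -> str:
    """Return the hex digest of a sequence, usable as a FASTA label.

    Hypothetical helper for illustration; not part of vsearch or q2-vsearch.
    """
    if algorithm not in ("md5", "sha1"):
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    # Assumption: the digest is computed over the raw ASCII sequence string.
    return hashlib.new(algorithm, sequence.encode("ascii")).hexdigest()


if __name__ == "__main__":
    seq = "ACGTACGT"
    print(">" + hashed_label(seq, "md5"))   # 32 hex characters
    print(">" + hashed_label(seq, "sha1"))  # 40 hex characters
```

Identical sequences always receive identical labels, which is what makes hash IDs comparable across runs that use the same algorithm.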

Colin

colinbrislawn commented 6 years ago

Wait... there are several, nested feature requests here!

Is this what we want?

>ACTTTTTTG
ACTTTTTTG

Having a sequence with identical ID and sequence seems a little silly to me, but if both dada2 and deblur implement this natively, then I'm comfortable requesting it for vsearch. However, if this is an option within the q2 plugins, maybe we should implement this within Q2-vsearch.

Colin

colinbrislawn commented 5 years ago

Hello @torognes, what do you think about a --relabel_self option in vsearch that relabels fasta headers so they are identical to their sequences? Like this:

>GCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGT
GCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGT
>CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG
CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG
>CCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGC
CCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGC
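
The requested behavior can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how vsearch implements it; the function names `relabel_self` and `to_fasta` are hypothetical, and records are assumed to be simple `(label, sequence)` pairs.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (label, sequence)


def relabel_self(records: Iterable[Record]) -> Iterator[Record]:
    """Replace each record's label with its own sequence."""
    for _old_label, sequence in records:
        yield sequence, sequence


def to_fasta(records: Iterable[Record]) -> str:
    """Serialize (label, sequence) pairs as unwrapped FASTA text."""
    return "".join(f">{label}\n{sequence}\n" for label, sequence in records)


if __name__ == "__main__":
    reads = [("read1", "ACGT"), ("read2", "TTGA")]
    print(to_fasta(relabel_self(reads)), end="")
```

Because the original label is discarded, any downstream step that needs it has to carry it separately.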

torognes commented 5 years ago

Hi @colinbrislawn, yes, that's a feature that should be easy to add to vsearch. I'll add it to issues for vsearch and implement it soon.

colinbrislawn commented 5 years ago

Thanks @torognes!

@nbokulich I'll add --p-hashed-feature-ids / --p-no-hashed-feature-ids to match dada2 and deblur.

As far as I can see, the reads will be hashed with sha1, which conflicts with the md5 of dada2... Should we make an option for other values or keep vsearch consistent with dada2 and deblur?

nbokulich commented 5 years ago

Thanks @colinbrislawn !

Looks like VSEARCH has both --relabel_md5 and --relabel_sha1 options. So in q2-vsearch instead of a boolean option hashed_feature_ids you could make this a multi-choice string. Something like: hashed_feature_ids = Str % Choices(['md5', 'sha1', 'unhashed'])
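
A rough sketch of how such a multi-choice value could be translated into a vsearch flag is shown below. It is plain Python with no QIIME 2 imports; the names `RELABEL_FLAGS` and `relabel_args` are hypothetical, the `sha1` default simply mirrors the current hashing mentioned above, and `unhashed` is mapped to the `--relabel_self` option proposed earlier in this thread.

```python
from typing import Dict, List

# Hypothetical mapping from the proposed choices to vsearch relabel options.
RELABEL_FLAGS: Dict[str, str] = {
    "md5": "--relabel_md5",
    "sha1": "--relabel_sha1",
    "unhashed": "--relabel_self",  # assumes a vsearch build that has this option
}


def relabel_args(hashed_feature_ids: str = "sha1") -> List[str]:
    """Return the vsearch arguments for the chosen feature ID style."""
    try:
        return [RELABEL_FLAGS[hashed_feature_ids]]
    except KeyError:
        choices = ", ".join(sorted(RELABEL_FLAGS))
        raise ValueError(
            f"hashed_feature_ids must be one of: {choices}"
        ) from None


if __name__ == "__main__":
    for choice in ("md5", "sha1", "unhashed"):
        print(choice, relabel_args(choice))
```

Keeping the mapping in one dictionary means a new ID style only needs one new entry plus a matching choice in the parameter definition.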

colinbrislawn commented 5 years ago

So --relabel_self is now in vsearch v2.14.0 and up. All our options are on the table.

Looks like both this issue and #48 can't be closed until the vsearch version is bumped. While we wait for the bump, I'll try to get this PR submitted before the October 18th deadline.

colinbrislawn commented 2 years ago

It looks like removing the hashes breaks this section:

 id_map = {e.metadata['description']: e.metadata['id']
              for e in skbio.io.read(str(dereplicated_sequences),

With just a sample ID, instead of hash + sample ID, this section breaks.
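
To make the failure concrete, here is a small stand-alone sketch that mimics the mapping without scikit-bio. It assumes the FASTA reader splits each header at the first whitespace into an ID and a description; the helper names and the example headers are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split a FASTA header into (id, description) at the first whitespace."""
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


def build_id_map(headers: Iterable[str]) -> Dict[str, str]:
    """Map description -> id, mirroring the dict comprehension above."""
    return {desc: seq_id for seq_id, desc in map(split_header, headers)}


if __name__ == "__main__":
    # Hypothetical headers: "<hash> <original label>" versus label only.
    hashed = ["aaa111 sampleA_1", "bbb222 sampleB_7"]
    unhashed = ["sampleA_1", "sampleB_7"]
    print(build_id_map(hashed))    # one entry per original label
    print(build_id_map(unhashed))  # every key is the empty description
```

With label-only headers every description is empty, so all records collapse onto a single key and the mapping from original labels to feature IDs is lost.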

What's the recommended way to build this id_map without hashes?