qiime2 / q2-vsearch

vsearch plugin for QIIME 2
BSD 3-Clause "New" or "Revised" License

BUG: derep-seqs produces mismatched feature IDs #57

Closed: thermokarst closed this issue 1 year ago

thermokarst commented 5 years ago

Bug Description

vsearch dereplicate-sequences adds Sample ID information to the Feature IDs of the FeatureData[Sequence] output, but not to the Feature IDs of the FeatureTable[Frequency] output. This causes problems downstream when attempting to use Actions that require both as inputs: the Feature IDs no longer match.

Steps to reproduce the behavior

  1. Run dereplicate-sequences on any input
  2. Export the FeatureData[Sequence]
  3. Note the modified Feature IDs: e.g. >9142dd139a96f63aba52d4e88bdcb803981e3467 L2S204_3773

Expected behavior

Feature IDs should be consistent in both outputs. There is probably no need to include the Sample ID info that is being tacked on.

References

  1. Original forum post

torognes commented 5 years ago

This issue is probably due to the use of the --relabel_keep option when running vsearch --derep_fulllength with relabelling. It adds the old label after the new label.
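
To make the effect of keeping the old label concrete, here is a minimal Python sketch of the header layout described above. It is an illustration only, not vsearch's implementation: the helper name `relabel_sha1` is hypothetical, and it assumes the new label is the SHA-1 hex digest of the sequence text as given. The sequence used is made up.

```python
import hashlib


def relabel_sha1(old_label: str, sequence: str, keep_old: bool = True) -> str:
    """Build a FASTA header line in the '<new label> <old label>' layout.

    Hypothetical helper for illustration; not vsearch's own code.
    """
    # Assumption: the new label is the SHA-1 hex digest of the sequence text.
    new_label = hashlib.sha1(sequence.encode("ascii")).hexdigest()
    # When the old label is kept, it follows the new label after a space,
    # so it ends up in the FASTA description field, not in the ID itself.
    return f">{new_label} {old_label}" if keep_old else f">{new_label}"


# The sequence here is invented; only the header layout matters.
header = relabel_sha1("L2S204_3773", "ACGTACGT")
feature_id, description = header[1:].split(" ", 1)
```

Here `feature_id` is the 40-character hash and `description` is the original read label, which matches the shape of the header quoted in the bug report.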

Oddant1 commented 4 years ago

Simply removing the --relabel_keep flag causes the following exception when attempting to update ids in the biom table: "Mapping not provided for observation identifier: L2S204_3773. If this identifier should not be updated, pass strict=False." Passing strict=False, unsurprisingly, causes the ids to not show up in the final biom table. Can we maybe not use the SHA at all and only use the sample_id somehow?
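
One possible reading of this error, sketched in plain Python: the id update needs a mapping from each old read label to its new hash, and that mapping can only be recovered if the old label survives in the header. This is an assumption about the mechanism, not q2-vsearch's actual code. `build_id_map` and `update_ids` are hypothetical helpers that only loosely mimic the strict and non-strict behavior described above, and the short ids like `sha_a` are placeholders.

```python
def build_id_map(headers):
    """Map old read labels to new ids from '>new_id old_label' header lines."""
    id_map = {}
    for header in headers:
        parts = header.lstrip(">").split(None, 1)
        if len(parts) == 2:  # the old label is only there if it was kept
            new_id, old_label = parts
            id_map[old_label.strip()] = new_id
    return id_map


def update_ids(ids, id_map, strict=True):
    """Rename ids via id_map; in strict mode an unmapped id is an error."""
    updated = []
    for old_id in ids:
        if old_id in id_map:
            updated.append(id_map[old_id])
        elif strict:
            raise KeyError(f"Mapping not provided for identifier: {old_id}")
        else:
            updated.append(old_id)  # left unchanged when not strict
    return updated


with_old_label = build_id_map([">sha_a L2S204_3773"])
without_old_label = build_id_map([">sha_a"])
```

With `with_old_label`, the read label can be renamed to its hash. With `without_old_label`, the map is empty, so strict mode raises and non-strict mode leaves the old label in place.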

Oddant1 commented 4 years ago

I've been playing with this issue during some downtime today, and it doesn't look like a satisfactory solution will come from just playing with what we pass to vsearch. Unfortunately, I'm getting a stunted traceback and no reported log file from the errors being caused by messing with the vsearch command, but I'll open a PR once I have some idea of what a solution will look like.

thermokarst commented 3 years ago

My understanding of this was incorrect: when I first wrote this issue, I wasn't aware that some/many FASTA dialects support a description field, so I assumed that vsearch was somehow mutating the feature IDs in the FASTA file. That is not the case. I actually don't think we should remove these description fields, since many tools that consume FASTA do know what to do with them. I think the originating Forum post that brought this up encountered a tool that isn't aware of the description field: fasttree. Fasttree appears to just copy the entire header line into the phylogeny's tips, as though the whole line was an ID. I poked around, and don't have a great solution. Perhaps a new method for stripping the description fields would work? Either way, I'm closing this issue; I don't think this is something that needs to be fixed.
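
A method for stripping description fields could look roughly like the following Python sketch. It is not an existing QIIME 2 or q2-vsearch method: the function name `strip_fasta_descriptions` is hypothetical, and it assumes the ID is everything between `>` and the first whitespace on a header line.

```python
def strip_fasta_descriptions(lines):
    """Yield FASTA lines with any header description removed.

    Hypothetical helper; assumes the ID ends at the first whitespace.
    """
    for line in lines:
        if line.startswith(">"):
            # Keep only the ID, dropping the description that follows it.
            fields = line[1:].split(None, 1)
            yield ">" + (fields[0] if fields else "")
        else:
            # Sequence lines pass through, minus the trailing newline.
            yield line.rstrip("\n")


records = [
    ">9142dd139a96f63aba52d4e88bdcb803981e3467 L2S204_3773\n",
    "ACGTACGT\n",  # invented sequence, for illustration only
]
cleaned = list(strip_fasta_descriptions(records))
```

After this, the header in `cleaned` carries only the hash, so a tool that treats the whole header line as an ID would see the same Feature ID as the table.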