Closed thermokarst closed 1 year ago
This issue is probably due to the use of the --relabel_keep option when running vsearch --derep_fulllength with relabelling. It adds the old label after the new label.
Simply removing the --relabel_keep flag causes the following exception when attempting to update ids in the biom table:
Mapping not provided for observation identifier: L2S204_3773. If this identifier should not be updated, pass strict=False.
Passing strict=False unsurprisingly causes the ids to not show up in the final biom table. Can we maybe not use the SHA at all and only use the sample_id somehow?
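A rough sketch of why dropping the flag could lead to that exception, assuming the relabelled headers look like the one quoted in this issue (new SHA1 ID first, old label second) and that the old-to-new mapping is recovered by parsing those headers. The helper name `build_id_map` and the placeholder sequence are hypothetical, and this is not necessarily how the plugin builds its mapping.

```python
def build_id_map(fasta_lines):
    """Map each old label (description field) to its new ID (first token).

    Assumes headers of the form '>new_id old_label'; headers without a
    description contribute nothing to the mapping.
    """
    id_map = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        new_id, _, old_label = line[1:].strip().partition(" ")
        if old_label:
            id_map[old_label] = new_id
    return id_map


# Header as produced with --relabel_keep (placeholder sequence below it).
with_keep = [">9142dd139a96f63aba52d4e88bdcb803981e3467 L2S204_3773", "ACGT"]
# Same record without the old label kept in the header.
without_keep = [">9142dd139a96f63aba52d4e88bdcb803981e3467", "ACGT"]

print(build_id_map(with_keep))
print(build_id_map(without_keep))
```

With the old label gone from the header, the mapping comes back empty, so an identifier such as L2S204_3773 has nothing to map to, which would be consistent with the exception above.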
I've been playing with this issue during some downtime today, and it doesn't look like a satisfactory solution will come from just playing with what we pass to vsearch. Unfortunately, I'm getting a stunted traceback and no reported log file from the errors being caused by messing with the vsearch command, but I'll open a PR once I have some idea of what a solution will look like.
My understanding of this was incorrect. When I first wrote this issue I wasn't aware that some/many FASTA dialects support a description field, so I assumed that vsearch was somehow mutating the feature IDs in the FASTA file, but that is not the case. I actually don't think we should remove these description fields, since many tools that consume FASTA do know what to do with them. I think the originating Forum post that brought this up encountered a tool that isn't aware of the description field: fasttree. Fasttree appears to just copy the entire header line into the phylogeny's tips, as though the whole line was an ID. I poked around, and don't have a great solution. Perhaps a new method for stripping the description fields would work? Either way, I'm closing this issue; I don't think this is something that needs to be fixed.
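For anyone who hits this with a tool that is not aware of description fields, a minimal sketch of what stripping them could look like, assuming the ID is everything in the header up to the first whitespace. The function name `strip_descriptions` is hypothetical and not an existing method in the plugin, and the sequence in the example is a placeholder.

```python
def strip_descriptions(fasta_lines):
    """Yield FASTA lines with each header reduced to its ID.

    Assumes the ID is the header text up to the first whitespace;
    sequence lines pass through unchanged (minus the trailing newline).
    """
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split(None, 1)  # [id] or [id, description]
            yield ">" + (parts[0] if parts else "")
        else:
            yield line.rstrip("\n")


records = [
    ">9142dd139a96f63aba52d4e88bdcb803981e3467 L2S204_3773\n",
    "ACGT\n",  # placeholder sequence
]
for out_line in strip_descriptions(records):
    print(out_line)
```

Running the output of something like this through fasttree should leave only the SHA1 IDs on the tips, though I have not tested that end to end.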
Bug Description
vsearch dereplicate-sequences adds Sample ID information to the Feature IDs of the FeatureData[Sequence] output, but not the Feature IDs of the FeatureTable[Frequency] output. This causes problems downstream when attempting to utilize Actions that require both as inputs: the Feature IDs no longer match.
Steps to reproduce the behavior
dereplicate-sequences on any input FeatureData[Sequence]
>9142dd139a96f63aba52d4e88bdcb803981e3467 L2S204_3773
Expected behavior
Feature IDs should be consistent in both outputs. Probably no need to include the Sample ID info that is being tacked on.