qiime2 / q2-dada2

QIIME 2 plugin wrapping DADA2
BSD 3-Clause "New" or "Revised" License

dada2 justConcatenate #129

Open ARW-UBT opened 4 years ago

ARW-UBT commented 4 years ago

Bug Description: I have recently installed an Illumina iSeq-100 benchtop sequencer in my lab, and it was announced already in 2018 that a 2x 250 bp sequencing cartridge would be available soon. Well, it is now 2020, and no new kit has appeared.

Since I can currently produce only 2x 150 bp reads, the wonderful joining option in dada2 cannot be applied to 16S V3V4 regions. However, the justConcatenate option in standalone dada2 (R) could help here.
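For readers unfamiliar with the option: in standalone dada2, `mergePairs(..., justConcatenate=TRUE)` joins the forward read and the reverse-complemented reverse read with a spacer of 10 Ns instead of merging them on their overlap. The Python sketch below only illustrates that idea on toy strings; the function names and sequences are hypothetical, and it is not DADA2 or q2-dada2 code.

```python
# Toy illustration of "just concatenate"; names and reads are made up.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def just_concatenate(forward: str, reverse: str, spacer: str = "N" * 10) -> str:
    """Join forward read + spacer + reverse-complemented reverse read."""
    return forward + spacer + reverse_complement(reverse)


forward_read = "ACGTACGTAA"
reverse_read = "TTGGCCAATT"  # as sequenced, i.e. read from the opposite strand
joined = just_concatenate(forward_read, reverse_read)
print(joined)
```

The spacer marks where the unsequenced middle of the amplicon would be, so the joined sequence is not a contiguous stretch of the real amplicon.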

Questions: Actually, the justConcatenate option is not built into the q2 plugin. If I concatenate outside the q2 workflow, the great provenance chain in q2 will be interrupted. Question to the plugin developers: would it be possible to add the justConcatenate option to q2/dada2? Or do you have any suggestion for how to use the justConcatenate data in q2?

Comments: @benjjneb Thank you for directing me to this forum and for your comment that it might be possible with the existing dada2 plugin.

benjjneb commented 4 years ago

Q for the Q2 folks: Is there a concatenation option already implemented in one of the Q2 plugins that could be used prior to q2-dada2?

ARW-UBT commented 4 years ago

Hi Ben, did you receive any response to your post? I cannot see anything on GitHub, but maybe you were contacted through other channels?

If not, do you see any chance to enable the '--just-concatenate' option in the q2 plugin locally (e.g., in my own local installation)?

Best regards Alfons


nbokulich commented 4 years ago

Sorry @benjjneb and @ARW-UBT, your questions came mid-release, so I think they got lost in everyone's pile.

> Is there a concatenation option already implemented in one of the Q2 plugins that could be used prior to q2-dada2?

No, not currently. Just concatenating causes issues all down the line, with phylogeny, taxonomic classification, etc. Nor have we received many (maybe 2?) requests for this feature. So I think exposing this option in q2-dada2 is probably a low priority, but I am curious what others think.

> Actually, the justConcatenate option is not built into the q2 plugin. If I concatenate outside the q2 workflow, the great provenance chain in q2 will be interrupted.

Here is a forum topic that describes a similar question, in which I've given steps for modifying your local branch of q2-dada2. This will allow you to expose or adjust options that are not available in the current release version, preserving the provenance chain.

RobJamesRamos commented 7 months ago

Just to add one more voice to the currently small chorus: we use justConcatenate in an AMF LSU pipeline that uses our own downstream processing. It would greatly simplify our pipeline to have this flag exposed in QIIME: https://link.springer.com/article/10.1007/s00572-022-01068-3.

nbokulich commented 1 month ago

Hi @benjjneb, I am warming up to the idea of exposing justConcatenate in q2-dada2, as I have now had some time to think about how we could handle concatenated ASVs in QIIME 2 to avoid issues with taxonomy classification etc. So I would like to pick up this conversation.

One thing that continues to trouble me is that justConcatenate will concatenate everything, including reads that do have overlap. This could mess up phylogeny, taxonomy, etc. This should be less of an issue with amplicons with low length heterogeneity (e.g., 16S), provided that users use it responsibly. However, it would be a common issue with length-variable regions like ITS. This is one reason why I have been against this for many years: I am opposed to the "just" part in justConcatenate.
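To make the concern concrete, here is a small Python sketch (toy amplicon, hypothetical names, not DADA2 code) of what happens when reads that do overlap are concatenated anyway: the overlapping region ends up in the result twice, once on each side of the spacer.

```python
# Toy example: a 20 bp amplicon sequenced with 15 bp reads, so the reads
# overlap by 10 bp. The sequence and read length are invented for illustration.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


amplicon = "AACCGGTTACGTACGTTTGG"
read_len = 15

forward_read = amplicon[:read_len]
reverse_read = reverse_complement(amplicon[-read_len:])

# Concatenate regardless of overlap, with a 10 N spacer.
concatenated = forward_read + "N" * 10 + reverse_complement(reverse_read)

# The region covered by both reads.
overlap = amplicon[len(amplicon) - read_len:read_len]

print(len(amplicon), len(concatenated))
print(amplicon.count(overlap), concatenated.count(overlap))
```

The concatenated sequence is twice as long as the true amplicon and carries the shared region on both sides of the spacer, which is the kind of artifact that could mislead alignment-based steps downstream.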

For this reason I think that it would be useful to expose an option to merge and then concatenate reads that fail to merge (because of lack of overlap; still rejecting reads that have partial overlap with mismatches in the overlap region). Reviewing various issues in the dada2 issue tracker I see that you are concerned about biases that could be introduced by having a mix of merged and concatenated reads, and I acknowledge this, but in some cases this may be less of a bias than, e.g., when users use merged ASVs in a hypervariable region and hence systematically lose longer amplicons. So it all boils down to users needing to exercise some responsibility in their analysis (which is already the case).
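As a rough sketch of the merge-then-concatenate idea described above (an illustration only, not DADA2's algorithm and not an existing q2-dada2 option): try an ungapped merge first, concatenate only when nothing resembling an overlap is found, and reject pairs whose overlap contains mismatches. The function name, the `near_match` heuristic for deciding that a pair "looks overlapping", and all threshold values are assumptions made for this example; indels are not handled.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_or_concatenate(
    forward: str,
    reverse: str,
    min_overlap: int = 12,     # illustrative threshold
    max_mismatch: int = 0,     # illustrative threshold
    near_match: float = 0.8,   # hypothetical heuristic, see lead-in
    spacer: str = "N" * 10,
) -> Tuple[str, str]:
    """Return (status, sequence); status is merged, concatenated or rejected."""
    rc = reverse_complement(reverse)
    looks_overlapping = False
    # Try ungapped overlaps from longest to shortest.
    for olap in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(forward[-olap:], rc[:olap]))
        if mismatches <= max_mismatch:
            return "merged", forward + rc[olap:]
        if 1 - mismatches / olap >= near_match:
            # Mostly matching, so probably a real overlap with errors.
            looks_overlapping = True
    if looks_overlapping:
        return "rejected", ""
    return "concatenated", forward + spacer + rc


amplicon = "ACGGTCATTGCAAGCTTAGGCTAACCGTGATCGA"  # 34 bp toy amplicon

# 24 bp reads overlap by 14 bp and merge cleanly.
merged = merge_or_concatenate(amplicon[:24], reverse_complement(amplicon[-24:]))

# Same reads, but with one substitution inside the overlap of the reverse read.
mutated = amplicon[:15] + "A" + amplicon[16:]
rejected = merge_or_concatenate(amplicon[:24], reverse_complement(mutated[-24:]))

# 14 bp reads do not reach each other, so they are concatenated.
concatenated = merge_or_concatenate(amplicon[:14], reverse_complement(amplicon[-14:]))

print(merged[0], rejected[0], concatenated[0])
```

The resulting set would mix merged and concatenated sequences, which is exactly the potential bias mentioned above, so the status label is returned alongside each sequence to keep the two kinds distinguishable.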

If we feel that such an output should have restricted uses downstream, one option would be to introduce a new type for concatenated (+merged) ASVs. This would limit the downstream analyses that users could perform, though this might be overly restrictive so we might consider this a last resort.

How would you feel about implementing a merge+concatenate option in q2-dada2? I see from https://github.com/benjjneb/dada2/issues/279 that doing a merge + concatenate is simple; excluding reads with unacceptable mismatches and indels would take some more work, but maybe this is something that you have already worked on further?

I found this benchmark that looked at merging vs. concat vs. both vs. single-read only: https://link.springer.com/article/10.1186/s12859-021-04410-2

It shows marginal improvement with "both", though it looks like this is done prior to passing to dada2, if I understand Fig. 1 correctly.

RobJamesRamos commented 1 month ago

For what it's worth, the merge+concatenate would be a good compromise for our use case. It would be nice to have both options, including "justConcatenate", so that we can be sure that all reads are processed the same way, but I'm coming from an LSU mindset where reads are very unlikely to overlap. I totally understand the use case for using both merge and concatenate for more variable regions like the ITS. All in all, if only a merge+concatenate option was implemented I think our pipeline would switch to using it.