qiita-spots / qp-deblur


Add parameter set to keep singletons #62

Closed ElDeveloper closed 3 years ago

ElDeveloper commented 3 years ago

Do you think it would make sense to add a new parameter set that keeps singletons around? As best I can tell, this would be achieved by changing the "Minimum per-sample read threshold" parameter. This would be useful for internal meta-analyses where singletons across multiple preps/studies might be meaningful.

This might be a ticket for qiita itself or an e-mail to qiita-help, but I figured I would start here.

antgonza commented 3 years ago

If you go to https://qiita.ucsd.edu/workflows/ and check the current parameters for deblur, we have:

[Screenshot: the current deblur parameter set in Qiita]

Do you mean changing the 2 to a 1? If yes, it would be good to discuss this in the qiita.admin (not help) forum - could you send an email there with all the details? Thank you.

wasade commented 3 years ago

It may be useful to consult with Amnon about this. Singletons which are not proximal to a cluster by hamming distance are likely (though not assured) to be errors.


ElDeveloper commented 3 years ago

@amnona do you have any input on @wasade's point? We are wondering if it would make sense to keep per-sample singletons.

antgonza commented 3 years ago

@ElDeveloper, just to get some context: what's the idea behind this request? Perhaps you could clarify why going from 2 reads per sample to 1 would help.

amnona commented 3 years ago

Hi, the main reason we drop singletons is that with singletons, the discrete (as opposed to continuous) nature of read counts can create a statistical problem.

For example, say we have a real sequence S with 5 reads, and say we have a 0.1 probability of hamming-1 read errors (i.e. 10% of the 150bp reads are expected to contain one read error). Using the deblur algorithm, we then subtract 0.1 from the read counts of all hamming-1 neighbors of S, as this is what would be expected in the continuous case. However, since reads are discrete, most hamming-1 sequences will have 0 reads and a few will have 1, and since deblur only subtracts the expected number (0.1), those hamming-1 read errors would still be present (at 0.9) and therefore will not be cleaned. So keeping the singletons may result in keeping a large number of singleton noise reads.

You could think of changing deblur so that it does not throw away singletons, but instead rounds down the number of reads after running the algorithm (so 0.9 becomes 0). I don't know if it's worth the trouble, as it would require testing on some datasets to verify that it actually works and does not introduce new unexpected problems. Additionally, I think that in most cases sequences with 1 read do not have an important biological meaning (and if they do, the experiment does not have enough reads per sample, since while you see some sequences with 1 read, you miss lots of others...).
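To make the discrete-counts issue concrete, here is a toy sketch in Python. This is not the actual deblur implementation; the function name, the neighbor labels, and the `round_down` option are hypothetical, and the numbers follow the example above (error probability 0.1, a few hamming-1 neighbors of S observed exactly once).

```python
def subtract_expected_errors(neighbor_counts, expected=0.1, round_down=False):
    """Toy model of deblur-style error subtraction (NOT the real deblur code).

    neighbor_counts: dict mapping a hamming-1 neighbor of the true
    sequence S to its observed (integer) read count.

    With continuous counts, subtracting the expected error contribution
    (0.1) would remove the errors. With discrete counts, a neighbor
    observed once keeps 0.9 reads and survives, unless we also round
    the result down to the nearest integer.
    """
    cleaned = {}
    for seq, count in neighbor_counts.items():
        remaining = count - expected
        if round_down:
            remaining = int(remaining)  # 0.9 -> 0: the singleton error is dropped
        if remaining > 0:
            cleaned[seq] = remaining
    return cleaned


# Most hamming-1 neighbors of S were observed 0 times; a few exactly once.
observed = {"neighbor_A": 1, "neighbor_B": 1, "neighbor_C": 0}

print(subtract_expected_errors(observed))                   # singletons survive at 0.9
print(subtract_expected_errors(observed, round_down=True))  # rounding removes them
```

The first call illustrates why keeping singletons also keeps error reads (each surviving neighbor still has 0.9 reads); the second shows the rounding-down variant discussed above, which removes them but would need validation on real datasets.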

What do you think? Can you elaborate on the reason to keep the singletons?

ElDeveloper commented 3 years ago

Thanks @amnona, this is super helpful. So while we might be able to keep some real singletons, we are also likely to keep noisy reads, and changing this would require further validation. Makes sense.

The original motivation for this request came in the context of a large meta-analysis of >10 studies that were generated at the same sequencing facility but over the course of several months. The main idea was to check for sequences that within a single study might be considered "singletons or noisy reads" but that, when combined across multiple studies, could carry a meaningful signal.

I'll close this issue since it sounds like this isn't so much a ticket for Qiita as it is a ticket for deblur, or additional research work to be done on deblur.

Thanks everyone!