shendurelab / MIPGEN

One stop MIP design and analysis
Other
22 stars 19 forks source link

Inconsistent results using the mipgen_smmip_collapser.py #27

Closed DNAbased closed 7 years ago

DNAbased commented 7 years ago

Hello, we are currently working on establishing an analysis pipeline for our smMIPs sequencing data. The current setup (mainly based on the pipeline builder) is as follows:

However, doing some test analyses (multiple pipeline runs on the same samples), we noticed that the called variants were not consistent, i.e. the number of called variants, as well as the quality values for the individual variants differed. Breaking the pipeline down into its abovementioned components, we were able to trace this inconsistency to the smMIPs collapser, which seemingly generated different output for each run. Trying to obtain consistent results over several runs, we tinkered with the different options, leading to the identification of the "-C" (confidence level) option as our "suspected culprit". Our combined knowledge of python is, unfortunately, not enough to thorougly understand the genome_sam_collapser, called upon by the smMIP collapser. However, doing some test runs with different "-C" values (i.e. 0.0, 0.9, and 1.0) showed us that running the script with "-C 1" several times led to the same results with every run, while the other values did not [EDIT: this should have been "-C 0", which indeed led to the reproducible results!]. The info for "-C" using "python mipgen_smmip_collapser.py -h" did not really clarify how this confidence level plays into the analysis. From our understanding, a high confidence (such as the default of "-C 0.9" or the maximum of "-C 1") should lead to the "best" (i.e. most confident) collapsed smMIPs. Seeing that the results differ, that is apparently not the case. Would it be possible for you to explain how the smMIPs are collapsed (based on which parameters)? Is it a permutation-based approach, as indicated by the "high confidence leads to more random sampling" in the help file? If so, would reproduction of our results still be possible, seeing that they differ from run to run?

We will be highly grateful for any help you can provide on this matter.

augustboyle commented 7 years ago

Hello,

I meant to respond sooner but was delayed! I apologize.

The confidence score is equal to one minus the probability that a molecular tag collision will be inappropriately collapsed. It is very naive and needs to be tuned to your application and MIP set. If the probability is too high than MIPgen will take a base call at random to represent a UMI group.If it is sufficiently low than it will take the most frequent. At 1 it will always take a random one, which is unbiased. At 0 it will always take consensus. That said, if there is a tie in the most frequent base call, it will choose between the two at random, although this is unlikely because q scores are taken into account.

I am surprised that 1 is more reproducible than 0, but it does depend on your UMI/molecular tag and whether there is a strong nucleotide bias.

Because the selection in case of ties is random it is impossible to be completely deterministic. There are definitely sophisticated ways to improve this behavior and make it consistent but I have not pursued that direction. You can add a seed to make the genotype the same for a strictly equivalent input, but aside from that there are no simple ways to get the same output for a given site without biasing the base calls.

Evan

On July 21, 2017 at 6:10:19 AM, DNAbased (notifications@github.com) wrote:

Hello, we are currently working on establishing an analysis pipeline for our smMIPs sequencing data. The current setup (mainly based on the pipeline builder) is as follows:

However, doing some test analyses (multiple pipeline runs on the same samples), we noticed that the called variants were not consistent, i.e. the number of called variants, as well as the quality values for the individual variants differed. Breaking the pipeline down into its abovementioned components, we were able to trace this inconsistency to the smMIPs collapser, which seemingly generated different output for each run. Trying to obtain consistent results over several runs, we tinkered with the different options, leading to the identification of the "-C" (confidence level) option as our "suspected culprit". Our combined knowledge of python is, unfortunately, not enough to thorougly understand the genome_sam_collapser, called upon by the smMIP collapser. However, doing some test runs with different "-C" values (i.e. 0.0, 0.9, and 1.0) showed us that running the script with "-C 1" several times led to the same results with every run, while the other values did not. The info for "-C" using "python mipgen_smmip_collapser.py -h" did not really clarify how this confidence level plays into the analysis. From our understanding, a high confidence (such as the default of "-C 0.9" or the maximum of "-C 1") should lead to the "best" (i.e. most confident) collapsed smMIPs. Seeing that the results differ, that is apparently not the case. Would it be possible for you to explain how the smMIPs are collapsed (based on which parameters)? Is it a permutation-based approach, as indicated by the "high confidence leads to more random sampling" in the help file? If so, would reproduction of our results still be possible, seeing that they differ from run to run?

We will be highly grateful for any help you can provide on this matter.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/shendurelab/MIPGEN/issues/27, or mute the thread https://github.com/notifications/unsubscribe-auth/AF01p5KPjsrFgMiB_VjUR7p4RkltgfvWks5sQKM7gaJpZM4OfYcj .

DNAbased commented 7 years ago

Thank you for your reply and for your explanation. Regarding your surprise about the reproducibility: That was a mistake in my initial comment; I have edited it, indicating that it was indeed "-C 0" that led to the reproducible results.