lehner-lab / DiMSum

An error model and pipeline for analyzing deep mutational scanning (DMS) data and diagnosing common experimental pathologies
MIT License

Is it possible to have input replicates in the same experiment? #5

Closed CharlesJB closed 2 years ago

CharlesJB commented 2 years ago

Hello,

I am currently trying to create the exp_design file for an analysis where 3 transformations were performed:

I'm trying to find a way to produce the exp_design in a way that will pass all the checks but I can't seem to make it work, especially for the untreated samples.

I tried to define the 3 untreated samples for the 2nd transformation and the 2 untreated samples from the 3rd transformation as technical replicates but it throws an error at this check.

If I understand correctly, untreated samples must absolutely be in their own experiment?

If that is the case, should I combine the untreated sample replicates from the same experiment into a single sample if I want to make full use of the 3 treated samples in the 2nd transformation and the 2 treated samples from the 3rd transformation?

Thank you for your help!

andrefaure commented 2 years ago

Hi @CharlesJB,

If I understand correctly, I think the best approach would be to simply include the additional samples from the same transformation as technical replicates. Internally, the read counts for different technical replicates of the same experiment replicate (transformation in your case) are simply summed.

Adapting the exp_design template here, your file should then look something like this:

sample_name  experiment_replicate  selection_id  selection_replicate  technical_replicate  pair1                        pair2
input1       1                     0                                  1                    Input_Rep1_t1_read1.fastq    Input_Rep1_t1_read2.fastq
input2       2                     0                                  1                    Input_Rep2_t1_read1.fastq    Input_Rep2_t1_read2.fastq
input2       2                     0                                  2                    Input_Rep2_t2_read1.fastq    Input_Rep2_t2_read2.fastq
input2       2                     0                                  3                    Input_Rep2_t3_read1.fastq    Input_Rep2_t3_read2.fastq
input3       3                     0                                  1                    Input_Rep3_t1_read1.fastq    Input_Rep3_t1_read2.fastq
input3       3                     0                                  2                    Input_Rep3_t2_read1.fastq    Input_Rep3_t2_read2.fastq
output1A     1                     1             1                    1                    Output1_Rep1_t1_read1.fastq  Output1_Rep1_t1_read2.fastq
output2A     2                     1             1                    1                    Output1_Rep2_t1_read1.fastq  Output1_Rep2_t1_read2.fastq
output2A     2                     1             1                    2                    Output1_Rep2_t2_read1.fastq  Output1_Rep2_t2_read2.fastq
output2A     2                     1             1                    3                    Output1_Rep2_t3_read1.fastq  Output1_Rep2_t3_read2.fastq
output3A     3                     1             1                    1                    Output1_Rep3_t1_read1.fastq  Output1_Rep3_t1_read2.fastq
output3A     3                     1             1                    2                    Output1_Rep3_t2_read1.fastq  Output1_Rep3_t2_read2.fastq

Hope this helps!

CharlesJB commented 2 years ago

Hello,

Thank you for the quick reply!

If I understand your answer, I should try the following exp_design:

> exp_design
   sample_name experiment_replicate selection_id selection_replicate
1       input1                    1            0                  NA
2       input2                    2            0                  NA
3       input2                    2            0                  NA
4       input2                    2            0                  NA
5       input3                    3            0                  NA
6       input3                    3            0                  NA
7     output1A                    1            1                   1
8     output2A                    2            1                   1
9     output2A                    2            1                   1
10    output2A                    2            1                   1
11    output3A                    3            1                   1
12    output3A                    3            1                   1
   technical_replicate                       pair1                       pair2
1                    1   Input_Rep1_t1_read1.fastq   Input_Rep1_t1_read2.fastq
2                    1   Input_Rep2_t1_read1.fastq   Input_Rep2_t1_read2.fastq
3                    2   Input_Rep2_t2_read1.fastq   Input_Rep2_t2_read2.fastq
4                    3   Input_Rep2_t3_read1.fastq   Input_Rep2_t3_read2.fastq
5                    1   Input_Rep3_t1_read1.fastq   Input_Rep3_t1_read2.fastq
6                    2   Input_Rep3_t2_read1.fastq   Input_Rep3_t2_read2.fastq
7                    1 Output1_Rep1_t1_read1.fastq Output1_Rep1_t1_read2.fastq
8                    1 Output1_Rep2_t1_read1.fastq Output1_Rep2_t1_read2.fastq
9                    2 Output1_Rep2_t2_read1.fastq Output1_Rep2_t2_read2.fastq
10                   3 Output1_Rep2_t3_read1.fastq Output1_Rep2_t3_read2.fastq
11                   1 Output1_Rep3_t1_read1.fastq Output1_Rep3_t1_read2.fastq
12                   2 Output1_Rep3_t2_read1.fastq Output1_Rep3_t2_read2.fastq

But it seems to be triggering this check.

If I try directly in R:

> ### Duplicate matrix row checks
> #Check for duplicated rows in the following columns: "experiment_replicate", "selection_id", "selection_replicate"
> if(sum(duplicated(exp_design[,c("experiment_replicate", "selection_id", "selection_replicate")]))!=0){
+   stop(paste0("One or more duplicated rows in experimentDesign file matrix (sample rows should be unique)"), call. = FALSE)
+ }
Error: One or more duplicated rows in experimentDesign file matrix (sample rows should be unique)

More precisely:

> exp_design[,c("experiment_replicate", "selection_id", "selection_replicate")]
   experiment_replicate selection_id selection_replicate
1                     1            0                  NA
2                     2            0                  NA
3                     2            0                  NA
4                     2            0                  NA
5                     3            0                  NA
6                     3            0                  NA
7                     1            1                   1
8                     2            1                   1
9                     2            1                   1
10                    2            1                   1
11                    3            1                   1
12                    3            1                   1
> sum(duplicated(exp_design[,c("experiment_replicate", "selection_id", "selection_replicate")]))
[1] 6
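
To see which rows trip the check, the key columns can be tabulated. This is only a sketch: it rebuilds the three key columns from the design above (the other columns are omitted) and counts how many rows share each combination. The `key` and `offending` names are mine, not part of DiMSum.

```r
# Key columns from the exp_design above (other columns omitted for brevity)
key_cols <- c("experiment_replicate", "selection_id", "selection_replicate")
exp_design <- data.frame(
  experiment_replicate = c(1, 2, 2, 2, 3, 3, 1, 2, 2, 2, 3, 3),
  selection_id         = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  selection_replicate  = c(NA, NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, 1)
)

# Same test as the check quoted above: counts every row that repeats an earlier key
n_dup <- sum(duplicated(exp_design[, key_cols]))

# Build one string per row; paste() turns NA into "NA", so input rows group too
key <- do.call(paste, c(exp_design[, key_cols], sep = "_"))
rows_per_key <- table(key)

# Key combinations that occur on more than one row
offending <- names(rows_per_key)[rows_per_key > 1]

print(n_dup)
print(offending)
```

The offending combinations are exactly the ones where several technical replicates share one experiment replicate, for both the input and the output samples.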

But if the technical replicates are merged anyway, it might be simpler to merge them before starting the analysis. It will make the exp_design easier to manage.

I am not familiar enough with the DiMSum algorithm, so I'm probably missing something, but wouldn't it be interesting to be able to use replicates of the input of a transformation in the statistical analysis?

andrefaure commented 2 years ago

Hi @CharlesJB,

I see you are running DiMSum on your own custom count file then.

In this case, yes, you simply need to sum the counts for the technical replicates beforehand.
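
A minimal sketch of that summing step, assuming a count table with one row per variant, an `nt_seq` column holding the sequence, and one count column per sequenced sample. The column names (`input2_t1` and so on), the `tech_reps` mapping and the toy counts are made up for illustration, so please check them against the format DiMSum expects for custom count files.

```r
# Toy variant count table: one row per variant, one column per sequenced sample.
# Column names and counts are hypothetical.
counts <- data.frame(
  nt_seq    = c("ACG", "ACT", "AGG"),
  input2_t1 = c(100, 20, 5),
  input2_t2 = c(90, 25, 7),
  input2_t3 = c(110, 15, 3),
  stringsAsFactors = FALSE
)

# Map each merged sample name to the technical-replicate columns it combines
tech_reps <- list(input2 = c("input2_t1", "input2_t2", "input2_t3"))

# Sum counts across technical replicates, variant by variant
merged <- data.frame(nt_seq = counts$nt_seq, stringsAsFactors = FALSE)
for (sample_name in names(tech_reps)) {
  rep_cols <- tech_reps[[sample_name]]
  merged[[sample_name]] <- rowSums(counts[, rep_cols, drop = FALSE])
}

print(merged)
```

The merged table then has a single column per experiment replicate, which matches an experiment design file with one row per sample.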

Alternatively, if you suspect that there is substantial variation between samples from the same transformation, you can compare the above results to those from a DiMSum run using an experiment design file where all 6 samples are separate biological replicates (i.e. 1-6), without summing counts for "technical replicates".

Let me know if you have any other doubts.

CharlesJB commented 2 years ago

Hello @andrefaure

By biological replicates, do you mean the experiment_replicate column in the exp_design?

For instance in my experiment, there are 3 transformations and for 2 of them we made multiple independent knockouts. This means that the input2 in the 2nd line of the design you sent me and the output2A in the 8th are matching in terms of knockouts while the input2 in the 2nd and 3rd line of the design are from different knockouts.

Would it make sense to have the following design in this case?


   sample_name experiment_replicate selection_id selection_replicate
1       input1                    1            0                  NA
2       input2                    2            0                  NA
3       input3                    3            0                  NA
4       input4                    4            0                  NA
5       input5                    5            0                  NA
6       input6                    6            0                  NA
7      output1                    1            1                   1
8      output2                    2            1                   1
9      output3                    3            1                   1
10     output4                    4            1                   1
11     output5                    5            1                   1
12     output6                    6            1                   1
   technical_replicate                       pair1                       pair2
1                   NA   Input_Rep1_t1_read1.fastq   Input_Rep1_t1_read2.fastq
2                   NA   Input_Rep2_t1_read1.fastq   Input_Rep2_t1_read2.fastq
3                   NA   Input_Rep2_t2_read1.fastq   Input_Rep2_t2_read2.fastq
4                   NA   Input_Rep2_t3_read1.fastq   Input_Rep2_t3_read2.fastq
5                   NA   Input_Rep3_t1_read1.fastq   Input_Rep3_t1_read2.fastq
6                   NA   Input_Rep3_t2_read1.fastq   Input_Rep3_t2_read2.fastq
7                   NA Output1_Rep1_t1_read1.fastq Output1_Rep1_t1_read2.fastq
8                   NA Output1_Rep2_t1_read1.fastq Output1_Rep2_t1_read2.fastq
9                   NA Output1_Rep2_t2_read1.fastq Output1_Rep2_t2_read2.fastq
10                  NA Output1_Rep2_t3_read1.fastq Output1_Rep2_t3_read2.fastq
11                  NA Output1_Rep3_t1_read1.fastq Output1_Rep3_t1_read2.fastq
12                  NA Output1_Rep3_t2_read1.fastq Output1_Rep3_t2_read2.fastq
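
This layout can be run through the same duplicate-row test before starting DiMSum. The sketch below rebuilds only the sample names and the three key columns of the design above (the file columns are omitted), with each sample given its own experiment_replicate:

```r
# Key columns of the 6-replicate design above (pair1/pair2 omitted for brevity)
key_cols <- c("experiment_replicate", "selection_id", "selection_replicate")
exp_design <- data.frame(
  sample_name          = c(paste0("input", 1:6), paste0("output", 1:6)),
  experiment_replicate = rep(1:6, times = 2),
  selection_id         = rep(c(0, 1), each = 6),
  selection_replicate  = rep(c(NA, 1), each = 6),
  stringsAsFactors = FALSE
)

# Same test as the check quoted earlier: counts rows that repeat an earlier key
n_dup <- sum(duplicated(exp_design[, key_cols]))
print(n_dup)  # 0, so every key combination is unique
```
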

andrefaure commented 2 years ago

Yes exactly - that is what I meant.

This way, the DiMSum results will show how the 6 experiments cluster in terms of their read counts and fitness scores within transformations (different knockouts), and whether these correlations are comparable to those between different transformations.
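
If you want to do the same comparison by hand, something like the following sketch could work. It assumes you have collected per-experiment fitness scores into a matrix with one column per experiment; the `fitness` values are toy numbers, and the `transformation` labels follow the grouping in your design (experiment 1 from the 1st transformation, 2-4 from the 2nd, 5-6 from the 3rd). None of these object names come from DiMSum itself.

```r
# Hypothetical fitness scores: rows are variants, columns are the 6 experiments
fitness <- cbind(
  exp1 = c(-1.2,  0.1, 0.4, -0.3, 0.9),
  exp2 = c(-0.9,  0.3, 0.2, -0.5, 1.1),
  exp3 = c(-1.0,  0.2, 0.3, -0.4, 1.0),
  exp4 = c(-0.8,  0.4, 0.1, -0.6, 1.2),
  exp5 = c(-1.3,  0.0, 0.6, -0.2, 0.7),
  exp6 = c(-1.1, -0.1, 0.5, -0.1, 0.8)
)

# Transformation each experiment came from
transformation <- c(exp1 = 1, exp2 = 2, exp3 = 2, exp4 = 2, exp5 = 3, exp6 = 3)

# Pairwise Pearson correlations between experiments
cor_mat <- cor(fitness, use = "pairwise.complete.obs")

# Use each pair once (upper triangle) and split by same/different transformation
pair_mask <- upper.tri(cor_mat)
same_tf <- outer(transformation, transformation, "==")
within_cor  <- cor_mat[pair_mask & same_tf]
between_cor <- cor_mat[pair_mask & !same_tf]

print(mean(within_cor))
print(mean(between_cor))
```

If the within-transformation correlations are about as high as the between-transformation ones, treating the knockouts as separate replicates is unlikely to change much.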

CharlesJB commented 2 years ago

Thank you for your help!