Backcrossed population - Githubissues

nirwan1265 commented 11 months ago

EDIT: I understand the problem now. I have some selfing in the population which will cause some level of homozygosity for the alternate allele and hence this will not work. [Keeping this issue on for anyone else]

Hello, Data description: I have backcross population of BC2S4s. for test, i am using 3 samples and 3 chromosomes.

I am running the function using nstates=2,

myres = RTIGER(expDesign = expDesign,
               outputdir = getwd(),
               seqlengths = chr_len,
               rigidity = 100,
               autotune = TRUE,
               nstates = 2 ,
               max.iter = 25,
               average_coverage = 0.8,
               save.results = TRUE)

i get the following error:

Loading data and generating RTIGER object.
Using 2 states for fitting.
[1] "Loading file:  /Users/nirwan/Desktop/Bzea_allelecount/filtered/PN7_SID600.allelicCounts.txt"
[1] "Loading file:  /Users/nirwan/Desktop/Bzea_allelecount/filtered/PN8_SID679.allelicCounts.txt"
[1] "Loading file:  /Users/nirwan/Desktop/Bzea_allelecount/filtered/PN9_SID781.allelicCounts.txt"

Fitting the parameters and Viterbi decoding. 
post processing value is: TRUE 
R value autotune is: TRUE 
Fitting an RTIGER object with the R provided by the user.
Optimizing the R parameter.
Error in params[["paraBetaAlpha"]][state, 1] : subscript out of bounds

It did work for nstates=3 however,

> myres2 = RTIGER(expDesign = expDesign,
+                outputdir = getwd(),
+                seqlengths = chr_len,
+                rigidity = 100,
+                autotune = TRUE,
+                max.iter = 25,
+                average_coverage = 0.8,
+                save.results = TRUE)
Loading data and generating RTIGER object.
Using 3 states for fitting.
[1] "Loading file:  /Users/nirwan/Desktop/Bzea_allelecount/filtered/PN7_SID600.allelicCounts.txt"
[1] "Loading file:  /Users/nirwan/Desktop/Bzea_allelecount/filtered/PN8_SID679.allelicCounts.txt"
[1] "Loading file:  /Users/nirwan/Desktop/Bzea_allelecount/filtered/PN9_SID781.allelicCounts.txt"

Fitting the parameters and Viterbi decoding. 
post processing value is: TRUE 
R value autotune is: TRUE 
Fitting an RTIGER object with the R provided by the user.
Optimizing the R parameter.
Best R value: 512 
Number of iterations run:  19 

Plotting samples Genotypes.
PLotting CO number per chromosome. 
Creating bed and IGV output formats.
Plotting goodness of fit.

Any help is appreciated.

rfael0cm commented 10 months ago

Hi nirwan, did you manage to fix it? Best, Rafael

nirwan1265 commented 10 months ago

Hi Rafael,

Sorry for the late reply. I was doing some changes with SNP calling protocol. After much filtering, using GATK protocol using different parameters for prior distribution and heterozygosity. I was able to get 30k, snps on average. avg_markers_allchrom

Since our population is backcrossed and selfed for 3-4 generations, we expect 86% homozygous ref, 3% heterozygous and 11% homozygous alternate. We have an average sequencing depth of 0.8 for our population. I used autotune on, seq depth for 0.8, and cross over value of 0.01 (calculated from previous data, about 3 crossovers per chromosome). and this is the result I got. This is one of the samples: GenotypePlot_SID1328.pdf

We do see only heterozygotes, but I think we can assume they are introgressed regions. I have a couple of questions: 1.) Is the number of alternate calls enough? 2.) can we take into account selfing generations in the backcross model? 3.) can we specify the expected introgression from teosintes so the program can use it as a prior?

rfael0cm commented 10 months ago

Hi Nirwan,

your plot shows that there is a majority of reference genotype all along your chromosomes, except chromosome 10 which is moslty heterozyigous. But, chromosomes 9, 8 and 7 are composed uniquely from the reference genotype. Looking at your data set I wonder why you do not see that 11% of the homozygous alternate. From a visual inspection, it looks like there are not much position with homozygous alternate counts at all (allele frequency ratio rarely achieves -100). While the reference ratio is close to 100 quite often. I can think about two possible problems:

Positions that represent your alternate genotype are not well represented in your set of genomic positions.
There are some problem building the input file for RTIGER.

Have you tried the method we suggest in our README file? Could you check these possibilities and get back to me? Could you tell me which is the R value that RTIGER has estimated?

Regarding your question 2 and 3: We have not added these options to the model because we only use the data to estimate the crossovers. This means that given a good density of the data points and counts, information on the selfing and introgression shouldn't be necessary. Our EM algorithm should be able to estimate the right underlying genotype without any problems.

Thank you a lot, Rafael

rfael0cm / RTIGER

Backcrossed population #12