mmollina / MAPpoly

Genetic maps in autopolyploids
24 stars 9 forks source link

Using MAPpoly for oats #43

Closed VianeiRother closed 2 years ago

VianeiRother commented 3 years ago

Hi Mollina I´m planning to use MAPpoly for an oats dataset. I found the following tutorial. https://mmollina.github.io/tutorials/hexa_fake/haxaploid_map_construction.html#introduction I sow some new updates in some commands, and my question is if this tutorial should still be followed in its entirety. I have 177000 markers on my dataset. Should I start with just a few of them to see if it works, or can I start with everythink? (I am working with my personal computer)

Thanks

gabrielgesteira commented 3 years ago

Hello Vianei,

Thanks for your interest in using MAPpoly. As you noticed, that tutorial is outdated and should not be followed entirely. Please use the autotetraploid tutorial instead, as it is up to date and contains all the commands you will need to build the map. We are working on an updated version of the hexaploid tutorial, which will be available soon. Regarding your dataset: Is it an allohexaploid (AADDCC) species? If the answer is yes, the expected segregation pattern allows you to consider it as a diploid species during the analysis (you can do so by reading the dataset with the argument ploidy=2). This is far less computationally demanding than an autopolyploid-based analysis. Also, a considerable number of markers may be removed after the initial filtering steps, reducing the required resources. In this case, your computer may do the job depending on the amount of RAM available. You can start the analysis with the whole dataset and follow all the steps up to the two-point analysis: this will be the most demanding step (est_pairwise_rf function), so you can try to reduce the ncpus argument if you face any issues while running it. You can do the same for all functions that support parallel computation - they will take longer to run, but use less computational resources overall.

Please let us know if you were able to load the dataset and start the analysis properly. Feel free to contact any questions or problems you may have during the process.

Best regards,

VianeiRother commented 3 years ago

Hello Gabriel First, thank you for the quick update. Also, my species is an allohexaploid (AADDCC). I will send you below the commands I used until now. Already in the "read_geno" command, I got an error. Is there any other command to load the file before I call for read_geno?

install.packages("devtools") library(mappoly) dat.dose.csv <- read_geno_csv(file.in = "dados_R.csv", ploidy = 2) Error in [.data.frame(dat, , 2) : undefined columns selected

I was working before with the following commands. Please, see if it is right and if it gives me the same results.

data = read.csv("dados_teste.csv", h=T) data2 = data[which(!is.na(data$P1) & !is.na(data$P2)),] write.csv(data2, file="dados_teste_fix.csv", quote=F, row.names=F) my_data = read_geno_csv("dados_teste_fix.csv", ploidy=2)

May I send you at least a fraction of my data for you to see if it´s properly arranged and to test all the commands? It may be the fastest way to get to the final.

Cheers

gabrielgesteira commented 3 years ago

Hello Vianei,

That's correct, the set of read_geno functions are meant to be the first ones. The second batch of commands was just a workaround for an issue with parental missing data in CSV files, which was already fixed in the development version of MAPpoly. If you reinstall MAPpoly from Github, the first command should do the job. Please try reinstalling MAPpoly from Github and running it again:

install.packages("devtools")
devtools::install_github("mmollina/mappoly", dependencies=TRUE)
library(mappoly)
dat.dose.csv <- read_geno_csv(file.in = "dados_R.csv", ploidy = 2)

If the issue persists, you can send me a fraction of your data then I will try to figure out what is going on. In this case, I would kindly ask you to run sessionInfo() and paste the output here.

Best regards,

VianeiRother commented 3 years ago

data_vianei.txt Hello Gabriel I I reinstalled MAPpoly, as you ask, and tried the commands again, but it didn´t work either. I got the folowing error: Error in read_geno_csv(file.in = "dados_r.csv", ploidy = 2) : unable to find function "read_geno_csv".

So if you don't mind, could you please try to run it, to see if there is something with my file? Also, if it is possible, could you look if the next commands do work on my files? I am sending you a fraction of the data in txt, about 20.000 markers, and the first 100 genotypes. Let me know if the file is correct. I copied and paste it on txt file, without any adjustments.

Find the session Info below:

sessionInfo() R version 4.0.5 (2021-03-31) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale: [1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252 [4] LC_NUMERIC=C LC_TIME=Portuguese_Brazil.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] rstudioapi_0.13 magrittr_2.0.1 usethis_2.0.1 devtools_2.4.1 pkgload_1.2.1
[6] R6_2.5.0 rlang_0.4.10 fastmap_1.1.0 tools_4.0.5 pkgbuild_1.2.0
[11] sessioninfo_1.1.1 cli_2.5.0 withr_2.4.2 ellipsis_0.3.1 remotes_2.3.0
[16] rprojroot_2.0.2 lifecycle_1.0.0 crayon_1.4.1 processx_3.5.1 purrr_0.3.4
[21] callr_3.7.0 fs_1.5.0 ps_1.6.0 testthat_3.0.2 curl_4.3
[26] memoise_2.0.0 glue_1.4.2 cachem_1.0.4 compiler_4.0.5 desc_1.3.0
[31] prettyunits_1.1.1

Cheers data_vianei.txt

gabrielgesteira commented 3 years ago

Hello Vianei,

It seems that the MAPpoly package was not loaded before the read_geno_csv function. Please make sure you have successfully installed and loaded the package before running any command. Also, this function expects values separated by commas, but your file has spaces instead. You can fix your file by running:

data = read.csv("data_vianei.csv", h=T, sep="")
write.csv(data, file="data_vianei_fix.csv", quote=F, row.names=F)
my_data = read_geno_csv("data_vianei_fix.csv", ploidy=2)

After fixing your dataset, I was able to follow all commands in the tutorial.

Please let us know if you were able to fix your file, load the package and run the analysis.

Best regards,

VianeiRother commented 3 years ago

Hello Gabriel Seems that the previous issue was overcome. Now the read_geno command gives me the following error: Error in dat[, 1] : incorrect number of dimensions.

Is this still related to my file? Or what does it mean?

Cheers

VianeiRother commented 3 years ago

Hello Gabriel I think I miss understood you on your last replay. I am working with csv files. I just sent you a txt file because I could just attach these kinds of files here. Anyway, I checked again my file here and I saw that there were some points between the genotypes. After removing them, I was able to get through the read.geno command.

I think that it´s OK now. Cheers

gabrielgesteira commented 3 years ago

Hello Vianei,

That's great! Now you will be able to follow the tutorial and build the maps. Thanks again for your interest in MAPpoly, and feel free to contact us anytime.

Best regards,

VianeiRother commented 3 years ago

Hello Gabriel

I had to reduce the size of the file to be able to follow the procedure, but it´s going through now. Question although. In the command "all.rf.pairwise$pairwise$900-4000", looking for recombination fraction between a particular pair of markers, I got the result of "Nule". And the next command "plot(all.rf.pairwise, first.mrk = 900, second.mrk = 4000)" I got an error: The requested combination of markers is not included in the two-point object. This means that I just haven´t recombination between these two markers or is there something wrong with my commands?

Cheers

gabrielgesteira commented 3 years ago

Hello Vianei,

This means that at least one of the markers you are trying to see is not present in the sequence. You can check which markers are present in the sequence by running all.rf.pairwise$seq.num. Then, change 900-4000 for any other combination between the available markers, and you will see their recombination fraction.

Best regards,

VianeiRother commented 3 years ago

Hello Gabriel I am having some trouble building the LOD Score matrices. Could it have something to do with the fact that I am working with just 5000 markers? I got the following erros:

id<-get_genomic_order(seq.init) Error in get_genomic_order(seq.init) : No sequence or sequence position information found. s.o <- make_seq_mappoly(id) Error in make_seq_mappoly(id) : id is not an object of class 'mappoly.data', 'mappoly.map', 'mappoly.chitest.seq', 'mappoly.unique.seq', 'mappoly.pcmap', 'mappoly.pcmap3d', 'mappoly.geno.ord', or 'mappoly.group' plot(mat, ord = s.o$seq.mrk.names, fact = 5) Error in plot.mappoly.rf.matrix(mat, ord = s.o$seq.mrk.names, fact = 5) : 's.o' object not found

VianeiRother commented 3 years ago

Those are my commands so far:

library(mappoly) data = read.csv("dados_r.csv", h=T, sep="") write.csv(data, file="dados_r_fix.csv", quote=F, row.names=F) my_data = read_geno_csv("dados_r_fix.csv", ploidy=2) print(my_data, detailed = TRUE) plot(my_data) plot_mrk_info(input.data = my_data, mrk = 500) print_mrk(my_data, mrks = 'solcap_snp_c2_1419') dat <- filter_missing(input.data = my_data, type = "marker", filter.thres = 0.05, inter = TRUE) dat <- filter_missing(input.data = my_data, type = "individual", filter.thres = 0.1, inter = TRUE) print(dat) pval.bonf <- 0.05/dat$n.mrk mrks.chi.filt <- filter_segregation(my_data, chisq.pval.thres = pval.bonf, inter = TRUE) seq.init <- make_seq_mappoly(mrks.chi.filt) plot(seq.init) n.cores = parallel::detectCores() - 1 all.rf.pairwise <- est_pairwise_rf(input.seq = seq.init, ncpus = n.cores) all.rf.pairwise all.rf.pairwise$seq.num all.rf.pairwise$pairwise$66-822 plot(all.rf.pairwise, first.mrk = 66, second.mrk = 822)

mat <- rf_list_to_matrix(input.twopt = all.rf.pairwise) id<-get_genomic_order(seq.init) s.o <- make_seq_mappoly(id) plot(mat, ord = s.o$seq.mrk.names, fact = 5)

gabrielgesteira commented 3 years ago

Hello Vianei,

This error is occurring because your dataset doesn't have sequence position information, so you shouldn't use the get_genomic_order function. As the function fails to find this information, everything after it will fail too. You can see the recombination and LOD matrices using the input order by changing the last commands:

mat <- rf_list_to_matrix(input.twopt = all.rf.pairwise)
plot(mat)

If the input order doesn't make sense, you can use the MDS algorithm to reorder the markers:

mat <- rf_list_to_matrix(input.twopt = all.rf.pairwise)
ord <- mds_mappoly(mat)
s.o <- make_seq_mappoly(ord)
plot(mat, ord = s.o$seq.mrk.names)

Best regards,

VianeiRother commented 3 years ago

Hi Gabriel

Right, it worked. In that case, I should probably take away the command: temp1 <- make_seq_mappoly(grs, j, genomic.info = 1), in: z <- as.numeric(colnames(grs$seq.vs.grouped.snp)[1:21]) LGS<-vector("list", 21) for(j in 1:21){ temp1 <- make_seq_mappoly(grs, j, genomic.info = 1) tpt <- make_pairs_mappoly(all.rf.pairwise, input.seq = temp1) temp2 <- rf_snp_filter(input.twopt = tpt, diagnostic.plot = FALSE) lgtemp <- get_genomic_order(temp2) LGS[[z[j]]] <- list(seq = make_seq_mappoly(lgtemp), tpt = tpt, ch = z[j]) } Is that correct?

VianeiRother commented 3 years ago

Looking again, I think I have to leave all of this out. Right?

gabrielgesteira commented 3 years ago

Hi Vianei,

You just need to modify the example to use the UPGMA groups and the input order. You can drop off the first command, the genomic.info argument and the last two functions inside the loop. This should look like:

LGS<-vector("list", 21)
for(j in 1:21){
    temp1 <- make_seq_mappoly(grs, j)
    tpt <- make_pairs_mappoly(all.rf.pairwise, input.seq = temp1)
    temp2 <- rf_snp_filter(input.twopt = tpt, diagnostic.plot = FALSE)
    LGS[[j]] <- list(seq = temp2, tpt = tpt)
} 

Best regards,

VianeiRother commented 3 years ago

Hello Gabriel

I got the following error message from that command: Error in abs(x[2, 1]) : non-numeric argument to mathematical function

But the previous command: grs <- group_mappoly(input.mat = mat, expected.groups = 21, comp.mat = TRUE, inter = TRUE) is giving me some Warning message: In group_mappoly(input.mat = mat, expected.groups = 21, comp.mat = TRUE, : There is no physical reference to generate a comparison matrix

I suppose it´s because I don´t have the genomic reference. But to be sure, is the warning message related to the second error?

gabrielgesteira commented 3 years ago

Hello Vianei,

You are right, the warning message is happening due to the absence of genomic reference information. You can suppress it by dropping the comp.mat = TRUE argument. However, it is probably not related to the second error. I have seen this error before, and the reason was that one of the groups had only one marker (you can check it by looking at the output of grs). In this case, you can try to adjust the filtering thresholds and/or the number of expected groups in the group_mappoly function.

Best regards,

VianeiRother commented 3 years ago

Hello Gabriel I carried on changing the filtering thresholds trying to get more markers for the groups. I changed the, Filtering dataset by marker, argument up to 0.5: dat <- filter_missing(input.data = my_data, type = "marker", filter.thres = 0.5, inter = TRUE)

It helped to increase the number. But the segregation test, with the following argument is dropping the numbers again. pval.bonf <- 0.05/dat$n.mrk mrks.chi.filt <- filter_segregation(my_data, chisq.pval.thres = pval.bonf, inter = TRUE)

After this command, my number of markers drop from up to 15000 to 516. I even tried to go from 0.0001 to 0.5, and it didn't change much.

What could be the cause of this drastic filtering of markers on this command? Also, when I got 800 markers at the most, I got 600 for the first group and the rest divided for the other groups: group n.mrk 1 689 2 14 3 8 4 4 5 6 6 4 7 13 8 21 9 8 10 1 11 1 12 4 13 5 14 5 15 1 16 2 17 5 18 4 19 5 20 2 21 1 Is this related to the fact that I took just the first markers of my dada set, or shouldn't it be related to it? Rplot01

gabrielgesteira commented 3 years ago

Hello Vianei,

Are you still using a subset of the markers? If so, this might be the reason for almost all of them being grouped together. Since many markers end up being filtered out, you can try to load the whole dataset, filter them and see if this pattern persists. If you cannot load the whole dataset at once, try to randomly sample markers in the dataset instead of getting the first ones, so you can pick up markers from different chromosomes. Regarding the drastic reduction of markers in the segregation test, I suppose it might be related to the way genotypes were estimated. Assuming the population derived from heterozygous parents, your dataset could have markers presenting any combination between 0,1 and 2 parental genotypes, with different expected segregation for each one of them. However, when using only 0 and 1 genotypes, there is a restriction where only 0-1 and 1-0 genotype combinations between parents can be used to build the map, which would be similar to a pseudo-testcross (assuming the reduction does not collapse genotypic classes). In this case, the 0-0 combination is not informative and therefore eliminated, and the 1-1 combination might be eliminated too due to segregation distortion, since the dataset doesn't contain any individual with genotype 2. If you have the sequence data or the read depths available, it might be a good idea to try to re-estimate the genotypes and check if more markers are retained after the segregation test.

Best regards,

VianeiRother commented 3 years ago

Hello Gabriel About my dada set, I thoughtI it too, and I tock random markers from begining to end. But the results where not much diferent. Now I got 700 markers after the segregation test, iniciating in 60000 markers. Rplot01 group n.mrk 1 13 2 296 3 21 4 246 5 12 6 7 7 13 8 16 9 8 10 8 11 30 12 5 13 5 14 5 15 1 16 2 17 11 18 2 19 2 20 1 21 1 About my parents, they are 100% homozygous. But having a quick look over my data set, they are not that polymorphic as you would spect from a biparental population parents. They are actually contrasting for Beta-glucan content. So I suppose that could be a problem, the lack of polymorphism. Also, I am working with 21 linkage groups, I could start to play with this number too, to see if I got different results.

Cheers

qyannie commented 3 years ago

Hello Gabriel

I´m planning to use MAPpoly for construct genetic map. I got the following error message from that command:

z <- as.numeric(colnames(grs$seq.vs.grouped.snp)[1:11]) LGS<-vector("list", 11) for(j in 1:11){ temp1 <- make_seq_mappoly(grs, j) tpt <- make_pairs_mappoly(all.rf.pairwise, input.seq = temp1) temp2 <- rf_snp_filter(input.twopt = tpt, diagnostic.plot = FALSE) lgtemp <- get_genomic_order(temp2) LGS[[z[j]]] <- list(seq = make_seq_mappoly(lgtemp), tpt = tpt, ch = z[j]) } is giving me some Warning message: Error in LGS[[z[j]]] <- list(seq = make_seq_mappoly(lgtemp), tpt = tpt, : attempt to select more than one element in integerOneIndex

I don't known why.Is it a command error?

Best wishes

gabrielgesteira commented 3 years ago

Dear Yan,

Thank you for your interest in using MAPpoly. I see that some users are facing issues when running the example code from the tutorial. Please try running the following loop instead:

LGS<-vector("list", 11)
for(j in 1:11){
    temp1 <- make_seq_mappoly(grs, j, genomic.info = 1)
    tpt <- make_pairs_mappoly(all.rf.pairwise, input.seq = temp1)
    temp2 <- rf_snp_filter(input.twopt = tpt, diagnostic.plot = FALSE)
    lgtemp <- get_genomic_order(temp2)
    LGS[[j]] <- list(seq = make_seq_mappoly(lgtemp), tpt = tpt)
}

This basically drops the z and ch objects, so it avoids issues in the list indexing when the reference sequence names are not numbers.

Please let us know if this fixes the issue for your dataset.

Best regards,

VianeiRother commented 3 years ago

Hello Gabriel

As I wrote before, I´m ending with a really low number of markers. And some linkage groups end up with just one. So I´m considering using an imputation command to improve the number of viable markers. Could the a.mat() from rrBRUL do this job? Or is there a command already doing this?

Cheers

gabrielgesteira commented 3 years ago

Hello @VianeiRother,

Sorry for taking that long to get back to you. We don't recommend imputating the dataset to build linkage maps, since the HMM chain already does that during the mapping process (it uses more information to estimate genotypes when compared to common imputation methods). I agree that the lack of polymorphism may be a problem, but still think the issue comes from the way genotypes were estimated in your population. I would recommend trying to re-estimate the genotypes using a different approach, then checking if more markers are retained after the segregation test. You can find software suggestions to perform that in the Related Software section. Please let us know if that helped to solve the issue with your dataset.

Best regards,