meyer-lab-cshl / plinkQC

R package for quality control of plink genetic datasets
Other
55 stars 28 forks source link

Unknown error using the evaluate_check_ancestry function #28

Closed MichaelCummingsBCH closed 4 years ago

MichaelCummingsBCH commented 4 years ago

Describe the bug I have an unusual error using the plinkQC package that does not have any parallel I can see in the documentation. I am trying to use the evaluate_check_ancestry function. I am running R 4.0.2, the most recently released version. None of the variables in the error is anywhere in my code, and I believe the error involves something internal to the package. Here’s the code:

To Reproduce library(plinkQC)

indir <- getwd() name <- 'all_combined_with_hapmap' prefixMergedDataset <- 'all_pca'

ancestry_check <- evaluate_check_ancestry(indir=indir, name=name, prefixMergedDataset=prefixMergedDataset, refSamplesFile=paste(indir, "/HapMap_ID2Pop.txt", sep=""), refColorsFile=paste(indir, "/HapMap_PopColors.txt", sep=""), interactive=TRUE)

Expected behavior I expected to use the evaluate_check_ancestry function to provide similar results to the command on page 4 here: https://cran.rstudio.com/web/packages/plinkQC/vignettes/AncestryCheck.pdf

Error messages Error in seq_len(n) : argument must be coercible to non-negative integer In addition: Warning messages: 1: In max(all_european$euclid_dist) : no non-missing arguments to max; returning -Inf 2: Removed 1 rows containing non-finite values (stat_circle). 3: In max(f) : no non-missing arguments to max; returning -Inf

Please complete the following information):

R version 4.0.2 (2020-06-22) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS High Sierra 10.13.6

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

Random number generation: RNG: Mersenne-Twister Normal: Inversion Sample: Rounding

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] plinkQC_0.3.0

loaded via a namespace (and not attached): [1] Rcpp_1.0.4.6 rstudioapi_0.11 magrittr_1.5 MASS_7.3-51.6 getopt_1.20.3 tidyselect_1.1.0 munsell_0.5.0 cowplot_1.0.0
[9] colorspace_1.4-1 R6_2.4.1 rlang_0.4.6 plyr_1.8.6 dplyr_1.0.0 tools_4.0.2 grid_4.0.2 ggforce_0.3.2
[17] data.table_1.12.8 gtable_0.3.0 ellipsis_0.3.1 tibble_3.0.1 optparse_1.6.6 lifecycle_0.2.0 crayon_1.3.4 gridExtra_2.3
[25] farver_2.0.3 tweenr_1.0.1 purrr_0.3.4 ggplot2_3.3.2 vctrs_0.3.1 glue_1.4.1 polyclip_1.10-0 UpSetR_1.4.0
[33] compiler_4.0.2 pillar_1.4.4 generics_0.0.2 scales_1.1.1 pkgconfig_2.0.3

Additional context I emailed about this before posting the information here. The data is private to the lab; we can't share it.

HannahVMeyer commented 4 years ago

Thanks for providing a detailed bug report!

From the naming of your variables, I assume you used hapmap as the reference dataset? Could you share the names of the reference populations?

MichaelCummingsBCH commented 4 years ago

I used all of the reference populations, using their three-letter names as referred to here: https://www.sanger.ac.uk/resources/downloads/human/hapmap3.html

HannahVMeyer commented 4 years ago

From the error messages I suspect the function detects no European reference samples. I have recently worked on a patch for an unrelated issue, during which I added some additional functionality to the ancestry_filter. Would you mind installing the development version via:

library(devtools)
install_github("meyer-lab-cshl/plinkQC", ref="dev" )

In there, you can manually specify the name of the populations - did you keep them upper case/lower case? With this functionality it should tell us if this error arises due to naming issues (which seems unlikely, but worth a try).

MichaelCummingsBCH commented 4 years ago

What is the new process for manually specifiying the names in the development version? I assume this is not present in the documentation on CRAN for the older version. I apologize since this is a rather simple question but I couldn't find the information on github.

HannahVMeyer commented 4 years ago

Once you installed the development branch version as indicated above, you can use the standard help function with:

?evaluate_check_ancestry

And you will see that there is a new parameter refPopulations which let’s you specify the abbreviations of the populations you want to use as the European reference samples.

MichaelCummingsBCH commented 4 years ago

Ok, I added the parameter, but under the conditions of both

refPopulation = c("CEU", "TSI"), and also just refPopulation = "CEU"

I receive an identical error.

MichaelCummingsBCH commented 4 years ago

Actually, hold on, I see a disrepancy in my data that I need to fix. I will get back if the error persists.

HannahVMeyer commented 4 years ago

I can you run the function in debug mode?

debug(evaluate_check_ancestry)

And then run the function as you did before. You will then be able to run it step-by-step. Once it gets to the Euclidean distance, can you get str of the following objects: all_european, euro_pc1_mean, euro_pc2_mean and show here?

Thanks

MichaelCummingsBCH commented 4 years ago

I don't think this will work - I just discovered that my hapmap files I used for the merge before this step contained errors. I have to look at them first. Thank you for your help, I was able to discover what was causing the problem.

MichaelCummingsBCH commented 4 years ago

Here is the error I get with debugging, for the record. I am actually going to leave this open for now since I still have the error, but I need to talk to some people on my end first.

Error in (function (srcref) : unimplemented type (29) in 'eval' Error: no more error handlers available (recursive errors?); invoking 'abort' restart Error: INTEGER() can only be applied to a 'integer', not a 'unknown type #29' In addition: Warning message: type 29 is unimplemented in 'type2char'

HannahVMeyer commented 4 years ago

Thanks for re-opening - whatever the issue with the data is, it would be good for the function to catch it and return an appropriate error message.

MichaelCummingsBCH commented 4 years ago

Okay, I fixed the problem I had found outside of this, which actually ended up being not that major, but after that the same error recurs.

Here's the error, it's the same as before:

Error in seq_len(n) : argument must be coercible to non-negative integer In addition: Warning messages: 1: In max(all_european$euclid_dist) : no non-missing arguments to max; returning -Inf 2: Removed 1 rows containing non-finite values (stat_circle). 3: In max(f) : no non-missing arguments to max; returning -Inf

Here is the initial result of the debug(evaluate_check_ancestry), also same as before

Error in (function (srcref) : unimplemented type (29) in 'eval' Error: no more error handlers available (recursive errors?); invoking 'abort' restart Error: INTEGER() can only be applied to a 'integer', not a 'unknown type #29' In addition: Warning message: type 29 is unimplemented in 'type2char'

I went into debug mode step-by-step.

euro_pc1_mean and euro_pc2_mean both have value NaN. all_european does appear in my environment; it has 25 variables, but they all appear to be empty. And using > View(all_european) gets the following error:

View(all_european) Error in View : object 'all_european' not found

I saw these warning messages debugging:

Warning message: select_() is deprecated as of dplyr 0.7.0. Please use select() instead.

Warning message: filter_() is deprecated as of dplyr 0.7.0. Please use filter() instead. See vignette('programming') for more help

HannahVMeyer commented 4 years ago

The error messages you provided indicate to me that there is a type error in the data provided.

Error: INTEGER() can only be applied to a 'integer', not a 'unknown type #29'

When you do the step-by-step debug, after you ran the line

 all_european <- dplyr::filter_(data_all, ~Pop %in% c("CEU", "TSI"))

what does str(all_european) look like?

I know you said the genotype data cannot be shared, but could you share the eigenvec file? I could have a look at my end and see if I can spot anything in the data that might be causing it or why the function cannot deal with it.

Are you using Rstudio? If so, which version? the Error in (function (srcref) : unimplemented type (29) in 'eval' part of your error has been described by other people here.

MichaelCummingsBCH commented 4 years ago

I have the .eigenvec attached here:

all_pca.eigenvec.zip

I looked into that error description. That one occurs before the debugging. When I updated my Rstudio, this line

Error in (function (srcref) : unimplemented type (29) in 'eval'

no longer appears at the beginning of the debug. The actual function outside of debug mode has the same error behavior, but the line doesn't get in the way of debug anymore. The errors I find during debug are still present as well.

I'm now using refPopulation = "CEU", I can specify refPopulation with the new instruction, as you requested earlier. There are no TSI in my file. I know earlier I told you that I was using all of the populations. That was what I fully thought and had been told. But when I had to change some things behind the scenes I discovered that my reference file was actually using only four: YRI, CEU, CHB, and JPT. So I can't use TSI.

The debugging line that the process sends to me looks like this now: debug: alleuropean <- dplyr::filter(data_all, ~Pop %in% refPopulation)

refPopulation shows up as what I set it as: Browse[2]> str(refPopulation) chr "CEU"

As for str(all_european), it is empty:

'data.frame': 0 obs. of 24 variables: $ IID : chr $ FID : chr $ PC1 : num $ PC2 : num $ PC3 : num $ PC4 : num $ PC5 : num $ PC6 : num $ PC7 : num $ PC8 : num $ PC9 : num $ PC10 : num $ PC11 : num $ PC12 : num $ PC13 : num $ PC14 : num $ PC15 : num $ PC16 : num $ PC17 : num $ PC18 : num $ PC19 : num $ PC20 : num $ Pop : Factor w/ 1 level "all_combined_with_hapmap": $ Color: chr

HannahVMeyer commented 4 years ago

Thanks. Can you also send me these files:

refSamplesFile=paste(indir, "/HapMap_ID2Pop.txt", sep=""),
refColorsFile=paste(indir, "/HapMap_PopColors.txt", sep=""),

?

all_european (and the data.frame all_data that it originates) from should have the population codes in the Pop column, but here it looks like that column only had one entry, namely "all_combined_with_hapmap". As this is not the same as CEU the all_european data.frame is empty. I think we are getting to the bottom of this. If you could send me the files above and I'll run your command I think we'll find it soon.

MichaelCummingsBCH commented 4 years ago

Here they are (I changed the names of them to make sure I don't confuse them with your example files.)

hapmap_pop_specified.txt HapMap_colors_specified.txt

For reference, here's the new command I've been running the past few times:

ancestry_check <- evaluate_check_ancestry(indir=indir, name=name, prefixMergedDataset=prefixMergedDataset, refSamplesFile=paste(indir, "/hapmap_pop_specified.txt", sep=""), refColorsFile=paste(indir, "/HapMap_colors_specified.txt", sep=""), refPopulation = "CEU", interactive=TRUE)

HannahVMeyer commented 4 years ago

This is the plot of your four reference populations and the specified colors.

image

This file all_pca.eigenvec only contains data from the reference populations (270 samples) but not any other data. Did you send me a smaller subset or is this the dataset you have been trying to analyse?

HannahVMeyer commented 4 years ago

I've just added checks and tests for cases with no non-reference samples in the .eigenvec file. Can you restart your R session and install the latest plinkQC version from the dev branch

library(devtools)
install_github("meyer-lab-cshl/plinkQC", ref="dev" )

and see if you now get a more useful error message?

MichaelCummingsBCH commented 4 years ago

I thought those files included the experimental data. Right now I am going straight back to when those files were created to see if anything was inadvertently left out.

MichaelCummingsBCH commented 4 years ago

Ok, I have the new version of the .eigenvec here.

all_pca.eigenvec.zip

Now, it still produced the same error as before.

Then, I installed the new plinkQC version. This produced a new error:

Error in evaluate_check_ancestry(indir = indir, name = name, prefixMergedDalitaset = prefixMergedDataset, : unused argument (prefixMergedDalitaset = prefixMergedDataset)

I think the new error is what we wanted, so this is good.

HannahVMeyer commented 4 years ago

I think there is just a typo in the command you are running:

unused argument (prefixMergedDalitaset = prefixMergedDataset)

you are supplying prefixMergedDalitasetinstead ofprefixMergedDataset`. Can you see what you get when you fix the typo?

MichaelCummingsBCH commented 4 years ago

Okay, now it's the same error again. I must have introduced that typo very recently.

Error in seq_len(n) : argument must be coercible to non-negative integer In addition: Warning messages: 1: In max(all_european$euclid_dist) : no non-missing arguments to max; returning -Inf 2: Removed 1 rows containing non-finite values (stat_circle). 3: In max(f) : no non-missing arguments to max; returning -Inf

HannahVMeyer commented 4 years ago

Are you sure you re-installed the package?

I managed to run it with the new eigenvec file that you provided: image

I have just updated the version number of in the dev branch, in case that hindered the proper updating. Give the download, loading into R and re-running another go.

MichaelCummingsBCH commented 4 years ago

How are you able to run it without the binary files?

MichaelCummingsBCH commented 4 years ago

Could you share your full script with me?

HannahVMeyer commented 4 years ago

it only needs the .fam file - so I just greped the non-hapmap IDs from the .eigenvec file and wrote those to the .fam file.

MichaelCummingsBCH commented 4 years ago

Could you send me your .R file and your .fam file? I updated again, nothing changed. I think I may have a mistake in my code, but I need to run your confirmed working version to be sure.

HannahVMeyer commented 4 years ago

I am running the command that you pasted at the very beginning of this thread:

indir <- "~/Downloads"
name <- 'all_combined_with_hapmap'
prefixMergedDataset <- 'all_pca'

ancestry_check <- evaluate_check_ancestry(indir=indir, name=name,
                                          prefixMergedDataset=prefixMergedDataset,
                                          refSamplesFile=paste(indir, "/hapmap_pop_specified.txt", sep=""),
                                          refColorsFile=paste(indir, "/HapMap_colors_specified.txt", sep=""),
                                          refPopulation = "CEU",
                                          interactive=TRUE)

The samples file I created is really simply based on your .eigenvec file file (gzipped here, so I could upload):

all_combined_with_hapmap.fam.gz

Can you show the new output of sessionInfo()? I wonder if your devtools install worked.

MichaelCummingsBCH commented 4 years ago

I was able to get the image I wanted!

Rplot

By replacing the .fam, it worked. I don't know why my .fam wasn't working, even though it was the binary file that the .eigenvec was created from. I am looking closer right now and I think PLINK may have dome something to some of the individuals, removing or altering them.

However, this is what I wanted. I already know how to get those two columns using grep.

Thank you so much for your help! I would not have been able to solve the problem without this discussion.

HannahVMeyer commented 4 years ago

Your .fam file has '@' symbols in the IIDs, I wonder if PLINK struggles with that? Worth checking if removing those will solve that problem.

MichaelCummingsBCH commented 4 years ago

Yes, I will test on that; I don't want to run into this problem in the future. But for now I have this workaround.

brkicd commented 3 years ago

From the error messages I suspect the function detects no European reference samples. I have recently worked on a patch for an unrelated issue, during which I added some additional functionality to the ancestry_filter. Would you mind installing the development version via:

library(devtools)
install_github("meyer-lab-cshl/plinkQC", ref="dev" )

In there, you can manually specify the name of the populations - did you keep them upper case/lower case? With this functionality it should tell us if this error arises due to naming issues (which seems unlikely, but worth a try).

Hi sorry to reopen this, but I do have the same problem and was wondering whether these changes have been implemented in the last plinkQC version? or shall I use the ref="dev"? Also I've run my -pca on my data only (not merged with ref population) - does that make a difference? Thank you for your help, this is super useful for someone who just started this type of analysis.

rodrigoduarte88 commented 3 years ago

I was just getting the same error, when working with the 1000G Phase 3 reference panel and the super population notations that the consortium provides for the samples (AFR, AMR, EAS, EUR, SAS). I noticed that evaluate_check_ancestry needs colors assigned to CEU and TSI in refColorsFile, even if individuals are not labelled as these. But since the error continued, I renamed all EUR to CEU (in refColorsFile and refSamplesFile) and now the plot shows up!