Closed MichaelCummingsBCH closed 4 years ago
Thanks for providing a detailed bug report!
From the naming of your variables, I assume you used hapmap as the reference dataset? Could you share the names of the reference populations?
I used all of the reference populations, using their three-letter names as referred to here: https://www.sanger.ac.uk/resources/downloads/human/hapmap3.html
From the error messages I suspect the function detects no European reference samples. I have recently worked on a patch for an unrelated issue, during which I added some additional functionality to the ancestry_filter. Would you mind installing the development version via:
library(devtools)
install_github("meyer-lab-cshl/plinkQC", ref="dev" )
In there, you can manually specify the name of the populations - did you keep them upper case/lower case? With this functionality it should tell us if this error arises due to naming issues (which seems unlikely, but worth a try).
What is the new process for manually specifiying the names in the development version? I assume this is not present in the documentation on CRAN for the older version. I apologize since this is a rather simple question but I couldn't find the information on github.
Once you installed the development branch version as indicated above, you can use the standard help function with:
?evaluate_check_ancestry
And you will see that there is a new parameter refPopulations
which let’s you specify the abbreviations of the populations you want to use as the European reference samples.
Ok, I added the parameter, but under the conditions of both
refPopulation = c("CEU", "TSI"), and also just refPopulation = "CEU"
I receive an identical error.
Actually, hold on, I see a disrepancy in my data that I need to fix. I will get back if the error persists.
I can you run the function in debug mode?
debug(evaluate_check_ancestry)
And then run the function as you did before.
You will then be able to run it step-by-step.
Once it gets to the Euclidean distance, can you get str
of
the following objects:
all_european, euro_pc1_mean, euro_pc2_mean
and show here?
Thanks
I don't think this will work - I just discovered that my hapmap files I used for the merge before this step contained errors. I have to look at them first. Thank you for your help, I was able to discover what was causing the problem.
Here is the error I get with debugging, for the record. I am actually going to leave this open for now since I still have the error, but I need to talk to some people on my end first.
Error in (function (srcref) : unimplemented type (29) in 'eval' Error: no more error handlers available (recursive errors?); invoking 'abort' restart Error: INTEGER() can only be applied to a 'integer', not a 'unknown type #29' In addition: Warning message: type 29 is unimplemented in 'type2char'
Thanks for re-opening - whatever the issue with the data is, it would be good for the function to catch it and return an appropriate error message.
Okay, I fixed the problem I had found outside of this, which actually ended up being not that major, but after that the same error recurs.
Here's the error, it's the same as before:
Error in seq_len(n) : argument must be coercible to non-negative integer In addition: Warning messages: 1: In max(all_european$euclid_dist) : no non-missing arguments to max; returning -Inf 2: Removed 1 rows containing non-finite values (stat_circle). 3: In max(f) : no non-missing arguments to max; returning -Inf
Here is the initial result of the debug(evaluate_check_ancestry), also same as before
Error in (function (srcref) : unimplemented type (29) in 'eval' Error: no more error handlers available (recursive errors?); invoking 'abort' restart Error: INTEGER() can only be applied to a 'integer', not a 'unknown type #29' In addition: Warning message: type 29 is unimplemented in 'type2char'
I went into debug mode step-by-step.
euro_pc1_mean and euro_pc2_mean both have value NaN. all_european does appear in my environment; it has 25 variables, but they all appear to be empty. And using > View(all_european) gets the following error:
View(all_european) Error in View : object 'all_european' not found
I saw these warning messages debugging:
Warning message:
select_()
is deprecated as of dplyr 0.7.0.
Please use select()
instead.
Warning message:
filter_()
is deprecated as of dplyr 0.7.0.
Please use filter()
instead.
See vignette('programming') for more help
The error messages you provided indicate to me that there is a type error in the data provided.
Error: INTEGER() can only be applied to a 'integer', not a 'unknown type #29'
When you do the step-by-step debug, after you ran the line
all_european <- dplyr::filter_(data_all, ~Pop %in% c("CEU", "TSI"))
what does str(all_european)
look like?
I know you said the genotype data cannot be shared, but could you share the eigenvec file? I could have a look at my end and see if I can spot anything in the data that might be causing it or why the function cannot deal with it.
Are you using Rstudio? If so, which version? the Error in (function (srcref) : unimplemented type (29) in 'eval'
part of your error has been described by other people here.
I have the .eigenvec attached here:
I looked into that error description. That one occurs before the debugging. When I updated my Rstudio, this line
Error in (function (srcref) : unimplemented type (29) in 'eval'
no longer appears at the beginning of the debug. The actual function outside of debug mode has the same error behavior, but the line doesn't get in the way of debug anymore. The errors I find during debug are still present as well.
I'm now using refPopulation = "CEU", I can specify refPopulation with the new instruction, as you requested earlier. There are no TSI in my file. I know earlier I told you that I was using all of the populations. That was what I fully thought and had been told. But when I had to change some things behind the scenes I discovered that my reference file was actually using only four: YRI, CEU, CHB, and JPT. So I can't use TSI.
The debugging line that the process sends to me looks like this now: debug: alleuropean <- dplyr::filter(data_all, ~Pop %in% refPopulation)
refPopulation shows up as what I set it as: Browse[2]> str(refPopulation) chr "CEU"
As for str(all_european), it is empty:
'data.frame': 0 obs. of 24 variables: $ IID : chr $ FID : chr $ PC1 : num $ PC2 : num $ PC3 : num $ PC4 : num $ PC5 : num $ PC6 : num $ PC7 : num $ PC8 : num $ PC9 : num $ PC10 : num $ PC11 : num $ PC12 : num $ PC13 : num $ PC14 : num $ PC15 : num $ PC16 : num $ PC17 : num $ PC18 : num $ PC19 : num $ PC20 : num $ Pop : Factor w/ 1 level "all_combined_with_hapmap": $ Color: chr
Thanks. Can you also send me these files:
refSamplesFile=paste(indir, "/HapMap_ID2Pop.txt", sep=""),
refColorsFile=paste(indir, "/HapMap_PopColors.txt", sep=""),
?
all_european
(and the data.frame all_data
that it originates) from should have the population codes in the Pop
column, but here it looks like that column only had one entry, namely "all_combined_with_hapmap"
. As this is not the same as CEU
the all_european
data.frame is empty. I think we are getting to the bottom of this. If you could send me the files above and I'll run your command I think we'll find it soon.
Here they are (I changed the names of them to make sure I don't confuse them with your example files.)
hapmap_pop_specified.txt HapMap_colors_specified.txt
For reference, here's the new command I've been running the past few times:
ancestry_check <- evaluate_check_ancestry(indir=indir, name=name, prefixMergedDataset=prefixMergedDataset, refSamplesFile=paste(indir, "/hapmap_pop_specified.txt", sep=""), refColorsFile=paste(indir, "/HapMap_colors_specified.txt", sep=""), refPopulation = "CEU", interactive=TRUE)
This is the plot of your four reference populations and the specified colors.
This file all_pca.eigenvec
only contains data from the reference populations (270 samples) but not any other data. Did you send me a smaller subset or is this the dataset you have been trying to analyse?
I've just added checks and tests for cases with no non-reference samples in the .eigenvec file. Can you restart your R session and install the latest plinkQC version from the dev branch
library(devtools)
install_github("meyer-lab-cshl/plinkQC", ref="dev" )
and see if you now get a more useful error message?
I thought those files included the experimental data. Right now I am going straight back to when those files were created to see if anything was inadvertently left out.
Ok, I have the new version of the .eigenvec here.
Now, it still produced the same error as before.
Then, I installed the new plinkQC version. This produced a new error:
Error in evaluate_check_ancestry(indir = indir, name = name, prefixMergedDalitaset = prefixMergedDataset, : unused argument (prefixMergedDalitaset = prefixMergedDataset)
I think the new error is what we wanted, so this is good.
I think there is just a typo in the command you are running:
unused argument (prefixMergedDalitaset = prefixMergedDataset)
you are supplying prefixMergedDalitasetinstead of
prefixMergedDataset`. Can you see what you get when you fix the typo?
Okay, now it's the same error again. I must have introduced that typo very recently.
Error in seq_len(n) : argument must be coercible to non-negative integer In addition: Warning messages: 1: In max(all_european$euclid_dist) : no non-missing arguments to max; returning -Inf 2: Removed 1 rows containing non-finite values (stat_circle). 3: In max(f) : no non-missing arguments to max; returning -Inf
Are you sure you re-installed the package?
I managed to run it with the new eigenvec file that you provided:
I have just updated the version number of in the dev branch, in case that hindered the proper updating. Give the download, loading into R and re-running another go.
How are you able to run it without the binary files?
Could you share your full script with me?
it only needs the .fam file - so I just greped the non-hapmap IDs from the .eigenvec file and wrote those to the .fam file.
Could you send me your .R file and your .fam file? I updated again, nothing changed. I think I may have a mistake in my code, but I need to run your confirmed working version to be sure.
I am running the command that you pasted at the very beginning of this thread:
indir <- "~/Downloads"
name <- 'all_combined_with_hapmap'
prefixMergedDataset <- 'all_pca'
ancestry_check <- evaluate_check_ancestry(indir=indir, name=name,
prefixMergedDataset=prefixMergedDataset,
refSamplesFile=paste(indir, "/hapmap_pop_specified.txt", sep=""),
refColorsFile=paste(indir, "/HapMap_colors_specified.txt", sep=""),
refPopulation = "CEU",
interactive=TRUE)
The samples file I created is really simply based on your .eigenvec file file (gzipped here, so I could upload):
all_combined_with_hapmap.fam.gz
Can you show the new output of sessionInfo()
? I wonder if your devtools install worked.
I was able to get the image I wanted!
By replacing the .fam, it worked. I don't know why my .fam wasn't working, even though it was the binary file that the .eigenvec was created from. I am looking closer right now and I think PLINK may have dome something to some of the individuals, removing or altering them.
However, this is what I wanted. I already know how to get those two columns using grep.
Thank you so much for your help! I would not have been able to solve the problem without this discussion.
Your .fam file has '@' symbols in the IIDs, I wonder if PLINK struggles with that? Worth checking if removing those will solve that problem.
Yes, I will test on that; I don't want to run into this problem in the future. But for now I have this workaround.
From the error messages I suspect the function detects no European reference samples. I have recently worked on a patch for an unrelated issue, during which I added some additional functionality to the ancestry_filter. Would you mind installing the development version via:
library(devtools) install_github("meyer-lab-cshl/plinkQC", ref="dev" )
In there, you can manually specify the name of the populations - did you keep them upper case/lower case? With this functionality it should tell us if this error arises due to naming issues (which seems unlikely, but worth a try).
Hi sorry to reopen this, but I do have the same problem and was wondering whether these changes have been implemented in the last plinkQC version? or shall I use the ref="dev"? Also I've run my -pca on my data only (not merged with ref population) - does that make a difference? Thank you for your help, this is super useful for someone who just started this type of analysis.
I was just getting the same error, when working with the 1000G Phase 3 reference panel and the super population notations that the consortium provides for the samples (AFR, AMR, EAS, EUR, SAS). I noticed that evaluate_check_ancestry
needs colors assigned to CEU and TSI in refColorsFile
, even if individuals are not labelled as these. But since the error continued, I renamed all EUR to CEU (in refColorsFile
and refSamplesFile
) and now the plot shows up!
Describe the bug I have an unusual error using the plinkQC package that does not have any parallel I can see in the documentation. I am trying to use the evaluate_check_ancestry function. I am running R 4.0.2, the most recently released version. None of the variables in the error is anywhere in my code, and I believe the error involves something internal to the package. Here’s the code:
To Reproduce library(plinkQC)
indir <- getwd() name <- 'all_combined_with_hapmap' prefixMergedDataset <- 'all_pca'
ancestry_check <- evaluate_check_ancestry(indir=indir, name=name, prefixMergedDataset=prefixMergedDataset, refSamplesFile=paste(indir, "/HapMap_ID2Pop.txt", sep=""), refColorsFile=paste(indir, "/HapMap_PopColors.txt", sep=""), interactive=TRUE)
Expected behavior I expected to use the evaluate_check_ancestry function to provide similar results to the command on page 4 here: https://cran.rstudio.com/web/packages/plinkQC/vignettes/AncestryCheck.pdf
Error messages Error in seq_len(n) : argument must be coercible to non-negative integer In addition: Warning messages: 1: In max(all_european$euclid_dist) : no non-missing arguments to max; returning -Inf 2: Removed 1 rows containing non-finite values (stat_circle). 3: In max(f) : no non-missing arguments to max; returning -Inf
Please complete the following information):
R version 4.0.2 (2020-06-22) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS High Sierra 10.13.6
Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
Random number generation: RNG: Mersenne-Twister Normal: Inversion Sample: Rounding
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] plinkQC_0.3.0
loaded via a namespace (and not attached): [1] Rcpp_1.0.4.6 rstudioapi_0.11 magrittr_1.5 MASS_7.3-51.6 getopt_1.20.3 tidyselect_1.1.0 munsell_0.5.0 cowplot_1.0.0
[9] colorspace_1.4-1 R6_2.4.1 rlang_0.4.6 plyr_1.8.6 dplyr_1.0.0 tools_4.0.2 grid_4.0.2 ggforce_0.3.2
[17] data.table_1.12.8 gtable_0.3.0 ellipsis_0.3.1 tibble_3.0.1 optparse_1.6.6 lifecycle_0.2.0 crayon_1.3.4 gridExtra_2.3
[25] farver_2.0.3 tweenr_1.0.1 purrr_0.3.4 ggplot2_3.3.2 vctrs_0.3.1 glue_1.4.1 polyclip_1.10-0 UpSetR_1.4.0
[33] compiler_4.0.2 pillar_1.4.4 generics_0.0.2 scales_1.1.1 pkgconfig_2.0.3
Additional context I emailed about this before posting the information here. The data is private to the lab; we can't share it.