petrelharp / local_pca

Methods for examining PCA locally along the genome.
71 stars 13 forks source link

Unable to plot PCs of the corners #13

Closed jguhlin closed 4 years ago

jguhlin commented 4 years ago

Receiving this error and not sure why. Any ideas?

`Here are all pairwise plots of the first 4 PCs for each of the three corners:

layout(t(1:3)) for (i in 1:(corner.npc-1)) { for (j in (i+1):corner.npc) { for (k in 1:ncol(mds.corners)) { vectors <- matrix( corner.pca[[k]][-(1:(1+corner.npc))], ncol=corner.npc )[,c(i,j)] colnames(vectors) <- paste("PC", c(i,j)) par(mgp=c(0.7,0.7,0), mar=c(2,2,2,0)+.1) plot(vectors, pch=pop.pch[samps$population], col=pop.cols[samps$population], xaxt='n', yaxt='n' ) if (i==1 && j==2) { mtext(paste("corner",k),side=3) } } if (do.pdfs) { pdfcopy(plot.id=paste(i,j,sep="")) } } }

Error in matrix(corner.pca[[k]][-(1:(1 + corner.npc))], ncol = corner.npc): 'data' must be of a vector type, was 'NULL'`

petrelharp commented 4 years ago

Hm - well, something went wrong earlier on; my guess is that something went wrong at computing the covariance matrices for the corners, here: https://github.com/petrelharp/local_pca/blob/master/templated/summarize_run.Rmd#L206

I'm guessing that this didn't produce an html file to look at? You could (temporarily) delete this last code chunk and run it; then it should produce output with plots, and you'll probably get clues earlier on what went wrong. If it's not obvious to you, could you post the output?

jguhlin commented 4 years ago

Ah, it's RMarkdown isn't it? I can give it a go in Rstudio. It did produce an HTML file, with that error as the end of it. It wasn't obvious to me what the problem was but I'll do some digging today.

jguhlin commented 4 years ago

Here is where I'm getting the error. It looks like it is passing NA as the file to open, even though bcf.files is defined:

bcf.files [1] "/home/josephguhlin/nesi/_/misc_analyses/lostruct/local_pca/templated/data/gwas_set_numeric_maf_0.05_f_missing_0.2.bcf"

Any ideas?

`> qfun <- multi_vcf_query_fn( chrom.list=chroms, file=bcf.files, regions=reg )

corner.covmats[[k]] <- running_cov(qfun,1:nrow(reg), normalize.rows=TRUE) Taking input= as a system command ('bcftools query -f '[ %GT]\n' -r 4:42813772-44523390 NA') and a variable has been used in the expression passed to input=. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure environment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message. [E::hts_open_format] Failed to open file "NA" : No such file or directory Failed to read from NA: No such file or directory Error in colMeans(x, na.rm = TRUE) : 'x' must be an array of at least two dimensions`

jguhlin commented 4 years ago

Sorry for all the updates. I think it requires chromosomes in individual files, so I'm splitting them now and going to re-run.

jguhlin commented 4 years ago

It does work like this, although very small scaffolds/contigs have to be deleted or it won't run (this is fine).

The ordering is unusual, as the remaining chromosomes are 1-30, and it puts them in order of 1, 11, 12, 13... but the space it reserves looks like it is doing it in numerical order. Not sure if this is something easy to fix or not, I'm going to use the code to plot it manually by individual chr.

Cheers, --Joseph

image

petrelharp commented 4 years ago

Hm, interesting. What's that last big region? Is it a chromosome? That's where all the action is on this plot.

The chromosome ordering in the plotting script comes in through the order of regions.files, I think, here: https://github.com/petrelharp/local_pca/blob/master/templated/summarize_run.Rmd#L87 - so, to specify a different order, you could insert code after that line that sorts them the way you want (I've not checked, though, so not sure). I agree that plotting them separately is a good idea, to be sure what's going on.

jguhlin commented 4 years ago

I believe it is multiple chromosomes overlaid together. When run with a single BCF (which gives the error but generates this plot) it looks like a few chromosomes are giving the corners. I'll take a look at reordering it, I do have it plotting individually now though.

Cheers,

image

petrelharp commented 4 years ago

Oh, this is very interesting. I haven't seen anything where separate chromosomes show separate PC patterns like this before. (maybe one is a sex chromosome?)

If you figure out the fix to the Rmd to make those last plots with a single BCF, please let me know - I'd like to merge it in.

jguhlin commented 4 years ago

Yeah, it's very interesting. They are not the sex chromosomes, surprisingly, and are very high-confidence SNP calls (very low Mendelian inheritance errors). It is a smaller population, however. If you've got any suggestions on how to look further into this I'd be interested in hearing your thoughts.

There is patterns within the chromosomes as well. I'm going to drop these three to look at other patterns in the population.

It looks like using local kinship matrices in GWAS can really cut down on spurious p-values as well, although how to do it properly and if it works is still something I'm digging into.

petrelharp commented 4 years ago

If you've got any suggestions on how to look further into this I'd be interested in hearing your thoughts.

Well, my first thought is to look at what the PCs look like on those chromosomes, and see what that turns up? One option, I suppose, is that just linkage is so strong that there's chromosome-wide shared patterns.

It looks like using local kinship matrices in GWAS can really cut down on spurious p-values as well, although how to do it properly and if it works is still something I'm digging into.

Oh, great! Glad to hear someone is investigating this.

jguhlin commented 4 years ago

Thanks, the population I'm working with is very small with lots of structure (endangered). I've found local kinship matrices with just 1000 SNPs along a sliding window seems to help, but I feel a more formalized approach would be better. If you are interested in helping I'd definitely be open to hearing your thoughts.

As for the extremes here, they do increase with more PCs and MDS components, generating a multidimensional "star" pattern. Looking at the structure of populations along the chromosome from fastStructure output confirms something is going on with allele frequencies, so it does look like a real signal.

jguhlin commented 4 years ago

Hey, hope you are doing well. Pardon me for using a github issue as a way to communicate. Happy to send an email as well.

I converted the bulk of lostruct into python, and getting the same results (minus a few deep decimal points, due to floating point issues.. R=0.99999983. or thereabouts).

I saw l2 norm in the code, and was curious if you'd ever used it before? Or was it just exploratory?

Thanks.

petrelharp commented 4 years ago

I converted the bulk of lostruct into python

Oh, nice! I'd like to link to that here, if you've got it on github - people might like to use it!

I saw l2 norm in the code, and was curious if you'd ever used it before? Or was it just exploratory?

We compared it on a few examples and got similar results... so yes, pretty much just exploratory. I haven't got a strong reason why one would be better than the other.

jguhlin commented 4 years ago

Yeah, let me clean it up a bit. Is it ok if I call it lostruct-py? Otherwise I'll give it another name.

Cool, thanks for letting me know regarding the L2 norm. I've been digging into different GRM preparation methods, so got curious about different things for lostruct as well, but it's probably a bit too much out of my expertise.

petrelharp commented 4 years ago

Please do. I'm also happy to host it in this repo, and make you a co-maintainer, if you want - either way.

jguhlin commented 4 years ago

Take a look: https://github.com/jguhlin/lostruct-py

Let me know if you have any issues / concerns.

petrelharp commented 4 years ago

Great! It looks fantastic! I'm making some minor comments over there. It'd be nice to get some unit tests in before you release it.

petrelharp commented 4 years ago

I've put a link in the README, but calling it a "prototype" for now; bug me to change that if you feel it's ready for a different adjective.

jguhlin commented 4 years ago

Prototype sounds good.

I've added unit tests, probably not quite what you were thinking but fits the bill (for now). Open to ideas. I also created a Zenodo DOI, please double check your info: https://zenodo.org/record/3997106

I'll be adding in weights and unit tests for those soonish. Would be good to test with our 2 populations.

It's pretty stable, and I'm able to use it for looking into an endangered species population and get insight into conservation breeding.

Feel free to reach out anytime, and I'm sure I'll bug you with more questions in the future.

Cheers, --Joseph

petrelharp commented 4 years ago

The DOI checks out. I'll close this issue and move discussion over to that repo. Thanks!

It's pretty stable, and I'm able to use it for looking into an endangered species population and get insight into conservation breeding.

Oh, that's great. Looking forward to hearing about it.