Closed jguhlin closed 4 years ago
Hm - well, something went wrong earlier on; my guess is that something went wrong at computing the covariance matrices for the corners, here: https://github.com/petrelharp/local_pca/blob/master/templated/summarize_run.Rmd#L206
I'm guessing that this didn't produce an html file to look at? You could (temporarily) delete this last code chunk and run it; then it should produce output with plots, and you'll probably get clues earlier on what went wrong. If it's not obvious to you, could you post the output?
Ah, it's RMarkdown isn't it? I can give it a go in Rstudio. It did produce an HTML file, with that error as the end of it. It wasn't obvious to me what the problem was but I'll do some digging today.
Here is where I'm getting the error. It looks like it is passing NA as the file to open, even though bcf.files is defined:
bcf.files [1] "/home/josephguhlin/nesi/_/misc_analyses/lostruct/local_pca/templated/data/gwas_set_numeric_maf_0.05_f_missing_0.2.bcf"
Any ideas?
`> qfun <- multi_vcf_query_fn( chrom.list=chroms, file=bcf.files, regions=reg )
corner.covmats[[k]] <- running_cov(qfun,1:nrow(reg), normalize.rows=TRUE) Taking input= as a system command ('bcftools query -f '[ %GT]\n' -r 4:42813772-44523390 NA') and a variable has been used in the expression passed to
input=
. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure environment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message. [E::hts_open_format] Failed to open file "NA" : No such file or directory Failed to read from NA: No such file or directory Error in colMeans(x, na.rm = TRUE) : 'x' must be an array of at least two dimensions`
Sorry for all the updates. I think it requires chromosomes in individual files, so I'm splitting them now and going to re-run.
It does work like this, although very small scaffolds/contigs have to be deleted or it won't run (this is fine).
The ordering is unusual, as the remaining chromosomes are 1-30, and it puts them in order of 1, 11, 12, 13... but the space it reserves looks like it is doing it in numerical order. Not sure if this is something easy to fix or not, I'm going to use the code to plot it manually by individual chr.
Cheers, --Joseph
Hm, interesting. What's that last big region? Is it a chromosome? That's where all the action is on this plot.
The chromosome ordering in the plotting script comes in through the order of regions.files
, I think, here: https://github.com/petrelharp/local_pca/blob/master/templated/summarize_run.Rmd#L87 - so, to specify a different order, you could insert code after that line that sorts them the way you want (I've not checked, though, so not sure). I agree that plotting them separately is a good idea, to be sure what's going on.
I believe it is multiple chromosomes overlaid together. When run with a single BCF (which gives the error but generates this plot) it looks like a few chromosomes are giving the corners. I'll take a look at reordering it, I do have it plotting individually now though.
Cheers,
Oh, this is very interesting. I haven't seen anything where separate chromosomes show separate PC patterns like this before. (maybe one is a sex chromosome?)
If you figure out the fix to the Rmd to make those last plots with a single BCF, please let me know - I'd like to merge it in.
Yeah, it's very interesting. They are not the sex chromosomes, surprisingly, and are very high-confidence SNP calls (very low Mendelian inheritance errors). It is a smaller population, however. If you've got any suggestions on how to look further into this I'd be interested in hearing your thoughts.
There is patterns within the chromosomes as well. I'm going to drop these three to look at other patterns in the population.
It looks like using local kinship matrices in GWAS can really cut down on spurious p-values as well, although how to do it properly and if it works is still something I'm digging into.
If you've got any suggestions on how to look further into this I'd be interested in hearing your thoughts.
Well, my first thought is to look at what the PCs look like on those chromosomes, and see what that turns up? One option, I suppose, is that just linkage is so strong that there's chromosome-wide shared patterns.
It looks like using local kinship matrices in GWAS can really cut down on spurious p-values as well, although how to do it properly and if it works is still something I'm digging into.
Oh, great! Glad to hear someone is investigating this.
Thanks, the population I'm working with is very small with lots of structure (endangered). I've found local kinship matrices with just 1000 SNPs along a sliding window seems to help, but I feel a more formalized approach would be better. If you are interested in helping I'd definitely be open to hearing your thoughts.
As for the extremes here, they do increase with more PCs and MDS components, generating a multidimensional "star" pattern. Looking at the structure of populations along the chromosome from fastStructure output confirms something is going on with allele frequencies, so it does look like a real signal.
Hey, hope you are doing well. Pardon me for using a github issue as a way to communicate. Happy to send an email as well.
I converted the bulk of lostruct into python, and getting the same results (minus a few deep decimal points, due to floating point issues.. R=0.99999983. or thereabouts).
I saw l2 norm in the code, and was curious if you'd ever used it before? Or was it just exploratory?
Thanks.
I converted the bulk of lostruct into python
Oh, nice! I'd like to link to that here, if you've got it on github - people might like to use it!
I saw l2 norm in the code, and was curious if you'd ever used it before? Or was it just exploratory?
We compared it on a few examples and got similar results... so yes, pretty much just exploratory. I haven't got a strong reason why one would be better than the other.
Yeah, let me clean it up a bit. Is it ok if I call it lostruct-py? Otherwise I'll give it another name.
Cool, thanks for letting me know regarding the L2 norm. I've been digging into different GRM preparation methods, so got curious about different things for lostruct as well, but it's probably a bit too much out of my expertise.
Please do. I'm also happy to host it in this repo, and make you a co-maintainer, if you want - either way.
Take a look: https://github.com/jguhlin/lostruct-py
Let me know if you have any issues / concerns.
Great! It looks fantastic! I'm making some minor comments over there. It'd be nice to get some unit tests in before you release it.
I've put a link in the README, but calling it a "prototype" for now; bug me to change that if you feel it's ready for a different adjective.
Prototype sounds good.
I've added unit tests, probably not quite what you were thinking but fits the bill (for now). Open to ideas. I also created a Zenodo DOI, please double check your info: https://zenodo.org/record/3997106
I'll be adding in weights and unit tests for those soonish. Would be good to test with our 2 populations.
It's pretty stable, and I'm able to use it for looking into an endangered species population and get insight into conservation breeding.
Feel free to reach out anytime, and I'm sure I'll bug you with more questions in the future.
Cheers, --Joseph
The DOI checks out. I'll close this issue and move discussion over to that repo. Thanks!
It's pretty stable, and I'm able to use it for looking into an endangered species population and get insight into conservation breeding.
Oh, that's great. Looking forward to hearing about it.
Receiving this error and not sure why. Any ideas?
`Here are all pairwise plots of the first 4 PCs for each of the three corners:
layout(t(1:3)) for (i in 1:(corner.npc-1)) { for (j in (i+1):corner.npc) { for (k in 1:ncol(mds.corners)) { vectors <- matrix( corner.pca[[k]][-(1:(1+corner.npc))], ncol=corner.npc )[,c(i,j)] colnames(vectors) <- paste("PC", c(i,j)) par(mgp=c(0.7,0.7,0), mar=c(2,2,2,0)+.1) plot(vectors, pch=pop.pch[samps$population], col=pop.cols[samps$population], xaxt='n', yaxt='n' ) if (i==1 && j==2) { mtext(paste("corner",k),side=3) } } if (do.pdfs) { pdfcopy(plot.id=paste(i,j,sep="")) } } }
Error in matrix(corner.pca[[k]][-(1:(1 + corner.npc))], ncol = corner.npc): 'data' must be of a vector type, was 'NULL'`