NA values in PCA and MDS results

petrelharp / local_pca

Methods for examining PCA locally along the genome.

71 stars 13 forks source link

NA values in PCA and MDS results #11

Closed jdmanthey closed 4 years ago

jdmanthey commented 4 years ago

Dear Dr. Ralph,

I am attempting to run your lostruct program, and it appears that it is working. However, I am getting a large portion of windows with NA values in the PC and MDS output (up to 30% of windows on a chromosome). At first, I thought this may be because I was using 50kbp windows and there wasn't enough variation in some windows, but I still get a few NA values even when using the program with windows of 10,000 SNPs. I am using a dataset that has no missing data and 22 individuals. I am running these analyses using the templated directory method and the command line.

Any thoughts on what may be causing these results or what I could do to troubleshoot?

Thanks, Joe Manthey

jdmanthey commented 4 years ago

I believe I figured out the answer to the problem. If there is phasing information in the genotype, e.g., a genotype of 0|1 instead of 0/1, then the program does not recognize the data and NA values appear. Once I replaced all the | with / in the vcf files before converting to bcf, the problem disappeared and I had no more NA values in the PCA output tables.

petrelharp commented 4 years ago

Hm, that's odd. Can you pull out the genotype matrix for one of the troublesome windows and look at it?

petrelharp commented 4 years ago

Hm, strange - the vcf_query( ) function is supposed to deal with either phased or unphased formats. Unless your VCF files are a mix of both phased and unphased data? The function infers which format using the first 100 lines. If you can provide a small example VCF file that reproduces the problem, I'd be happy to fix it.

jdmanthey commented 4 years ago

I was running the analyses on my workstation at the office, so I will send you an example some time tomorrow. The data I had that wasn't working was GATK 4.1.0 output that include both phased and unphased genotypes depending on available information for each position.

petrelharp commented 4 years ago

Ah-ha - I didn't think about that possibility. I've pushed a fix that should be able to read your data without modification - let me know (and re-open this issue) if not? Thanks for the report!