Closed jdmanthey closed 4 years ago
I believe I figured out the answer to the problem. If there is phasing information in the genotype, e.g., a genotype of 0|1 instead of 0/1, then the program does not recognize the data and NA values appear. Once I replaced all the | with / in the vcf files before converting to bcf, the problem disappeared and I had no more NA values in the PCA output tables.
Hm, that's odd. Can you pull out the genotype matrix for one of the troublesome windows and look at it?
Hm, strange - the vcf_query( )
function is supposed to deal with either phased or unphased formats. Unless your VCF files are a mix of both phased and unphased data? The function infers which format using the first 100 lines. If you can provide a small example VCF file that reproduces the problem, I'd be happy to fix it.
I was running the analyses on my workstation at the office, so I will send you an example some time tomorrow. The data I had that wasn't working was GATK 4.1.0 output that include both phased and unphased genotypes depending on available information for each position.
Ah-ha - I didn't think about that possibility. I've pushed a fix that should be able to read your data without modification - let me know (and re-open this issue) if not? Thanks for the report!
Dear Dr. Ralph,
I am attempting to run your lostruct program, and it appears that it is working. However, I am getting a large portion of windows with NA values in the PC and MDS output (up to 30% of windows on a chromosome). At first, I thought this may be because I was using 50kbp windows and there wasn't enough variation in some windows, but I still get a few NA values even when using the program with windows of 10,000 SNPs. I am using a dataset that has no missing data and 22 individuals. I am running these analyses using the templated directory method and the command line.
Any thoughts on what may be causing these results or what I could do to troubleshoot?
Thanks, Joe Manthey