petrelharp / local_pca

Methods for examining PCA locally along the genome.

Issue with my file? Lots of different errors #35

Closed: lilymaya closed this issue 6 months ago

lilymaya commented 6 months ago

Hi, I am having some trouble getting the program to run, with a couple of different errors coming up.

I had tested the program previously and my files ran fine. However, I have since adjusted my filtering, and my new files no longer work. This is strange because, apart from a filter or two, the files were prepared in the same way.

The first step, snps <- read_vcf("chr1.recode.vcf"), produced the following error:

Error in dim(haps) <- c(2, dim(dips)) : 
  dims [product 50957024] do not match the length of object [50953094]

I saw the issue https://github.com/petrelharp/local_pca/issues/28 and tried using vcfR instead to read in the file, as suggested. However, when I went to the next step, pcs <- eigen_windows(snps, win= 100, k=2), I got the following error:

Error in rowMeans(x, na.rm = TRUE) : 
  'x' must be an array of at least two dimensions

So then I tried using BCF files as input, as described in the GitHub README: snps <- vcf_windower("chr1.bcf", size = 669328, type = 'snp') (note that I used the total number of SNPs as the size input here, but perhaps that was wrong and caused the next issue?). I got warning messages, but otherwise it seemed fine:

Warning messages:
1: In any(chrom.wins) : coercing argument of type 'double' to logical
2: In vcf_windower_snp(file = file, sites = sites, size = size, samples = samples) :
  Trimming from chromosome ends: Chr2: 0 SNPs.

The next step, pcs <- eigen_windows(snps, win= 100, k=2), also seemed to run fine. But then when I ran the distance function, pcdist <- pc_dist(pcs, npc = 2), I got the following error:

Error in array(STATS, dims[perm]) : 'dims' cannot be of length 0
In addition: Warning message:
In sweep(x[, -(1:(1 + npc))], 2, rep(sqrt(w), npc), "*") :
  STATS is longer than the extent of 'dim(x)[MARGIN]'

So now I am at a bit of a standstill as to what to try next. I am wondering if there is some sort of input file issue, given that this problem didn't arise with my other files. However, I cannot identify a significant difference between the files that would cause this. Any suggestions would be appreciated, thank you so much! :)

lilymaya commented 6 months ago

For anyone else that comes across these issues, it turns out that my very small amount of missing data (<0.5%) was causing the problems. Once I removed it, everything worked as normal. So make sure to remove ALL missing data! :)
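
For readers who want to check their own data, here is a minimal base-R sketch of measuring the missing fraction and keeping only sites with no missing genotypes. It assumes the genotypes are already in a numeric matrix with SNPs in rows, samples in columns, and NA for missing calls; the matrix `geno` below is toy data made up for illustration, not output from local_pca.

```r
# Toy genotype matrix (hypothetical data): rows = SNPs, columns = samples,
# NA = missing genotype. matrix() fills column by column.
geno <- matrix(c(0, 1, 2, NA,
                 1, 1, 0, 2,
                 2, 0, NA, 1),
               nrow = 4, ncol = 3)

# Overall fraction of missing genotype calls.
missing_fraction <- mean(is.na(geno))

# Keep only SNPs (rows) that are non-missing in every sample.
keep <- complete.cases(geno)
geno_complete <- geno[keep, , drop = FALSE]

cat("missing fraction:", missing_fraction, "\n")
cat("SNPs kept:", nrow(geno_complete), "of", nrow(geno), "\n")
```

Dropping every site with any missing call is the bluntest option and can discard a lot of sites when there are many samples, so it is worth looking at the missing fraction first.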

petrelharp commented 6 months ago

Oh, good - I'm glad you figured it out! (and before I managed to dig into this!)

Missing data is in general okay; there must be something special about how your missing data were structured that made things fail - for instance, maybe there was an entire window missing, or two samples had no non-missing overlap in a window, or something like that.
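
The two patterns mentioned above (an entirely missing window, or two samples with no non-missing overlap in a window) can be checked directly. The following is a rough base-R sketch, not part of local_pca: the helper name `check_window_missingness`, the matrix layout (SNPs in rows, samples in columns, NA for missing), and the toy data are all assumptions made for illustration.

```r
# Hypothetical helper (not part of local_pca): for each window of `win`
# consecutive SNPs, report whether the window is entirely missing and the
# smallest number of SNPs shared as non-missing by any pair of samples.
# geno: numeric matrix, rows = SNPs, columns = samples, NA = missing.
check_window_missingness <- function(geno, win) {
  stopifnot(is.matrix(geno), win >= 1)
  n_windows <- floor(nrow(geno) / win)  # trailing partial window is ignored
  out <- data.frame(
    window = seq_len(n_windows),
    all_missing = logical(n_windows),
    min_pair_overlap = numeric(n_windows)
  )
  for (w in seq_len(n_windows)) {
    rows <- ((w - 1) * win + 1):(w * win)
    present <- !is.na(geno[rows, , drop = FALSE])
    storage.mode(present) <- "integer"
    # Entry (i, j) of crossprod(present) counts the SNPs in this window at
    # which samples i and j are both non-missing.
    overlap <- crossprod(present)
    out$all_missing[w] <- !any(present == 1L)
    out$min_pair_overlap[w] <- min(overlap)
  }
  out
}

# Toy data: 9 SNPs x 4 samples, checked in windows of 3 SNPs.
geno <- matrix(rep(c(0, 1, 2), 12), nrow = 9, ncol = 4)
geno[1:3, ] <- NA   # window 1: every genotype missing
geno[4:5, 1] <- NA  # window 2: sample 1 is only present at SNP 6 ...
geno[6, 2] <- NA    # ... and sample 2 is missing exactly there
res <- check_window_missingness(geno, win = 3)
print(res)
```

Windows with `all_missing` equal to TRUE, or with `min_pair_overlap` of 0 (or very small), are the ones to look at first, since a covariance between two samples cannot be estimated from sites where at least one of them is missing.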

lilymaya commented 6 months ago

Oh interesting, thanks! :)