rwdavies / QUILT

GNU General Public License v3.0
45 stars 10 forks source link

"0 in the intersection of the posfile and the reference haplotypes" #14

Open karstyl opened 2 years ago

karstyl commented 2 years ago

Hello,

I am running quilt on a set of data where I have ~1300 F2's with shallow sequencing and high quality phased genotypes from the F0's.

After running most of my data it all looked good except one chromosome, and I realized that I had the position data from an old version of the genome, so I updated the portions of the variants in the impute.legend file.

When I tried to re-run the chromosome, I am getting this in the output: [2022-03-16 10:06:12] In the region to be imputed plus buffer there are the following number of SNPs: [2022-03-16 10:06:12] 505727 from the posfile [2022-03-16 10:06:12] 505727 from the reference haplotypes [2022-03-16 10:06:12] 0 in the intersection of the posfile and the reference haplotypes Error in print_and_validate_reference_snp_stats(pos_snps, legend_snps, : There are 0 SNPs that intersect between the posfile and the reference haplotypes. Please troubleshoot to see if there is an error, or re-run without reference haplotypes

This is the command I am running: /home/jjohnso/project-aces/apps/QUILT/QUILT/QUILT.R \ --outputdir=${dirname} \ --chr=${chr} \ --tempdir=temp.${dirname} \ --output_filename=quilt.chr${chr}a.set${set}.vcf \ --bamlist=shortbamlist${chr}.txt \ --sampleNames_file=shortbamlist${chr}.names \ --reference_haplotype_file=gpinfo/chr${chr}.impute.hap.gz \ --reference_legend_file=gpinfo/chr${chr}.impute.legend.gz \ --genetic_map_file=gpinfo/chr${chr}.geneticmap.txt \ --nGen=${nGenval} \ --nCores=11 \ --save_prepared_reference=TRUE

I am not providing a posfile, so I assume that it previously made one with the other positions that it has stored in some other location.

I have removed the old directories, changed the temp directory, changed the file names, and double checked my impute.legend files.

-Jen Jhohnson

rwdavies commented 2 years ago

Hi Jen,

That is very weird. Indeed there is no "posfile" here, it's just the way I'm re-using code from STITCH, both the "posfile" and the legend file are derived from the legend. Somewhat suspicious that there are exactly the same number of SNPs in both, but no overlap.

If it's not too much trouble, are you able to add some print statements to the code, re-install QUILT, then try re-running? For instance the problematic line is line 895 with "print_and_validate_reference_snp_stats", but if you put some print statements just above it starting from current line 893, like print(head(pos_snps)) print(head(legend_snps)) maybe that will give some insight into why the code is now running into problems. You can then re-install QUILT using something like

cd /home/jjohnso/project-aces/apps/QUILT/QUILT/
./scripts/build-and-install.R

Sorry I can't immediately see the problem, Best wishes, Robbie

karstyl commented 2 years ago

Robbie,

Sure, I can do that. Which code file should I edit?

-Jen

rwdavies commented 2 years ago

Right, sorry, it would be this file https://github.com/rwdavies/STITCH/blob/master/STITCH/R/reference.R so you'd actually need to download and compile STITCH if not already done then re-build and install using

cd ~/proj/STITCH ## or similar
./scripts/build-and-install.R
karstyl commented 2 years ago

I figured out the issue. When you mentioned that it was not the issue of stored data, I knew it had something to do with my files. I have run the rest of my chromosomes without this issue. It turns out that the issue was the program that converted the positions added a carriage return instead of a newline. I thought this would have been fixed when I re-wrote the file using awk and printing it column by column, but the carriage return stayed in the file.

For others who find this same issue: To find it" cat -A filename will show '^M$' instead of '$' tr -d $'\r' < filename will fix it.