statgen / Minimac4

record mismatch in temp files when -a option is active #77

Open edg1983 opened 1 month ago

edg1983 commented 1 month ago

Hi,

I'm using minimac4 v4.1.3 to impute genotypes on a cohort of about 24k individuals.

Usually, I run one imputation job per chromosome. When I run the command with mostly default settings, it works fine and generates an output vcf.gz file containing the same number of variants as the reference panel file (as expected). See this example:

minimac4 -t 12 -b 500 \
    -f GT,DS,HDS,GP \
    -o chr22.imputed.vcf.gz \
    chr22.refpanel.msav chr22.target.vcf.gz

However, if I add the -a option, I get an error when the temp files are merged at the end of the imputation process. The resulting vcf.gz file is truncated and contains fewer variants than the reference panel.

This is the command:

minimac4 -t 12 -b 500 -a \
    -f GT,DS,HDS,GP \
    -o chr22.imputed.vcf.gz \
    chr22.refpanel.msav chr22.target.vcf.gz

Here is the error from the log:

Running HMM took 114 seconds
Writing temp files took 89 seconds
Merging temp files ...
Error: record mismatch in temp files
Error: failed merging temp files

Am I doing something wrong here? Thanks!

jonathonl commented 1 month ago

Can you check to see if you have enough disk space in /tmp to store the chunked results? I think we would have seen an error message earlier in the logs if an error occurred writing the temp files, but that's the only good explanation I have for why this would happen.

Otherwise, is there anything special about the variant immediately after the last one written to output file? What operating system are you running this on and how did you install Minimac4?
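
A quick way to check the free space, as a minimal sketch. It assumes the temp files land under `$TMPDIR`, with `/tmp` as the fallback (adjust if your setup differs), and `free_kb` is just a helper name made up for this example:

```shell
# Print the free space (in KiB) on the filesystem that holds a directory.
# free_kb is a hypothetical helper name, not part of Minimac4.
free_kb() {
    # -P forces the portable one-line-per-filesystem layout; column 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Assumption: temp files are written under $TMPDIR, falling back to /tmp.
tmp_dir="${TMPDIR:-/tmp}"
echo "Free space in ${tmp_dir}: $(free_kb "${tmp_dir}") KiB"
```

Running this on the compute node while a job is writing temp files gives a better picture than checking the login node afterwards.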

edg1983 commented 1 month ago

Hi,

I don't think the issue is related to storage space. I see in the log files a message like Writing temp files took 344 seconds; hence, I assume that all temp files were written correctly.

I currently use Minimac4 on our HPC cluster, which runs on CentOS 8. We grabbed the pre-compiled executable provided with the release on GitHub. It has worked fine so far in all other tests; only the -a option seems to cause problems.

I'll check if I see anything strange in the last variant written to the file and the next one in the imputation ref panel.

jonathonl commented 1 month ago

Ok, if there is something strange, I'm guessing it will be in the next variant in your target VCF (as opposed to the reference VCF).
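
Something along these lines could locate that variant. This is only a sketch: `last_record` and `next_record` are hypothetical helper names, and it assumes both files are gzip/bgzip-compressed VCFs sorted by position, and that the truncated output ends on a complete record:

```shell
# Print CHROM and POS of the last record in a (possibly truncated) VCF.
# last_record and next_record are hypothetical helper names for this sketch.
last_record() {
    # gzip may complain about a truncated file; whatever it decoded is still read by awk.
    gzip -cd "$1" 2>/dev/null |
        awk -F'\t' '!/^#/ { chrom = $1; pos = $2 } END { if (pos != "") print chrom "\t" pos }'
}

# Print the first record on the given chromosome with a position greater than the given one.
next_record() {
    gzip -cd "$1" |
        awk -F'\t' -v chrom="$2" -v pos="$3" \
            '!/^#/ && $1 == chrom && $2 + 0 > pos + 0 { print; exit }'
}

# Example with the file names from this thread:
#   set -- $(last_record chr22.imputed.vcf.gz)
#   next_record chr22.target.vcf.gz "$1" "$2" | cut -f1-12
```

The `cut -f1-12` in the example just keeps the printed line short; with about 24k samples the full record is very long.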

jonathonl commented 1 month ago

I'm guessing that this is happening because there is a target-only variant that has all of the genotypes missing for a batch of samples. This is a bug that I'll need to fix, though phasing software should impute such genotypes. Are you phasing your target vcf before imputing?
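
To look for such sites in the target VCF, a rough sketch like the following may help. It only counts fully missing genotypes per site; the exact condition that triggers the error is a guess at this point, and `missing_per_site` is a hypothetical helper name:

```shell
# Read an uncompressed VCF on stdin and print CHROM, POS, and the number of
# samples whose GT is entirely missing (".", "./." or ".|."), for sites where
# that number is above zero. missing_per_site is a hypothetical helper name.
missing_per_site() {
    awk -F'\t' '
        /^#/ { next }                       # skip header lines
        {
            n = 0
            for (i = 10; i <= NF; i++) {    # sample columns start at column 10
                split($i, f, ":")           # GT is the first FORMAT field when present
                if (f[1] ~ "^[.]([/|][.])*$") n++
            }
            if (n > 0) print $1 "\t" $2 "\t" n
        }'
}

# Example with the target file from this thread, sites with the most missing calls first:
#   gzip -cd chr22.target.vcf.gz | missing_per_site | sort -k3,3nr | head
```

Sites where the count reaches the -b batch size (500 in the commands above) would be the first ones to inspect.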

edg1983 commented 1 month ago

Hi, I'm imputing VCF files from genotyping directly after QC without phasing them.

I'm now re-running the test with -a option to check on the last written variant and the next one in the input VCF. I'll update you here as soon as this is done.

jonathonl commented 1 month ago

You will get very poor imputation results if you impute unphased genotypes (or if you impute with an unphased reference panel). Both input files should be phased.
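
A quick check of whether a VCF is phased, as a sketch: phased genotypes use `|` as the allele separator and unphased ones use `/`. `phase_summary` is a hypothetical helper name, and haploid calls without a separator are not counted either way:

```shell
# Read an uncompressed VCF on stdin and count phased ("|") vs. unphased ("/") GT calls.
# phase_summary is a hypothetical helper name for this sketch.
phase_summary() {
    awk -F'\t' '
        /^#/ { next }                       # skip header lines
        {
            for (i = 10; i <= NF; i++) {    # sample columns start at column 10
                split($i, f, ":")           # GT is the first FORMAT field when present
                if (index(f[1], "|")) phased++
                else if (index(f[1], "/")) unphased++
            }
        }
        END { printf "phased=%d unphased=%d\n", phased, unphased }'
}

# Example with the target file from this thread:
#   gzip -cd chr22.target.vcf.gz | phase_summary
```

A fully phased file should report `unphased=0`.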

edg1983 commented 2 weeks ago

I've tried with imputed genotypes, and I confirm this works fine with the -a option.