statgen / Minimac4

GNU General Public License v3.0

Error writing output #55

Closed: aokulabasile closed this issue 1 year ago

aokulabasile commented 1 year ago

Hello,

I am imputing ~490,000 samples using Minimac4 v4.1.2. As imputation nears the final set of samples, I encounter the following error:

Error: failed writing output
Error: index file too big for skippable zstd frame
Error: could not append S1R index

Here is the command that I am using:

./minimac4 ref. study_data.vcf.gz \
                         --all-typed-sites \
                         --format GT,DS,GP,HDS \
                         --temp-buffer 300 \
                         --region 22:31000001-5100000 \
                         --output-format bcf \
                         --output imputed_chr22-31000001-5100000.bcf \
                         --threads 40

Any idea what is causing this error? Thanks for your help.

jonathonl commented 1 year ago

Do you have write access to /tmp? If so, does /tmp have adequate space to store the output?

aokulabasile commented 1 year ago

Thanks, I think the issue is that /tmp does not have enough space to store the output. Does minimac take a flag to change the temp path, or should I set the TMPDIR environment variable? Thanks for your help.

jonathonl commented 1 year ago

Neither of these approaches will work currently, but I can add this feature (probably later today).

aokulabasile commented 1 year ago

Thanks for your help!

jonathonl commented 1 year ago

The option --temp-prefix has been added with 59317e180fa823c1ab960d1eec23d88383be4631. It now also supports TMPDIR when --temp-prefix is not specified.

aokulabasile commented 1 year ago

Thank you for adding this option. How should this flag be used?

I tried specifying a temp dir with the new option (see below), but the software doesn't run; it just prints what looks like the output of --help. When I run without this flag, minimac runs as it should.

--temp-prefix /scratch/test_ab_minimac_temp/

I also tried exporting TMPDIR as an environment variable and writing the flag exactly as it appears in the help example (i.e., {TMPDIR}/m4_). This also printed the usage options for minimac without running the tool.

jonathonl commented 1 year ago

Sorry, just pushed a fix (https://github.com/statgen/Minimac4/commit/287c084fc7d925bf1199e0573bc10a5e22dd286f). Please pull latest and try again.

jonathonl commented 1 year ago

Note: if you use --temp-prefix, the TMPDIR environment variable will be ignored. TMPDIR must be a directory. --temp-prefix must be an existing directory path with trailing slash plus an optional filename prefix (e.g., /tmp/ or /tmp/optional_prefix). A directory without a trailing slash (e.g., /tmp) will not work.
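The rules above can be checked before launching a long run. The following is a minimal shell sketch and not part of Minimac4: the function name check_temp_prefix is hypothetical, and the writability test is an added assumption (the note above only requires that the directory exist).

```shell
#!/bin/sh
# Hypothetical pre-flight check for a --temp-prefix value (not part of Minimac4).
# Encodes the rules stated above: an existing directory path ending in "/",
# optionally followed by a filename prefix.
check_temp_prefix() {
    prefix=$1

    # A bare directory such as /tmp (no trailing slash) is the case reported not to work.
    case "$prefix" in
        */) ;;
        *)
            if [ -d "$prefix" ]; then
                echo "error: '$prefix' is a directory; add a trailing slash" >&2
                return 1
            fi
            ;;
    esac

    # Everything up to and including the last "/" is the directory part.
    case "$prefix" in
        */*) dir=${prefix%/*}/ ;;
        *)
            echo "error: '$prefix' contains no directory path" >&2
            return 1
            ;;
    esac

    if [ ! -d "$dir" ]; then
        echo "error: directory '$dir' does not exist" >&2
        return 1
    fi
    # Assumption: temporary files are created in this directory, so it should be writable.
    if [ ! -w "$dir" ]; then
        echo "error: directory '$dir' is not writable" >&2
        return 1
    fi
    return 0
}
```

Calling the function with the intended prefix before invoking minimac4, and aborting on a non-zero return, catches the missing-trailing-slash case early instead of after a usage message.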

aokulabasile commented 1 year ago

Thanks. I was now able to run imputation successfully with the new temp dir flag. Thanks for making these changes!

dpelegri commented 1 year ago

Hello, I get the same error as aokulabasile. I have installed the latest version and run the command:

~/software/minimac4/bin/minimac4 ${ref_panel}${i}.1000g.Phase3.v5.With.Parameter.Estimates.msav \
                         ${src_path}chr${i}.vcf.gz \
                         --output /LocallyImputed/1000GP3v5 \
                         --output-format vcf.gz \
                         --threads 30 \
                         --temp-prefix /scratch/minimac_tmp/

I get the error:

Error: index file too big for skippable zstd frame
Error: could not append S1R index

I have checked the temp path and have not seen any files. Should I expect to find temporary files there?

Thanks for your help

jonathonl commented 1 year ago

Temporary files are immediately unlinked from the filesystem after opening, so they may not be visible to you. I'm guessing you do not have adequate space on /scratch/minimac_tmp/ to fit the output files. How large is that disk? How many individuals are you imputing?
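The unlink-after-open behaviour described above can be reproduced in a few lines of shell. This is only an illustration of the general POSIX pattern, not Minimac4's actual code; the file name m4_demo_XXXXXX is a made-up placeholder.

```shell
#!/bin/sh
# Illustration only: a file that is unlinked while still open keeps its
# disk space until the descriptor is closed, but no longer appears in ls.
tmp_file=$(mktemp "${TMPDIR:-/tmp}/m4_demo_XXXXXX")   # hypothetical file name

exec 3>"$tmp_file"      # open the file on descriptor 3
rm -f "$tmp_file"       # unlink it; the open descriptor stays valid

# Writing still succeeds even though the path is gone.
if echo "still writable" >&3; then
    write_ok=yes
else
    write_ok=no
fi

# The path is no longer visible in the directory.
if [ -e "$tmp_file" ]; then
    visible=yes
else
    visible=no
fi

exec 3>&-               # closing the descriptor releases the disk space

echo "write_ok=$write_ok visible=$visible"
```

While a process holds such files open, the space they occupy counts against the free space reported by df even though ls and du show nothing. On systems with lsof installed, `lsof +L1` lists open files whose link count is zero.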

dpelegri commented 1 year ago

Hello @jonathonl,

I am imputing data from UKBiobank with 488377 samples. In /scratch I have:

Filesystem                                      Size  Used Avail Use% Mounted on
/dev/mapper/centos-scratch                      1.0T  1.5G 1023G   1% /scratch

Is that enough space to run the imputation?

Thanks,

jonathonl commented 1 year ago

I don't know. Does it run for a while before producing this error? Are you running all chromosomes simultaneously?

I would try running a single 10 Mbp chunk (e.g., -r chr20:40000001-50000000) to see what the output file size is. You could then extrapolate to get the total disk space needed.
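The extrapolation suggested above could look roughly like the sketch below. All numbers are placeholders: the 12 GiB chunk size is invented for illustration, the chromosome length assumes chr20 on GRCh37, and the function names are hypothetical. It also assumes every 10 Mbp chunk produces about as much output as the test chunk, which is only approximately true because variant density varies along the chromosome.

```shell
#!/bin/sh
# Rough disk-space extrapolation from one test chunk (illustrative only).
# Assumption: every chunk of the chromosome produces about as much output
# as the test chunk; real output size varies with variant density.

# estimate_total_bytes CHUNK_BYTES CHUNK_BP CHROM_BP
# Rounds the number of chunks up so the estimate errs on the high side.
estimate_total_bytes() {
    chunk_bytes=$1
    chunk_bp=$2
    chrom_bp=$3
    n_chunks=$(( (chrom_bp + chunk_bp - 1) / chunk_bp ))
    echo $(( chunk_bytes * n_chunks ))
}

# avail_bytes DIR: free space on the filesystem holding DIR, via POSIX df.
avail_bytes() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    echo $(( ${avail_kb:-0} * 1024 ))
}

# Placeholder example: a 10 Mbp chunk that produced 12 GiB of output,
# on a chromosome of 63,025,520 bp (chr20, GRCh37).
chunk_bytes=$(( 12 * 1024 * 1024 * 1024 ))   # in practice: wc -c < chunk_output_file
needed=$(estimate_total_bytes "$chunk_bytes" 10000000 63025520)
free=$(avail_bytes "${TMPDIR:-/tmp}")

echo "estimated bytes needed: $needed"
echo "bytes available:        $free"
if [ "$needed" -gt "$free" ]; then
    echo "warning: estimated output exceeds free space" >&2
fi
```

Rounding the chunk count up (7 chunks for chr20 here) keeps the estimate conservative. The same comparison should be made for whichever directory holds the temporary files, since that is where the space is consumed during the run.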

jonathonl commented 1 year ago

Also, I just noticed that you are running on CentOS. Which version of CentOS are you running? And how did you install Minimac4?

dpelegri commented 1 year ago

I run the imputation on each chromosome separately, and I am testing it with chromosomes 20 and 19 on different nodes of the cluster, each node with its own /scratch and identical characteristics. Minimac runs for a while; it does not always process the same number of samples, but I have seen it process about 100,000 samples before showing the error. I have a drive with 16T available, so I will try the imputation with --temp-prefix set to a path on this drive. I will try it and let you know.

Thank you so much,

dpelegri commented 1 year ago

With 16TB available I get the same error:

Completed 50700 of 488377 samples
Completed 50800 of 488377 samples
Completed 50900 of 488377 samples
Error: index file too big for skippable zstd frame
Error: could not append S1R index

I'm using the command:

${minimac4} ${ref_panel}${i}.1000g.Phase3.v5.msav \
                         ${src_path}chr${i}.vcf.gz \
                         --output-format vcf.gz \
                         --output ${dest_path} \
                         --temp-prefix ${temp_path} \
                         --threads 30

where ${i} is 20 (chromosome 20).

Thanks,

jonathonl commented 1 year ago

Which version of CentOS are you running? Also, can you send the full log output?