statgen / Minimac4

GNU General Public License v3.0
56 stars 20 forks source link

Parallel bgzip cram #13

Open daheise opened 6 years ago

daheise commented 6 years ago

This pull request may belong on a branch other than master since it depends on a PR being accepted in another project, namely this one.

If libStatGen accepts the above PR, then this PR modifies Minimac to use the parallel bgzip functionality, and adds additional parallelism in the final append.

I do not have a way to perform regression tests for correctness tests at this time. Let me know if there are any changes to the PR needed to aid acceptance.

Known Issue The cget requirements file will need to be tweaked to point to the correct libStatGen prior to acceptance, if libStatGen accepts the dependent PR.

daheise commented 6 years ago

I did test with the test data included in Eagle tables and example folders. I got the same output from Minimac4 master and this PR.

avsmith commented 5 years ago

@Santy-8128 @jonathonl Can you take a look at this pull request, and commit if passes muster.

jonathonl commented 5 years ago

This will likely conflict with sav-support-2 branch, which is being tested by Imputation Server team. Once sav-support-2 branch is merged into master, I can potentially review this for inclusion.

avsmith commented 5 years ago

@daheise What is the value add here? We don't anticipate a great gain. Much is dependent on libStatGen and we are concerned with this making breaking changes in other tools.

daheise commented 5 years ago

We're running imputation on data from 400K+ individuals. Minimac does an extremely long running single threaded compressed write to disk. These changes reduce the run time of our longest running chunk from 111 hours to 70 hours.

For reference, this is also related to resolving issue #11. I should have mentioned that in the PR.

jonathonl commented 5 years ago

We are working on a different solution for the problem you are experiencing and this PR is in conflict with that vision. If you are satisfied with what you've implemented, then I recommend you continue using your fork. You could also try the sav-support-2 branch and add --sav to your command. This reduces run time by using the SAV storage format for temp and output files. The other alternative is to spread the samples you are imputing across multiple minimac4 processes.

daheise commented 5 years ago

I will see if I can accomplish a test run on our data using SAV, can you provide additional information regarding that format? Do you provide a conversion to m3vcf when using that format?

We have tried spreading the samples across many processes -- the 111 hour number is from one chunk out of 154.

jonathonl commented 5 years ago

SAV is only for output from minimac4, so you would continue using m3vcf for the reference panel input. You can get the sav CLI from https://github.com/statgen/savvy/releases, which allows you to export the imputed output to vcf.

daheise commented 5 years ago

Is SAV already the output format of Minimac4 on master, or are you saying that will be the only output file once the sav-support-2 branch in merged?

I did pullsav-support-2 into my parallel-bgzip-cram, and got one significant merge conflict:

++<<<<<<< HEAD
 +    IFILE vcfdosepartial = ifopen(PartialVcfFileName.c_str(), bgzf_mode, InputFile::BGZF);
 +    IFILE vcfLoodosepartial = NULL;
++=======
+     IFILE vcfdosepartial = nullptr;
+     IFILE vcfLoodosepartial = nullptr;
+     std::unique_ptr<savvy::sav::writer> temp_sav_file = nullptr;
+ 
+ 
+     savvy::site_info sav_variant_anno;
+     if (MyAllVariables->myOutFormat.savOutput)
+     {
+         std::vector<std::pair<std::string,std::string>> empty_vec;
+         savDoseBuf.resize(2 * BuffernumSamples);
+         temp_sav_file = savvy::detail::make_unique<savvy::sav::writer>(PartialVcfFileName,
+           individualName.begin() + ((NovcfParts - 1) * BuffernumSamples), individualName.begin() + std::min(individualName.size(), std::size_t(NovcfParts * 
+           empty_vec.begin(), empty_vec.end(),
+           savvy::fmt::hds);
+     }
+     else
+     {
+         vcfdosepartial = ifopen(PartialVcfFileName.c_str(), "wb", InputFile::BGZF);
+     }
+ 
++>>>>>>> 38f4f1695598fd262f5a8b1585adce8a70fbbcf8
daheise commented 5 years ago

Is it correct that the --sav option produces the equivalent of --format GT? In our use case we're using --format GT,DS,HDS,GP. Will SAV support additional format options?

jonathonl commented 5 years ago

Is SAV already the output format of Minimac4 on master, or are you saying that will be the only output file once the sav-support-2 branch in merged?

The master branch only outputs vcf files. The sav-support-2 branch is a first step towards better file output, which adds support for sav. Eventually, the savvy library will be used to output sav, vcf and bcf files. But for now there is a hybrid of savvy and libStatGen.

Is it correct that the --sav option produces the equivalent of --format GT? In our use case we're using --format GT,DS,HDS,GP. Will SAV support additional format options?

The --sav option produces the equivalent of HDS. Storing the other 3 format fields is redundant because they can all be calculated from HDS. The sav cli use the -d, --data-format option to specify how to view the data (e.g., sav export -d HDS file.sav, sav export -d DS file.sav, etc.)