statgen / Minimac4

GNU General Public License v3.0

Minimac4 -- 'segv' / 'Segmentation Fault (core dumped)' Error #8

Open CScottGallagher opened 6 years ago

CScottGallagher commented 6 years ago

Hello, I've been trying to run Minimac4 to impute a cohort of roughly 300,000 individuals. When attempting to impute a single chunk of 5 Mb +/- 1 Mb on chromosome 11, I am routinely encountering a 'Segmentation Fault (core dumped)' error. The segv error occurs even when I reduce the input size to 5,000 individuals. Are you aware of any bugs in the software that would produce this error type?

yukt commented 6 years ago

Hi. We recently fixed a bug that caused a segmentation fault, and Minimac4 has been updated to version 1.0.1. Could you please check whether the software you are using is up to date? Thanks!

CScottGallagher commented 6 years ago

Hello! Thanks for the quick response. We have updated to version 1.0.1 but we are still observing the following error:

".../1532025929.2118.shell: line 34: 16770 Aborted (core dumped) $MINIMAC4_EXEC --refHaps $REF_HAPLOTYPES --haps $SAMPLE_HAPLOTYPES --format GT,DS,GP --passOnly -- allTypedSites --chr $CHR --start $CHUNK_START --end $CHUNK_END --window 100000 --prefix chr11.02.03 --log"

We are attempting to run this on a cohort of more than 300,000 individuals on a single chunk 1 Mb +/- 100 kb. Do you have any advice for circumventing the error?

Santy-8128 commented 6 years ago

As a quick test, would it be possible to run the test examples that came with the Minimac3 package (link provided below)?

I apologize that we removed the test cases in Minimac4. We will fix that soon. Until then, please try with the Minimac3 test cases and let us know if it still segfaults.

https://github.com/Santy-8128/Minimac3/tree/master/test

Regards, Sayantan Das,

CScottGallagher commented 6 years ago

Hi Sayantan,

Thank you so much for your responses. We are now able to successfully run Minimac4 on a 5 Mb chunk (+/- 1 Mb window) for up to 300,000 individuals. When we exceed that and try to run the imputation on ~450,000 individuals, we observe a core dump error. In addition to alerting you to this error, I wanted to ask whether there are any scientific consequences to breaking the imputation input into two separate groups. Do the samples (not the reference panel) influence one another's imputation?

Best, Scott

Santy-8128 commented 6 years ago

Hi Scott,

My guess would be that you may be running out of memory?

No, samples don't influence each other's imputation. The only advantage to doing them together is that you would get the R-square estimate for each variant computed across all samples. If you split the samples into two groups, your imputed results would be the same, but you would end up with two R-square values for each variant (one per group). In that case, you would need to calculate the omnibus R-square (across all samples) from the individual batch-specific R-squares, which shouldn't be difficult to derive if you know the formula for the minimac R-square.
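
For illustration, here is a rough sketch of that combination step, assuming the usual MaCH/minimac definition Rsq = Var(haploid dosage) / (p(1 - p)) with p the mean haploid dosage; the helper below and its inputs are hypothetical, not part of Minimac4:

```python
# Hypothetical sketch: combine per-batch minimac R-squares into an omnibus
# R-square, assuming Rsq = Var(haploid dosage) / (p * (1 - p)), where p is
# the mean haploid dosage (the estimated alternate allele frequency).

def omnibus_rsq(batches):
    """batches: list of (n_haplotypes, p_hat, rsq) tuples, one per batch."""
    total_n = sum(n for n, _, _ in batches)

    # Pooled allele frequency: haplotype-weighted mean of the batch means.
    p_pool = sum(n * p for n, p, _ in batches) / total_n

    # Recover each batch's sum of squared dosages from its variance
    # (Var = rsq * p * (1 - p)) and its mean, then pool:
    #   sum(d^2) = n * (Var + p^2)
    sum_sq = sum(n * (r * p * (1.0 - p) + p * p) for n, p, r in batches)
    var_pool = sum_sq / total_n - p_pool * p_pool

    denom = p_pool * (1.0 - p_pool)
    return var_pool / denom if denom > 0 else 0.0


# Example: two batches of 600,000 and 300,000 haplotypes.
print(omnibus_rsq([(600_000, 0.12, 0.85), (300_000, 0.13, 0.80)]))
```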

Regards, Sayantan Das,

23andMe

CScottGallagher commented 6 years ago

Excellent. Sayantan, we are also trying to impute the pseudoautosomal regions, but when defining regions to impute, Minimac4 doesn't seem to know that there is a big gap between PAR1 and PAR2. This leads to defining massive regions to impute and automatic chunking (20 Mb chunks). The only way to get around this is to exclude the block that has this gap in the m3vcf file and define the pseudoautosomal regions so that they don't include this block. Do you have any advice for circumventing the issue?

In addition, how does the automatic chunking and merging of Minimac4 work? Would there be any scientific consequences compared to manually setting regions to impute?

Santy-8128 commented 6 years ago

Hi,

Yes, you can always manually use the --chr --start --end options to impute PAR1 and PAR2 separately. One should NOT impute PAR1 and PAR2 together anyway, since they are on opposite ends of the chromosome and I don't think there is any LD across them (I am not sure of this statement though, just an intuition since they are really far apart). Please get back if that does not answer your question.
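
For example, a minimal sketch of running the two regions as separate jobs with the same options already used above; the GRCh37 PAR coordinates, file names, and executable name below are illustrative assumptions, so substitute your own build and panel:

```python
# Illustrative only: impute PAR1 and PAR2 as two independent Minimac4 runs.
# The GRCh37 PAR coordinates, file names, and executable name are assumptions.
import subprocess

PARS = {
    "PAR1": (60_001, 2_699_520),         # assumed GRCh37 coordinates
    "PAR2": (154_931_044, 155_260_560),
}

for name, (start, end) in PARS.items():
    subprocess.run(
        [
            "minimac4",                          # or $MINIMAC4_EXEC
            "--refHaps", "ref.chrX.m3vcf.gz",    # hypothetical reference panel
            "--haps", "target.chrX.vcf.gz",      # hypothetical target VCF
            "--chr", "X",
            "--start", str(start),
            "--end", str(end),
            "--prefix", f"chrX.{name}",
            "--format", "GT,DS,GP",
            "--log",
        ],
        check=True,
    )
```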

Regards, Sayantan Das,

23andMe

CScottGallagher commented 6 years ago

Hi Sayantan,

Could you answer this part of the question as well:

In addition, how does the automatic chunking and merging of Minimac4 work? Would there be any scientific consequences compared to manually setting regions to impute?

Best, Scott

Santy-8128 commented 6 years ago

Hi Scott,

Yes, of course.

The automatic chunking and merging doesn't do anything special apart from: (a) it reads the variant list and uses the values of --chunkLengthMb and --ChunkOverlapMb to work out how to chunk the data (the constraint being that the resulting chunks should be at least 20 Mb long with at least 3 Mb overlap on either side, based on the default values); (b) next, it imputes each chunk sequentially (including the overlap parts) and saves the result by appending the resulting data (without the overlap) to a final output file. This way there is no need to run a separate concatenation step at the end.
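
As a rough illustration of that arithmetic (this is not Minimac4's internal code), a sketch using the default values of 20 Mb chunks and 3 Mb overlap mentioned above:

```python
# Rough sketch of the chunking arithmetic described above; NOT Minimac4's
# actual code, just an illustration of 20 Mb chunks with 3 Mb flanks.

def make_chunks(region_start, region_end, chunk_mb=20, overlap_mb=3):
    """Yield (core_start, core_end, window_start, window_end) tuples."""
    chunk_bp = chunk_mb * 1_000_000
    overlap_bp = overlap_mb * 1_000_000
    core_start = region_start
    while core_start <= region_end:
        core_end = min(core_start + chunk_bp - 1, region_end)
        yield (
            core_start,
            core_end,
            max(core_start - overlap_bp, region_start),  # left flank
            min(core_end + overlap_bp, region_end),      # right flank
        )
        core_start = core_end + 1


# Example: chromosome 11 (~135 Mb) split into 20 Mb chunks.
for chunk in make_chunks(1, 135_000_000):
    print(chunk)
```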

As minimac4 runs the automated chunking, it will print out a summary of the start and end positions of each chunk it ran. If one runs those chunks manually using the --start --end --window options, they would get the exact same results. So, in terms of accuracy, there is no difference between automated and manual chunking, assuming the exact same chunk configurations are run. However, the automated chunking can only impute the chunks sequentially, whereas when running the chunks manually one could impute all chunks in parallel. On the other hand, if one runs the chunks manually, they would have to concatenate the results back into whole-chromosome files, whereas the automated chunking of minimac4 gives you whole-chromosome files directly. Does that help?

And lastly, the automated chunking of minimac4 is still invoked when manually chunking using --start --end. However, if the region specified by --start and --end is smaller than the value of --chunkLengthMb (which is 20 by default), then the automated chunking treats it as a single chunk. In other words, to override the automated chunking one needs to specify a high value of --chunkLengthMb (higher than the region one wants to impute as a single chunk). Please let me know if this helps and/or if there are any other questions.
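
For completeness, a minimal sketch of that manual workflow: impute the chunks in parallel (passing a --chunkLengthMb larger than each region so each run stays a single chunk, as described above), then stitch the pieces back together. The chunk boundaries, file names, output naming (<prefix>.dose.vcf.gz), and the use of bcftools concat are assumptions for illustration:

```python
# Hypothetical manual-chunking workflow: impute chunks in parallel, then
# concatenate. Boundaries, file names, and the <prefix>.dose.vcf.gz output
# naming are assumptions.
import subprocess
from concurrent.futures import ProcessPoolExecutor

CHUNKS = [(1, 20_000_000), (20_000_001, 40_000_000)]  # illustrative boundaries


def impute(bounds):
    start, end = bounds
    prefix = f"chr11.{start}.{end}"
    subprocess.run(
        [
            "minimac4",                          # or $MINIMAC4_EXEC
            "--refHaps", "ref.chr11.m3vcf.gz",   # hypothetical file names
            "--haps", "target.chr11.vcf.gz",
            "--chr", "11",
            "--start", str(start),
            "--end", str(end),
            "--window", "3000000",               # mirror the 3 Mb overlap
            "--chunkLengthMb", "500",            # larger than the region: one chunk
            "--prefix", prefix,
            "--log",
        ],
        check=True,
    )
    return f"{prefix}.dose.vcf.gz"


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        outputs = list(pool.map(impute, CHUNKS))
    # Concatenate the per-chunk files back into one whole-chromosome file.
    subprocess.run(
        ["bcftools", "concat", "-Oz", "-o", "chr11.dose.vcf.gz", *outputs],
        check=True,
    )
```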

Regards, Sayantan Das,

23andMe
