statgen / ruth

Robust Unified Hardy-Weinberg Equilibrium Test
Apache License 2.0

Bus Error #3

Closed jjfarrell closed 1 year ago

jjfarrell commented 4 years ago

The following error occurs when running RUTH on a couple of chromosomes from a Lumpy/SVTyper VCF of 4789 samples. A bus error occurs, followed by a warning that the file is truncated. The file is not truncated. I extracted that region out of the VCF and the error still occurs. RUTH has run fine on 4 other sets of SV calls. It has also run fine on 20 other chromosomes from LUMPY. There is a similar error on chr18. It seems to be hitting some edge case. Any suggestions?

run RUTH on passed variants

Available Options

The following parameters are available. Ones with "[]" are in effect:
                              Input Options : --evec [adsp5k.evec],
                                              --vcf [adsp5k.lumpy.duphold.chr2.vcf.gz],
                                              --thin [1.00], --seed,
                                              --num-pc [4], --field [GT],
                                              --gt-error [5.0e-03],
                                              --lambda [1.00]
                             Output Options : --out [adsp5k.lumpy.duphold.chr2.ruth.vcf.gz],
                                              --skip-if, --skip-info,
                                              --site-only, --nelder-mead,
                                              --lrt-test, --lrt-em
                        Samples to focus on : --sm-list
             Parameters for sex chromosomes : --sex-map, --x-label [X],
                                              --y-label [Y], --mt-label [MT],
                                              --x-start [2699520],
                                              --x-stop [154931044]
   Options to specify when chunking is used : --ref, --unit [2147483647],
                                              --interval, --region

Run with --help for more detailed help messages of each argument.

NOTICE [2019/11/18 21:29:19] - Analysis Started
NOTICE [2019/11/18 21:29:19] - Reading sample eigenvectors
NOTICE [2019/11/18 21:29:19] - Identifying sample columns to extract..
NOTICE [2019/11/18 21:29:19] - Reading in BCFs...
NOTICE [2019/11/18 21:29:19] - Finished identifying 4789 samples to load from VCF/BCF
NOTICE [2019/11/18 21:29:22] - Reading 100 variants at chr2:89842, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:24] - Reading 200 variants at chr2:174449, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:27] - Reading 300 variants at chr2:239892, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:30] - Reading 400 variants at chr2:318885, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:33] - Reading 500 variants at chr2:389341, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:35] - Reading 600 variants at chr2:457485, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:38] - Reading 700 variants at chr2:538215, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:41] - Reading 800 variants at chr2:644346, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:43] - Reading 900 variants at chr2:718898, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:46] - Reading 1000 variants at chr2:776468, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:49] - Reading 1100 variants at chr2:863881, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:52] - Reading 1200 variants at chr2:944586, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:54] - Reading 1300 variants at chr2:1011389, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:29:57] - Reading 1400 variants at chr2:1096479, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:30:00] - Reading 1500 variants at chr2:1155196, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:30:03] - Reading 1600 variants at chr2:1221493, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:30:05] - Reading 1700 variants at chr2:1303456, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:30:08] - Reading 1800 variants at chr2:1361836, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:30:11] - Reading 1900 variants at chr2:1462216, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:30:14] - Reading 2000 variants at chr2:1523686, Skipping 0, Missing 0.
NOTICE [2019/11/18 21:30:16] - Reading 2100 variants at chr2:1527004, Skipping 0, Missing 0.
/var/spool/sge/scc-wr1/job_scripts/940209: line 11: 28537 Bus error               ruth --vcf $VCF --evec adsp5k.evec --field $FIELD --out $VCF_RUTH
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
jjfarrell commented 4 years ago

I tried a different algorithm (--nelder-mead) and got a segmentation fault instead of a bus error. The variant triggering the error is a rare DUP with mostly het calls but a couple of homozygous samples.

/var/spool/sge/scc-ym2/job_scripts/945469: line 11: 279846 Segmentation fault      ruth --nelder-mead --vcf $VCF --evec adsp5k.evec --field $FIELD --out $VCF_RUTH
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
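
To describe a suspect record like the one above (mostly het calls, a few homozygous samples), the genotype classes can be tallied straight from the VCF line. The following is only a sketch: `tally_genotypes` is a hypothetical helper, not part of RUTH, and it assumes a tab-separated VCF data line with a diploid, biallelic `GT` entry in the FORMAT column.

```python
from collections import Counter


def tally_genotypes(vcf_line: str) -> Counter:
    """Count genotype classes in one VCF data line.

    Assumes a biallelic site with diploid GT calls; anything else is
    counted as "other". Hypothetical helper for illustration only.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")      # FORMAT column, e.g. GT:GQ
    gt_index = format_keys.index("GT")
    counts = Counter()
    for sample in fields[9:]:               # per-sample columns
        parts = sample.split(":")
        gt = parts[gt_index] if gt_index < len(parts) else "."
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            counts["missing"] += 1
        elif len(alleles) != 2:
            counts["other"] += 1
        elif alleles[0] == alleles[1]:
            counts["hom_ref" if alleles[0] == "0" else "hom_alt"] += 1
        else:
            counts["het"] += 1
    return counts
```

Posting these counts alongside the position of the record would make the edge case easier to reproduce without sharing the full VCF.
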
hyunminkang commented 4 years ago

It looks like your input VCF file is truncated.


Hyun Min Kang, Ph.D. Associate Professor of Biostatistics University of Michigan, Ann Arbor Email : hmkang@umich.edu


jjfarrell commented 4 years ago

The error suggests that, but the file is not truncated. It is indexed with tabix with no errors, and zcat vcf.gz|wc runs without an error. If I extract that region into a vcf.gz with tabix (again with no error), the ruth error still occurs on the subset.
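
One more direct check of the "EOF marker is absent" warning is to look at the last 28 bytes of the file, since a complete BGZF file ends with a fixed empty block. The sketch below is an illustration only: `has_bgzf_eof` is a hypothetical helper, and it says nothing about which file the warning in the log refers to, which this thread does not establish.

```python
import os

# The 28-byte empty BGZF block that marks end-of-file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"   # gzip header with an extra field
    "4243" "0200" "1b00"                  # 'BC' subfield: block size - 1
    "0300"                                # empty deflate stream
    "00000000" "00000000"                 # CRC32 and uncompressed size
)


def has_bgzf_eof(path: str) -> bool:
    """Return True if the file ends with the BGZF EOF marker block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

For example, `has_bgzf_eof("chr2_test.vcf.gz")` returning True would confirm that the marker is present in the input on disk.
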

hyunminkang commented 4 years ago

Does bcftools work?


jjfarrell commented 4 years ago

Yes, with no truncation error...

bcftools view chr2_test.vcf.gz|wc
3536 521960 34900080
bcftools view adsp5k.lumpy.duphold.chr2.vcf.gz|wc
340923 1619304786 105957815767

hyunminkang commented 4 years ago

Hmm... then something strange might be going on, because the error is raised in htslib (in bgzf), not in cramore. Are you using the latest version of htslib?

Thanks, Hyun.


jjfarrell commented 4 years ago

Either htslib 1.8 or 1.9. On the test VCF, it runs if the field specified is GL instead of GT.

./ruth.sh chr2_test.vcf.gz GL
Run Ruth on passed variants

Available Options

The following parameters are available. Ones with "[]" are in effect:
                              Input Options : --evec [adsp5k.evec],
                                              --vcf [chr2_test.vcf.gz],
                                              --thin [1.00], --seed,
                                              --num-pc [4], --field [GL],
                                              --gt-error [5.0e-03],
                                              --lambda [1.00]
                             Output Options : --out [chr2_test.ruth.vcf.gz],
                                              --skip-if, --skip-info,
                                              --site-only, --nelder-mead,
                                              --lrt-test, --lrt-em
                        Samples to focus on : --sm-list
             Parameters for sex chromosomes : --sex-map, --x-label [X],
                                              --y-label [Y], --mt-label [MT],
                                              --x-start [2699520],
                                              --x-stop [154931044]
   Options to specify when chunking is used : --ref, --unit [2147483647],
                                              --interval, --region

Run with --help for more detailed help messages of each argument.

NOTICE [2019/11/20 15:08:32] - Analysis Started
NOTICE [2019/11/20 15:08:32] - Reading sample eigenvectors
NOTICE [2019/11/20 15:08:32] - Identifying sample columns to extract..
NOTICE [2019/11/20 15:08:32] - Reading in BCFs...
NOTICE [2019/11/20 15:08:32] - Finished identifying 4789 samples to load from VCF/BCF
NOTICE [2019/11/20 15:08:37] - Reading 100 variants at chr2:1527586, Skipping 0, Missing 0.
NOTICE [2019/11/20 15:08:38] - Analysis Finished
[farrell@scc-hadoop duphold]$

[farrell@scc-hadoop duphold]$ ./ruth.sh chr2_test.vcf.gz GT
Run Ruth on passed variants

Available Options

The following parameters are available. Ones with "[]" are in effect:
                              Input Options : --evec [adsp5k.evec],
                                              --vcf [chr2_test.vcf.gz],
                                              --thin [1.00], --seed,
                                              --num-pc [4], --field [GT],
                                              --gt-error [5.0e-03],
                                              --lambda [1.00]
                             Output Options : --out [chr2_test.ruth.vcf.gz],
                                              --skip-if, --skip-info,
                                              --site-only, --nelder-mead,
                                              --lrt-test, --lrt-em
                        Samples to focus on : --sm-list
             Parameters for sex chromosomes : --sex-map, --x-label [X],
                                              --y-label [Y], --mt-label [MT],
                                              --x-start [2699520],
                                              --x-stop [154931044]
   Options to specify when chunking is used : --ref, --unit [2147483647],
                                              --interval, --region

Run with --help for more detailed help messages of each argument.

NOTICE [2019/11/20 15:08:49] - Analysis Started
NOTICE [2019/11/20 15:08:49] - Reading sample eigenvectors
NOTICE [2019/11/20 15:08:49] - Identifying sample columns to extract..
NOTICE [2019/11/20 15:08:49] - Reading in BCFs...
NOTICE [2019/11/20 15:08:49] - Finished identifying 4789 samples to load from VCF/BCF
NOTICE [2019/11/20 15:08:54] - Reading 100 variants at chr2:1527586, Skipping 0, Missing 0.
./ruth.sh: line 11: 34803 Segmentation fault      ruth --vcf $VCF --evec adsp5k.evec --field $FIELD --out $VCF_RUTH
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
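
Since the crash reproduces on a small extract, the single triggering record could be isolated by bisection. The sketch below is generic and makes assumptions: `find_trigger` and the `fails` callback are hypothetical names, and the search assumes that one record on its own is enough to cause the failure. In practice `fails` would have to write the subset plus the VCF header to a file, run ruth on it, and report whether the process died (in Python, `subprocess.run(...).returncode` is negative when a process is killed by a signal on POSIX); that part is left out here.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def find_trigger(records: Sequence[T],
                 fails: Callable[[Sequence[T]], bool]) -> T:
    """Bisect `records` down to one record for which `fails` is True.

    Assumes a single record alone is sufficient to trigger the failure.
    """
    if not fails(records):
        raise ValueError("the full record set does not fail")
    lo, hi = 0, len(records)          # invariant: records[lo:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(records[lo:mid]):
            hi = mid                  # trigger is in the first half
        else:
            lo = mid                  # otherwise it is in the second half
    return records[lo]
```

With about 100 records in the test extract, this needs roughly seven runs rather than one run per record.
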
jjfarrell commented 1 year ago

I am no longer seeing this error with the latest version on recently created VCF files.