luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
301 stars 37 forks source link

failed to open temp bcf file #176

Closed davidecarlson closed 3 years ago

davidecarlson commented 3 years ago

Describe the bug Octopus fails while cleaning up temporary bcf files. Here is the error message from the debug log file:

[2021-05-04 07:36:17] <DEBG> Writing 14 calls to output
[2021-05-04 07:36:17] <DEBG> Writing completed task HiC_scaffold_999:0-23242 that finished in 15s
[2021-05-04 07:36:17] <INFO>                     -             100%           5h 52m                 -
[2021-05-04 07:36:17] <DEBG> Merging 6877 temporary VCF files
[2021-05-04 07:37:42] <DEBG> Encountered an error whilst calling, attempting to cleanup
[2021-05-04 07:37:48] <INFO> Removed 13750 temporary files
[2021-05-04 07:37:49] <EROR> A program error has occurred:
[2021-05-04 07:37:49] <EROR>
[2021-05-04 07:37:49] <EROR>     Encountered an exception during calling 'failed to open file
[2021-05-04 07:37:49] <EROR>     /gpfs/projects/HollisterGroup/datahome/rad-seq/gbs_combined/octopus-temp-2/HiC_scaffold_2847_0-25000_temp.bcf'.
[2021-05-04 07:37:49] <EROR>     This means there is a bug and your results are untrustworthy.
[2021-05-04 07:37:49] <EROR>
[2021-05-04 07:37:49] <EROR> To help resolve this error run in debug mode and send the log file to
[2021-05-04 07:37:49] <EROR> https://github.com/luntergroup/octopus/issues.
[2021-05-04 07:37:49] <INFO> ------------------------------------------------------------------------

Version

$ octopus --version
octopus version 0.7.3 (develop d8d26b6b)
Target: x86_64 Linux 3.10.0-1160.24.1.el7.x86_64
SIMD extension: AVX2
Compiler: GNU 10.2.0
Boost: 1_76

Command Command line to install octopus:

octopus/scripts/install.py --forests

Command line to run octopus:

/gpfs/software/octopus/bin/octopus \
--reference $REF \
--threads 40 \
--debug \
--disable-denovo-variant-discovery \
--source-candidates-file variants_list_1_20.txt \
--reads-file bamlist_1_20.txt \
--bamout octopus_out/1_20_minibams \
--filter-expression 'MQ < 20 | MP < 20 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1' \
--output octopus_out/1_20.octopus.vcf

Additional context This is a population calling analysis with 20 individuals following the steps outlined here. I first called each sample individually, and then jointly called them in groups of 10. I'm now combining two groups of 10 for another round of join calling using the previously discovered set of variants.

All other steps in the analysis have finished successfully, but this step consistently fails.

Any suggestions would be greatly appreciated! Thanks, Dave

dancooke commented 3 years ago

Thanks for the bug report. Please can you install f18f97d05641ee944d799fc4de08d941a79ad562 on branch fix/176 and then rerun your failing command with the new --keep-temporary-files option. This will prevent the temporary directory being deleted. If you could then send the temporary file HiC_scaffold_2847_0-25000_temp.bcf, hopefully that will reveal the problem. In addition, can you try increasing your soft maximum open file limit before running the command (e.g., ulimit -n 50000) - I don't think this should be causing the problem but good to rule it out.

davidecarlson commented 3 years ago

Hi Dan, Thanks for your response. I've pulled and built the fix/176 branch, and also made the ulimit change. I'm now rerunning the job, and I'll attach the temporary file, assuming the job fails again. Best, Dave

davidecarlson commented 3 years ago

Hi Dan, The run completed again and failed with, I think, the same error:

[2021-05-05 15:35:31] <DEBG> Writing 14 calls to output [2021-05-05 15:35:31] <DEBG> Writing completed task HiC_scaffold_999:0-23242 that finished in 17s [2021-05-05 15:35:31] <INFO> - 100% 6h 32m - [2021-05-05 15:35:31] <DEBG> Merging 6877 temporary VCF files [2021-05-05 15:38:01] <DEBG> Encountered an error whilst calling, attempting to cleanup [2021-05-05 15:38:01] <EROR> A program error has occurred: [2021-05-05 15:38:01] <EROR> [2021-05-05 15:38:01] <EROR> Encountered an exception during calling 'failed to open file [2021-05-05 15:38:01] <EROR> /gpfs/projects/HollisterGroup/datahome/rad-seq/gbs_combined/octopus-temp-2/HiC_scaffold_2847_0-25000_temp.bcf'. [2021-05-05 15:38:01] <EROR> This means there is a bug and your results are untrustworthy. [2021-05-05 15:38:01] <EROR> [2021-05-05 15:38:01] <EROR> To help resolve this error run in debug mode and send the log file to [2021-05-05 15:38:01] <EROR> https://github.com/luntergroup/octopus/issues. I have attached a zipped version of the HiC_scaffold_2847_0-25000_temp.bcf file. Please let me know if I can provide any additional information. Thanks! Dave HiC_scaffold_2847_0-25000_temp.bcf.zip

dancooke commented 3 years ago

Thanks. Looks like the bug is that some records are not being sorted:

HiC_scaffold_2847   8661    .   G   A,* 5718.55 .   AC=20,9;AN=40;DP=263;MQ=46;NS=20    GT:GQ:DP:MQ:PS:PQ   1|1:122:16:48:8652:100  1|0:71:5:40:8652:97 2|2:209:51:33:8652:6    1|0:168:16:52:8652:100  0|2:91:10:41:8652:7 1|0:99:15:51:8652:41    1|2:165:18:49:8652:6    1|0:271:18:40:8652:100  1|2:33:4:50:8652:16 1|0:46:2:37:8652:66 1|0:407:13:49:8652:100  1|0:168:13:41:8652:100  1|0:379:21:48:8652:100  1|1:30:1:54:8652:53 1|2:193:9:55:8652:100   1|2:72:10:53:8652:83    1|1:30:2:50:8652:50 1|1:32:2:32:8652:56 0|2:81:25:50:8652:100   2|0:297:12:56:8652:6
HiC_scaffold_2847   8660    .   CG  TA,*    1041.78 .   AC=7,22;AN=40;DP=269;MQ=45;NS=20    GT:GQ:DP:MQ:PS:PQ   2|2:122:16:48:8652:100  2|0:71:5:40:8652:97 2|2:209:54:34:8652:6    2|0:168:16:52:8652:100  0|1:91:11:40:8652:7 2|0:99:15:51:8652:41    2|1:165:18:49:8652:6    2|0:271:19:39:8652:100  2|1:33:4:50:8652:16 2|0:46:2:37:8652:66 2|0:407:14:47:8652:100  2|0:168:13:41:8652:100  2|0:379:21:48:8652:100  2|2:30:1:54:8652:53 2|1:193:9:55:8652:100   2|1:72:10:53:8652:83    2|2:30:2:50:8652:50 2|2:32:2:32:8652:56 0|1:81:25:50:8652:100   1|0:297:12:56:8652:6

Could you try just calling contig HiC_scaffold_2847 with a single thread:

$ /gpfs/software/octopus/bin/octopus \
    --reference $REF \
    --debug octopus_HiC_scaffold_2847_debug.log \
    -T HiC_scaffold_2847 \
    --disable-denovo-variant-discovery \
    --source-candidates-file variants_list_1_20.txt \
    --reads-file bamlist_1_20.txt \
    --disable-call-filtering \
    --output octopus_out/1_20.octopus.HiC_scaffold_2847.vcf

and post the full debug log please?

davidecarlson commented 3 years ago

I just ran the command you suggested on the single scaffold, and it completed successfully. The debug log is attached. Thanks! Dave octopus_HiC_scaffold_2847_debug.log.zip

dancooke commented 3 years ago

Thanks. Please could you report the variants found in the two source VCFs (listed in variants_list_1_20.txt) in the region HiC_scaffold_2847:8650-8670. First you'll need to compress and index them (bgzip and tabix), hen you can just post the output of bcftools view x.vcf.gz HiC_scaffold_2847:8650-8670 (replacing x with each VCF). For future reference, note that Octopus can write compressed VCF and it will automatically index the output - just use the extension .vcf.gz for --output.

davidecarlson commented 3 years ago

Thanks for the tip about automatically writing compressed output!

I've attached a zip file containing the two source vcfs showing the variants in the HiC_scaffold_2847:8650-8670 range.

Thanks again for your help and let me know if I can provide anything else! Best, Dave variants_1_20_HiC_scaffold_2847_8650-8670.vcf.zip

dancooke commented 3 years ago

Thanks. Please can you install ee37c64386fafd754daa8d1b5b13c134b218edf2 and try the failing command again?

davidecarlson commented 3 years ago

I've pulled changes and am recompiling. I'll rerun the analysis and let you know how it goes. Thanks! Dave

davidecarlson commented 3 years ago

Happy to report that the analysis finished successfully this time. I'll go ahead and close the ticket. Really appreciate the help! Best, Dave

davidecarlson commented 3 years ago

Hi Dan, Sorry to reopen this ticket, but I received another error similar to the original one. This occurred during the same analysis as discussed previously, after adding another 10 samples into the population calling. Here is the end of the debug log:

[2021-05-14 15:31:36] Writing 9 calls to output [2021-05-14 15:31:36] Writing completed task HiC_scaffold_997:0-23294 that finished in 11s [2021-05-14 15:31:36] Writing completed task HiC_scaffold_998:0-23282 that finished in 53s [2021-05-14 15:31:36] Writing 23 calls to output [2021-05-14 15:31:36] Writing completed task HiC_scaffold_999:0-23242 that finished in 11s [2021-05-14 15:31:36] - 100% 1d 1h - [2021-05-14 15:31:36] Merging 6877 temporary VCF files [2021-05-14 15:35:30] Encountered an error whilst calling, attempting to cleanup [2021-05-14 15:35:31] A program error has occurred: [2021-05-14 15:35:31] [2021-05-14 15:35:31] Encountered an exception during calling 'failed to open file [2021-05-14 15:35:31] /gpfs/projects/HollisterGroup/datahome/rad-seq/gbs_combined/octopus-temp-5/HiC_scaffold_2_0-107610929_temp.bcf'. [2021-05-14 15:35:31] This means there is a bug and your results are untrustworthy. [2021-05-14 15:35:31] [2021-05-14 15:35:31] To help resolve this error run in debug mode and send the log file to [2021-05-14 15:35:31] https://github.com/luntergroup/octopus/issues. [2021-05-14 15:35:31] ------------------------------------------------------------------------

I proceeded to rerun the analysis with a single thread on only the failing scafffold (it's quite large). This time it failed part of the way through the filtering step. Here is the last several lines of the log:

[2021-05-22 16:26:32] Starting Call Set Refinement (CSR) filtering [2021-05-22 16:26:34] ----------------------------------------------------------------------------- [2021-05-22 16:26:34] current | | time | estimated [2021-05-22 16:26:34] position | completed | taken | ttc [2021-05-22 16:26:34] ----------------------------------------------------------------------------- [2021-05-22 16:27:10] HiC_scaffold_2:1080327 1.0% 36s 1h [2021-05-22 16:28:05] HiC_scaffold_2:2157992 2.0% 1m 31s 1h 28m [2021-05-22 16:28:43] HiC_scaffold_2:3251064 3.0% 2m 9s 1h 15m [2021-05-22 16:29:32] HiC_scaffold_2:4334181 4.0% 2m 58s 1h 15m [2021-05-22 16:30:14] HiC_scaffold_2:5411649 5.0% 3m 39s 1h 12m [2021-05-22 16:30:54] HiC_scaffold_2:6491855 6.0% 4m 20s 1h 10m [2021-05-22 16:31:35] HiC_scaffold_2:7568569 7.0% 5m 1s 1h 8m [2021-05-22 16:32:30] HiC_scaffold_2:8644875 8.0% 5m 56s 1h 10m [2021-05-22 16:33:14] HiC_scaffold_2:9743700 9.1% 6m 39s 1h 8m [2021-05-22 16:34:09] HiC_scaffold_2:10821616 10.1% 7m 35s 1h 9m [2021-05-22 16:35:06] HiC_scaffold_2:11898022 11.1% 8m 31s 1h 10m [2021-05-22 16:35:54] HiC_scaffold_2:12974564 12.1% 9m 20s 1h 9m [2021-05-22 16:36:40] HiC_scaffold_2:14053322 13.1% 10m 6s 1h 8m [2021-05-22 16:37:25] HiC_scaffold_2:15138588 14.1% 10m 51s 1h 7m [2021-05-22 16:38:21] HiC_scaffold_2:16216141 15.1% 11m 47s 1h 7m [2021-05-22 16:39:13] HiC_scaffold_2:17293950 16.1% 12m 39s 1h 7m [2021-05-22 16:40:12] HiC_scaffold_2:18370436 17.1% 13m 37s 1h 7m [2021-05-22 16:41:07] HiC_scaffold_2:19446626 18.1% 14m 32s 1h 7m [2021-05-22 16:42:05] HiC_scaffold_2:20532055 19.1% 15m 31s 1h 7m [2021-05-22 16:42:56] HiC_scaffold_2:21611592 20.1% 16m 22s 1h 6m [2021-05-22 16:43:50] HiC_scaffold_2:22693038 21.1% 17m 16s 1h 5m [2021-05-22 16:44:55] HiC_scaffold_2:23773127 22.1% 18m 21s 1h 5m [2021-05-22 16:45:57] HiC_scaffold_2:24863003 23.1% 19m 23s 1h 5m [2021-05-22 16:46:56] HiC_scaffold_2:25939142 24.1% 20m 22s 1h 5m [2021-05-22 16:47:56] HiC_scaffold_2:27025121 25.1% 21m 22s 1h 4m [2021-05-22 16:48:37] HiC_scaffold_2:28103090 26.1% 22m 3s 1h 3m [2021-05-22 16:49:31] HiC_scaffold_2:29185090 27.1% 22m 56s 1h 2m [2021-05-22 16:50:17] HiC_scaffold_2:30261231 28.1% 23m 43s 1h 1m [2021-05-22 16:51:08] HiC_scaffold_2:31337464 29.1% 24m 34s 1h [2021-05-22 16:52:03] HiC_scaffold_2:32415108 30.1% 25m 29s 59m 52s [2021-05-22 16:53:05] HiC_scaffold_2:33491271 31.1% 26m 31s 59m 25s [2021-05-22 16:54:06] HiC_scaffold_2:34567440 32.1% 27m 32s 58m 53s [2021-05-22 16:54:59] HiC_scaffold_2:35658377 33.1% 28m 25s 58m 2s [2021-05-22 16:55:42] HiC_scaffold_2:36734671 34.1% 29m 8s 56m 53s [2021-05-22 16:56:29] HiC_scaffold_2:37818303 35.1% 29m 55s 55m 51s [2021-05-22 16:57:24] HiC_scaffold_2:38897429 36.1% 30m 50s 55m 5s [2021-05-22 16:57:25] - 100% 30m 51s - [2021-05-22 16:57:26] A program error has occurred: [2021-05-22 16:57:26] [2021-05-22 16:57:26] Encountered an exception during calling 'BadRegion'. This means [2021-05-22 16:57:26] there is a bug and your results are untrustworthy. [2021-05-22 16:57:26] [2021-05-22 16:57:26] To help resolve this error run in debug mode and send the log file to [2021-05-22 16:57:26] https://github.com/luntergroup/octopus/issues. [2021-05-22 16:57:26] ------------------------------------------------------------------------

And here is last lines of the debug log:

[2021-05-22 16:57:24] There are 4354 reads in HiC_scaffold_2:38906238-38912184 after filtering [2021-05-22 16:57:24] Measuring block HiC_scaffold_2:38906338-38906599 containing 5 calls [2021-05-22 16:57:24] Request HiC_scaffold_2:38906684-38906735 is already cached [2021-05-22 16:57:24] Measuring block HiC_scaffold_2:38906684-38906735 containing 4 calls [2021-05-22 16:57:25] Request HiC_scaffold_2:38906797-38907484 is already cached [2021-05-22 16:57:25] Measuring block HiC_scaffold_2:38906797-38907484 containing 9 calls [2021-05-22 16:57:25] Request HiC_scaffold_2:38907798-38907997 is already cached [2021-05-22 16:57:25] Measuring block HiC_scaffold_2:38907798-38907997 containing 6 calls [2021-05-22 16:57:25] Request HiC_scaffold_2:38908136-38908331 is already cached [2021-05-22 16:57:25] Measuring block HiC_scaffold_2:38908136-38908331 containing 2 calls [2021-05-22 16:57:25] Request HiC_scaffold_2:38909341-38909391 is already cached [2021-05-22 16:57:25] - 100% 30m 51s - [2021-05-22 16:57:25] Encountered an error whilst filtering, attempting to cleanup [2021-05-22 16:57:26] A program error has occurred: [2021-05-22 16:57:26] [2021-05-22 16:57:26] Encountered an exception during calling 'BadRegion'. This means [2021-05-22 16:57:26] there is a bug and your results are untrustworthy. [2021-05-22 16:57:26] [2021-05-22 16:57:26] To help resolve this error run in debug mode and send the log file to [2021-05-22 16:57:26] https://github.com/luntergroup/octopus/issues. [2021-05-22 16:57:26] ------------------------------------------------------------------------

I've attached a bcf file of the unfiltered variants. I've also attached a partial (incomplete) vcf file of the filtered variants, that should show where the failure occurred.

Any idea what could be causing this? Please let me know if I can provide any additional info. Thanks! Dave

1_40.octopus.vcf.gz

1_40.octopus.unfiltered.bcf.zip

dancooke commented 3 years ago

Looks like another record sorting issue:

HiC_scaffold_2  38909372    .   CCCCC   TCCCCT  246.06  .   AC=3;AN=38;DP=29;MQ=58;NS=14    GT:GQ:DP:MQ:PS:PQ   .|0:9:0:0:38909342:78   .|0:9:0:0:38909342:78   .|0:9:0:0:38909342:78   .|.:55:1:60:38909342:100    .|1:18:2:60:38909342:100    .|.:54:1:60:38909342:100    .|0:9:0:0:38909342:78   .|.:61:5:53:38909342:100    .|.:14:3:60:38909342:36 .|0:9:0:0:38909342:78   .|0:9:0:0:38909342:78   .|0:9:0:0:38909342:7.|0:9:6:59:38909342:78  .|0:9:0:0:38909342:78   .|0:9:0:0:38909342:78   .|0:9:0:0:38909342:78   .|0:10:0:0:38909342:81  .|0:9:0:0:38909342:78   .|0:10:0:0:38909342:81  1|1:17:1:60:38909342:100    .|0:10:0:0:38909342:81  .|0:9:0:0:38909342:78   .|0:13:0:0:38909342:84  .|0:9:0:0:38909342:78   .|0:13:0:0:38909342:90  0|0:31:2:60:38909342:100    0|0:30:3:60:38909342:100    .|0:10:0:0:38909342:81  0|.:55:1:60:38909342:60 .|0:10:0:0:38909342:81  .|0:52:1:60:38909342:100    .|0:9:0:0:38909342:78   .|0:30:0:0:38909342:70  .|0:9:0:0:38909342:78   .|0:13:0:0:38909342:84  .|0:4:1:60:38909342:88  .|0:13:0:0:38909342:84  .|.:55:1:60:38909342:100    .|0:9:0:0:38909342:78   .|0:55:1:60:38909342:100
HiC_scaffold_2  38909363    .   C   T   57.29   .   AC=1;AN=80;DP=29;MQ=58;NS=14    GT:GQ:DP:MQ:PS:PQ   0|0:68:0:0:38909342:78  0|0:68:0:0:38909342:78  0|0:68:0:0:38909342:78  0|0:87:1:60:38909342:100    0|0:114:2:60:38909342:100   0|0:69:1:60:38909342:100    0|0:68:0:0:38909342:78  0|0:81:5:53:38909342:100    0|0:114:3:60:38909342:36    0|0:68:0:0:38909342:78  0|0:68:0:0:38909342:78  0|0:68:0:0:38909342:78  0|0:68:6:59:38909342:78 0|0:68:0:0:38909342:78  0|0:68:0:0:38909342:78  0|0:68:0:0:38909342:78  0|0:71:0:0:38909342:81  0|0:68:0:0:38909342:78  0|0:71:0:0:38909342:81  0|0:180:1:60:38909342:100   0|0:71:0:0:38909342:81  0|0:68:0:0:38909342:78  0|0:74:0:0:38909342:84  0|0:68:0:0:38909342:78  0|0:80:0:0:38909342:90  0|0:190:2:60:38909342:100   0|0:144:3:60:38909342:100   0|0:71:0:0:38909342:81  0|1:57:1:60:38909342:60 0|0:71:0:0:38909342:81  0|0:95:1:60:38909342:100    0|0:68:0:0:38909342:78  0|0:59:0:0:38909342:70  0|0:68:0:0:38909342:78  0|0:74:0:0:38909342:84  0|0:77:1:60:38909342:88 0|0:74:0:0:38909342:84  0|0:75:1:60:38909342:100    0|0:68:0:0:38909342:78  0|0:87:1:60:38909342:100

That's certainly a strange one - these records don't even overlap. Please could you send the full debug log for the single-thread run?

davidecarlson commented 3 years ago

Hi Dan, Sure. However, the full debug log appears to be too large, even when compressed, to attach to the message. Is there another way I can get it to you? Thanks! Dave

dancooke commented 3 years ago

I just shared a DropBox with you (david.carlson@stonybrook.edu) where you can upload the log.

davidecarlson commented 3 years ago

Thanks! I just added the full debug log there. Best, Dave

dancooke commented 3 years ago

Thanks. Please can you try 46b2a460ae6b333040264adc0c787f70ac91cfb9 on develop?

davidecarlson commented 3 years ago

Hi Dan, I built and ran version 46b2a46, and everything worked successfully on that scaffold. I'll go ahead and rerun the full analysis an confirm that it's working successfully. Best, Dave

davidecarlson commented 3 years ago

Happy to report that the full analysis on the entire assembly completed successfully. Thanks again! Dave