nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com

process pool error in medaka tandem #484

Closed zhengxinchang closed 4 months ago

zhengxinchang commented 6 months ago

Hi, developers,

I've recently come across an issue while utilizing medaka tandem. The error message that appeared is as follows:

[10:02:53 - TR] Encountered exception whilst processing tr_chr1_1693710_1694380_pad_1693700_1694390_fwd_hap0: A process in the process pool was terminated abruptly while the future was running or pending.

The command is:

singularity exec -f -B /run:/run  --bind /stornext/:/stornext/ /softwares/medaka/medaka.sif    medaka tandem  --model res_medaka_tandem_r1041_e82_400bps_sup_v420.tar.gz  input.bam hg38_allchr_fixchrMT.fa ./adotto_TRregions_v1.2.bed.small   male test_out

Thank you in advance, and please let me know if you need anything from my end.

Sincerely, Xinchang

mwykes commented 6 months ago

Hi there, thanks for reporting this. May I ask a few questions to help diagnose the issue?

  1. Were you running this on just one region (chr1:1693710-1694380) or several regions? If you were running on several regions, did you see the same error message for all regions, or just one?

  2. How did you install medaka and abpoa into your singularity container?

  3. What is the result of running the abpoa python example inside your singularity container?

  4. Would you be willing to share your input files (or perhaps just a subset of the bed/bam containing the problematic region(s))? This can be done privately if necessary.
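
For a run over many regions, question 1 above can be answered from the full log by grouping region names by the exception reported for them. The following is a minimal sketch, not part of medaka: the function name `tally_failed_regions` is hypothetical, and the pattern is modelled only on the single log line quoted in this issue, so other medaka versions may word the message differently.

```python
import re
from collections import defaultdict

# Pattern modelled on the one log line quoted in this issue (assumption:
# other medaka versions may format this message differently).
EXC_LINE = re.compile(
    r"Encountered exception whilst processing (?P<region>\S+): (?P<message>.*)"
)


def tally_failed_regions(log_lines):
    """Group region names by the exception message reported for them."""
    regions_by_message = defaultdict(list)
    for line in log_lines:
        match = EXC_LINE.search(line)
        if match:
            message = match["message"].strip()
            regions_by_message[message].append(match["region"])
    return dict(regions_by_message)


if __name__ == "__main__":
    # Two lines copied from the logs in this thread; only the first matches.
    example = [
        "[10:02:53 - TR] Encountered exception whilst processing "
        "tr_chr1_1693710_1694380_pad_1693700_1694390_fwd_hap0: A process in "
        "the process pool was terminated abruptly while the future was "
        "running or pending.",
        "[16:30:21 - TR] Writing reference chunks to "
        "test_one_region_one_thread/ref_chunks.fasta.",
    ]
    for message, regions in tally_failed_regions(example).items():
        print(f"{len(regions)} region(s): {message}")
        for region in regions:
            print(f"  {region}")
```

If every failing region ends up under the same message, the problem is probably not specific to one locus; several distinct messages would suggest more than one underlying issue.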

zhengxinchang commented 6 months ago

Of course, thank you for your reply!

  1. Were you running this on just one region (chr1:1693710-1694380) or several regions? If you were running on several regions, did you see the same error message for all regions, or just one? A: I was running on multiple regions and saw several error messages for different regions.

  2. How did you install medaka and abpoa into your singularity container? A: I first installed medaka with pip3 install medaka and then installed abpoa with pip3 install pyabpoa. The version of medaka is 1.11.3 and the version of pyabpoa is 1.4.2.

  3. What is the result of running the abpoa python example inside your singularity container? A: abpoa seems to be working in my container. The screenshot of the output is as follows: [image]

  4. Would you be willing to share your input files (or perhaps just a subset of the bed/bam containing the problematic region(s). This can be done privately if necessary.

Of course. One of the input files I used was downloaded from the 1000G website at the following links:

https://s3.amazonaws.com/1000g-ont/UltraLong/minimap2_2.24_alignment_data/HG00233_R10_UL/HG00233_R10_UL.ULK114.R10.dorado034.sup.5mCG_5hmCG.all.phased.bam
https://s3.amazonaws.com/1000g-ont/UltraLong/minimap2_2.24_alignment_data/HG00233_R10_UL/HG00233_R10_UL.ULK114.R10.dorado034.sup.5mCG_5hmCG.all.phased.bam.bai

The docker image that I built can be found here.

The model file was downloaded from here.

The region file was downloaded from here.

Please let me know if you need more information.

Thanks Xinchang

TanyaDvorkina commented 6 months ago

Hi Xinchang,

Thank you for providing the data!

Thank you!

Tatiana

zhengxinchang commented 6 months ago

Hi Tatiana,

Thank you for reaching out. I still encountered this problem when using 1 region and 1 thread. The log is:

INFO:    User not listed in /etc/subuid, trying root-mapped namespace
INFO:    Using fakeroot command combined with root-mapped namespace
INFO:    unknown argument ignored: lazytime
[16:30:20 - TR] Running medaka tr with options: /usr/local/bin/medaka tandem --threads 1 --model res_medaka_tandem_r1041_e82_400bps_sup_v420.tar.gz  $data/HG00142.LSK114.R10.dorado034.sup.5mCG_5hmCG.all.phased.bam $data/reference/hg38_allchr_fixchrMT.fa ./adotto_TRregions_v1.2.bed.one male test_one_region_one_thread
[16:30:21 - TR] tr_chr1_9975_10498_pad_9965_10508_fwd_hap0: Retrieved too few reads (0 < 3)
[16:30:21 - TR] Created 0 consensus with 0 alignments.
[16:30:21 - TR] Writing trimmed reads to poa draft medaka input bam for 0 to test_one_region_one_thread/trimmed_reads_to_poa.bam.
[16:30:21 - TR] Writing poa consensus sequences to test_one_region_one_thread/poa.fasta.
[16:30:21 - TR] Writing trimmed reads to test_one_region_one_thread/trimmed_reads.fasta.
[16:30:21 - TR] Writing reference chunks to test_one_region_one_thread/ref_chunks.fasta.
[16:30:21 - BAM2VCF] Writing variants to test_one_region_one_thread/poa_to_ref.TR.vcf
[16:30:53 - TR] Running medaka consensus.
2023-12-22 16:30:54.066016: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-22 16:30:55.138438: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/.singularity.d/libs
2023-12-22 16:30:55.138572: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-12-22 16:30:55.243554: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-22 16:31:01.585984: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/.singularity.d/libs
2023-12-22 16:31:01.587078: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/.singularity.d/libs
2023-12-22 16:31:01.587115: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[16:31:04 - Predict] Setting tensorflow inter/intra-op threads to 1/1.
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/medaka/prediction.py", line 114, in predict
    bam_regions = medaka.common.get_bam_regions(
  File "/usr/local/lib/python3.10/dist-packages/medaka/common.py", line 706, in get_bam_regions
    with pysam.AlignmentFile(bam) as bam_fh:
  File "pysam/libcalignmentfile.pyx", line 748, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 997, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/medaka", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/medaka/medaka.py", line 814, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/medaka/tandem.py", line 993, in main
    _ = fut.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
ValueError: file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False
Command exited with non-zero status 1
        Command being timed: "singularity exec -f -B /run:/run --bind /stornext/:/stornext/ $data/softwares/medaka/medaka.1.11.3.sif medaka tandem --threads 1 --model res_medaka_tandem_r1041_e82_400bps_sup_v420.tar.gz $data/HG00142.LSK114.R10.dorado034.sup.5mCG_5hmCG.all.phased.bam $data/reference/hg38_allchr_fixchrMT.fa ./adotto_TRregions_v1.2.bed.one male test_one_region_one_thread"
        User time (seconds): 31.46
        System time (seconds): 8.66
        Percent of CPU this job got: 86%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:46.43
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 870688
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 2915
        Minor (reclaiming a frame) page faults: 2280558
        Voluntary context switches: 183714
        Involuntary context switches: 283
        Swaps: 0
        File system inputs: 1612704
        File system outputs: 104
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1
start: 2023-12-22 16:30:18
end: 2023-12-22 16:31:04
duration: 46s

The reference I used can be found here.

Thank you! Xinchang

TanyaDvorkina commented 6 months ago

Thank you for the quick response!

This log shows a different error, not the one from your first message. Could you please run on a BED file containing only the region chr1:1693710-1694380 from your first message and send me the log? I have downloaded your docker image, but so far I can't reproduce the error on my side.
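
One way to build such a single-region BED file is to filter the larger region file by coordinates. The sketch below is only an illustration under stated assumptions: the helper name `select_bed_records` is hypothetical, the file is assumed to be tab-delimited with chromosome, start and end in the first three columns, and records are kept when they overlap the requested interval (so an off-by-one difference between coordinate conventions does not drop the record).

```python
def select_bed_records(bed_lines, chrom, start, end):
    """Return BED lines on `chrom` that overlap the interval [start, end)."""
    selected = []
    for line in bed_lines:
        # Skip blank lines and BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        rec_start, rec_end = int(fields[1]), int(fields[2])
        if fields[0] == chrom and rec_start < end and start < rec_end:
            selected.append(line if line.endswith("\n") else line + "\n")
    return selected


if __name__ == "__main__":
    # Illustrative records built from the region names in the logs of this
    # thread; the real region file may carry additional columns.
    bed = [
        "chr1\t9975\t10498\n",
        "chr1\t1693710\t1694380\n",
    ]
    for record in select_bed_records(bed, "chr1", 1693710, 1694380):
        print(record, end="")
```

With real files, the lines could be read from the full region file with `open(path)` and the result written out with `writelines` to a new BED file that is then passed to medaka tandem.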

zhengxinchang commented 5 months ago

Dear Tatiana,

Very sorry for the late response. I now have the result (errors) for the location you provided:

The command for this test is:

singularity exec -f -B /run:/run --bind /stornext/:/stornext/ ./medaka/medaka.1.11.3.sif medaka tandem --threads 1 --model res_medaka_tandem_r1041_e82_400bps_sup_v420.tar.gz ./1000g_ont_R10/HG00142.LSK114.R10.dorado034.sup.5mCG_5hmCG.all.phased.bam ./workspace/reference/hg38_allchr_fixchrMT.fa ./adotto_TRregions_v1.2.bed.one2 male test_one_region_one_thread

The error messages are:

INFO:    User not listed in /etc/subuid, trying root-mapped namespace
INFO:    Using fakeroot command combined with root-mapped namespace
INFO:    underlay of /etc/localtime required more than 50 (90) bind mounts
[21:10:40 - TR] Running medaka tr with options: /usr/local/bin/medaka tandem --threads 1 --model res_medaka_tandem_r1041_e82_400bps_sup_v420.tar.gz ./1000g_ont_R10/HG00142.LSK114.R10.dorado034.sup.5mCG_5hmCG.all.phased.bam ./workspace/reference/hg38_allchr_fixchrMT.fa ./adotto_TRregions_v1.2.bed.one2 male test_one_region_one_thread
[21:10:40 - root] The path test_one_region_one_thread exists. Results will be overwritten.
[21:10:41 - TR] Created 2 consensus with 2 alignments.
[21:10:41 - TR] Writing trimmed reads to poa draft medaka input bam for 2 to test_one_region_one_thread/trimmed_reads_to_poa.bam.
[21:10:41 - TR] Writing poa consensus sequences to test_one_region_one_thread/poa.fasta.
[21:10:41 - TR] Writing trimmed reads to test_one_region_one_thread/trimmed_reads.fasta.
[21:10:41 - TR] Writing reference chunks to test_one_region_one_thread/ref_chunks.fasta.
[21:10:41 - BAM2VCF] Writing variants to test_one_region_one_thread/poa_to_ref.TR.vcf
[21:12:16 - TR] Running medaka consensus.
2024-01-08 21:12:16.754478: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-08 21:12:20.242483: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/.singularity.d/libs
2024-01-08 21:12:20.242581: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2024-01-08 21:12:20.546875: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-08 21:12:26.119317: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/.singularity.d/libs
2024-01-08 21:12:26.120191: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/.singularity.d/libs
2024-01-08 21:12:26.120221: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[21:12:29 - Predict] Setting tensorflow inter/intra-op threads to 1/1.
[21:12:29 - Predict] Processing region(s): tr_chr1_1693710_1694380_pad_1693700_1694390_fwd_hap1:0-692 tr_chr1_1693710_1694380_pad_1693700_1694390_fwd_hap2:0-690
[21:12:29 - Predict] Using model: res_medaka_tandem_r1041_e82_400bps_sup_v420.tar.gz.
2024-01-08 21:12:29.887063: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/.singularity.d/libs
2024-01-08 21:12:29.887664: W tensorflow/stream_executor/cuda/cuda_driver.cc:263] failed call to cuInit: UNKNOWN ERROR (303)
2024-01-08 21:12:29.887722: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (c81o-24.sug.hgsc.bcm.edu): /proc/driver/nvidia/version does not exist
[21:12:29 - BAMFile] Creating pool of 16 BAM file sets.
[21:12:29 - Predict] Processing 2 short region(s).
2024-01-08 21:12:29.985225: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[21:12:32 - MdlStrTF] Model <keras.engine.sequential.Sequential object at 0x7efe10df1ae0>
[21:12:32 - MdlStrTF] loading weights from /tmp/tmp2ome2jjw/model/variables/variables (using expect partial)
[21:12:32 - Sampler] Initializing sampler for consensus of region tr_chr1_1693710_1694380_pad_1693700_1694390_fwd_hap1:0-692.
[21:12:32 - Sampler] Initializing sampler for consensus of region tr_chr1_1693710_1694380_pad_1693700_1694390_fwd_hap2:0-690.
[21:12:32 - PWorker] Running inference for 0.0M draft bases.
[21:12:32 - Feature] Processed tr_chr1_1693710_1694380_pad_1693700_1694390_fwd_hap1:0.0-691.0 (median depth 30.0)
[21:12:32 - Sampler] Took 0.10s to make features.
[21:12:32 - Feature] Processed tr_chr1_1693710_1694380_pad_1693700_1694390_fwd_hap2:0.0-689.0 (median depth 19.0)
[21:12:32 - Sampler] Took 0.12s to make features.
[21:12:38 - PWorker] Batches in cache: 2.
[21:12:44 - PWorker] Batches in cache: 1.
[21:12:44 - PWorker] 32.3% Done (0.0/0.0 Mbases) in 11.6s
[21:12:44 - PWorker] Processed 2 batches
[21:12:44 - PWorker] All done, 0 remainder regions.
[21:12:44 - Predict] Finished processing all regions.
[21:12:44 - MdlStrTF] Successfully removed temporary files from /tmp/tmp2ome2jjw.
[21:12:44 - TR] Running medaka stitch.
[21:12:44 - DataIndx] Loaded 1/1 (100.00%) sample files.
[21:12:44 - Stitcher] Stitching tr_chr1_1693710_1694380_pad_1693700_1694390_fwd_hap1:0-692
[21:12:44 - Stitcher] Stitching tr_chr1_1693710_1694380_pad_1693700_1694390_fwd_hap2:0-690
[21:12:44 - TR] Medaka consensus sequences written to test_one_region_one_thread/consensus.fasta
[21:12:44 - BAM2VCF] Writing variants to test_one_region_one_thread/medaka_to_ref.TR.vcf
        Command being timed: "singularity exec -f -B /run:/run --bind /stornext/:/stornext/ /stornext/snfs4/next-gen/scratch/zhengxc/workspace/softwares/medaka/medaka.1.11.3.sif medaka tandem --threads 1 --model res_medaka_tandem_r1041_e82_400bps_sup_v420.tar.gz ./1000g_ont_R10/HG00142.LSK114.R10.dorado034.sup.5mCG_5hmCG.all.phased.bam ./workspace/reference/hg38_allchr_fixchrMT.fa ./adotto_TRregions_v1.2.bed.one2 male test_one_region_one_thread"
        User time (seconds): 95.30
        System time (seconds): 14.44
        Percent of CPU this job got: 70%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:35.63
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 873032
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3328
        Minor (reclaiming a frame) page faults: 1835676
        Voluntary context switches: 61716
        Involuntary context switches: 222
        Swaps: 0
        File system inputs: 7572088
        File system outputs: 14336
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

It seems the concurrency error has disappeared, but another error about missing libraries occurred.

Please let me know if you need any other information.

Sincerely, Xinchang

cjw85 commented 5 months ago

These missing library errors can be safely ignored if you are not trying to use medaka with a GPU device; they result from installing the GPU version of medaka without having the correct GPU libraries installed.

zhengxinchang commented 5 months ago

Thank you for the response. Is there a way to stop medaka from using the GPU via a command-line option or an environment variable? It is hard for us to select cluster nodes without GPU information.

cjw85 commented 5 months ago

The CPU-only version of medaka can be installed with the medaka-cpu package instead of the standard medaka one.
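
On the environment-variable part of the question: the log lines above come from TensorFlow, and two general-purpose variables may help when reinstalling is not convenient. This is a hedged sketch, not something confirmed in this thread for medaka: `CUDA_VISIBLE_DEVICES` is the standard CUDA variable for hiding GPUs from a process, `TF_CPP_MIN_LOG_LEVEL` is TensorFlow's log filter, and the helper name `cpu_only_env` is hypothetical.

```python
import os


def cpu_only_env(base_env=None):
    """Return a copy of the environment with GPUs hidden from TensorFlow."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty CUDA_VISIBLE_DEVICES hides all CUDA devices from the process.
    env["CUDA_VISIBLE_DEVICES"] = ""
    # Level 2 filters TensorFlow's C++ INFO and WARNING lines; ERROR lines
    # (such as the cuBLAS factory message) would still be printed.
    env["TF_CPP_MIN_LOG_LEVEL"] = "2"
    return env


if __name__ == "__main__":
    env = cpu_only_env()
    print("CUDA_VISIBLE_DEVICES =", repr(env["CUDA_VISIBLE_DEVICES"]))
    print("TF_CPP_MIN_LOG_LEVEL =", env["TF_CPP_MIN_LOG_LEVEL"])
```

The returned mapping can be passed as `env=` to `subprocess.run` together with the medaka command. It only changes what TensorFlow sees and prints; it does not change which medaka package is installed, so the medaka-cpu package remains the route suggested by the developers.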

zhengxinchang commented 5 months ago

Thank you! I will try it.

Best regards Xinchang