Hi,
When I run remora infer, there are some warning messages and the processing halts, leaving an empty BAM file.
(remora) [user@centos nano_NOME-seq]$ remora infer from_pod5_and_bam negative.pod5 negative.bam --model train_results/model_best.pt --out-bam negative_infer.bam --device 0
Indexing BAM by read id: 12872 Reads [00:01, 8287.74 Reads/s]
[22:16:44] Found 12872 BAM records and 12872 POD5 reads
Inferring mods: 0%| | 0/12872 [00:00<?, ? Reads/s]
/home/user/anaconda3/envs/remora/lib/python3.8/site-packages/remora/data_chunks.py:515: UserWarning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason. To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback` (Triggered internally at ../third_party/nvfuser/csrc/manager.cpp:335.)
  model.forward(
I've tried
export PYTORCH_NVFUSER_DISABLE=fallback
before running. The warning messages are gone, but the inference output is still empty. Could anyone help? Thanks!
It seems the issue appears only when I try to infer on my GPU; the CPU-based inference runs fine. Is it because I installed remora with pip inside a conda environment?
I have had several issues with conda installations and use venv and pip myself. I would recommend an installation without conda to test the issue. Additionally, I would recommend adding the --log-filename option and reporting the output, as there may be some useful debug messages.
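For reference, a conda-free setup would look something like the commands below (file names taken from your command above; the venv path is just an example):

python3 -m venv remora_venv
source remora_venv/bin/activate
pip install --upgrade pip
pip install ont-remora
remora infer from_pod5_and_bam negative.pod5 negative.bam --model train_results/model_best.pt --out-bam negative_infer.bam --device 0 --log-filename log.txt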
Thanks, I will give the new installation method a try and keep this thread updated.
Hi @marcus1487, the venv-installed remora still does not infer correctly on the GPU. Below are the logs.
DEBUG [09:54:51:MainProcess:MainThread:log.py:67] Command: """/mnt/raid0/Yijun_Tian/nano_NOME-seq/venv/bin/remora infer from_pod5_and_bam negative.pod5 negative.bam --model train_results/model_best.pt --out-bam negative_infer.bam --device 0 --log-filename log.txt"""
DEBUG [09:54:51:MainProcess:MainThread:model_util.py:422] Using torchscript model
DEBUG [09:54:52:MainProcess:MainThread:model_util.py:400] Loaded Remora model attrs
creation_date : 04/11/2023, 12:23:01
kmer_context_bases : [4, 4]
chunk_context : [50, 50]
base_pred : False
mod_bases : m
base_start_justify : False
offset : 0
model_params : {'size': 64, 'kmer_len': 9, 'num_out': 2}
num_motifs : 1
doc_string : Nanopore Remora model
model_version : 3
mod_long_names : ['5mC']
kmer_len : 9
chunk_len : 100
motifs : [('GC', 1)]
can_base : C
motif : ('GC', 1)
alphabet_str : loaded modified base model to call (alt to C): m=5mC
sig_map_refiner : Loaded 0-mer table with 0 central position.
[09:54:54] Found 12879 BAM records and 12879 POD5 reads
DEBUG [09:54:54:ExtractSignal_filler:MainThread:util.py:387] Starting ExtractSignal background filler
DEBUG [09:54:54:ExtractSignal_filler:MainThread:io.py:509] Reading from POD5 at negative.pod5
DEBUG [09:54:54:AddAlignments_0:MainThread:util.py:358] Starting AddAlignments worker
DEBUG [09:54:54:MainProcess:InferMods_0:util.py:358] Starting InferMods worker
DEBUG [09:54:54:PrepBatches_0:MainThread:util.py:358] Starting PrepBatches worker
DEBUG [09:54:56:ExtractSignal_filler:MainThread:io.py:521] Completed pod5 signal worker
DEBUG [09:54:56:ExtractSignal_filler:MainThread:io.py:536] Completed signal worker
DEBUG [09:54:56:ExtractSignal_filler:MainThread:util.py:398] Completed ExtractSignal background filler
DEBUG [09:54:59:MainProcess:InferMods_0:util.py:367] UNEXPECTED_ERROR in InferMods worker: 'The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: shape '[2, 0, 1]' is invalid for input of size 1536000
'.
Full traceback: Traceback (most recent call last):
File "/mnt/raid0/Yijun_Tian/nano_NOME-seq/venv/lib/python3.8/site-packages/remora/util.py", line 364, in _mt_func
out_q.put(func(val, *args, **kwargs))
File "/mnt/raid0/Yijun_Tian/nano_NOME-seq/venv/lib/python3.8/site-packages/remora/inference.py", line 245, in run_model
nn_out, labels, pos = remora_read.run_model(model)
File "/mnt/raid0/Yijun_Tian/nano_NOME-seq/venv/lib/python3.8/site-packages/remora/data_chunks.py", line 515, in run_model
model.forward(
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: shape '[2, 0, 1]' is invalid for input of size 1536000
DEBUG [09:54:59:MainProcess:InferMods_0:util.py:367] UNEXPECTED_ERROR in InferMods worker: 'The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: stack.size() >= frames.back().function->n_inputs INTERNAL ASSERT FAILED at "../torch/csrc/jit/runtime/interpreter.cpp":241, please report a bug to PyTorch.
'.
Full traceback: Traceback (most recent call last):
File "/mnt/raid0/Yijun_Tian/nano_NOME-seq/venv/lib/python3.8/site-packages/remora/util.py", line 364, in _mt_func
out_q.put(func(val, *args, **kwargs))
File "/mnt/raid0/Yijun_Tian/nano_NOME-seq/venv/lib/python3.8/site-packages/remora/inference.py", line 245, in run_model
nn_out, labels, pos = remora_read.run_model(model)
File "/mnt/raid0/Yijun_Tian/nano_NOME-seq/venv/lib/python3.8/site-packages/remora/data_chunks.py", line 515, in run_model
model.forward(
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: stack.size() >= frames.back().function->n_inputs INTERNAL ASSERT FAILED at "../torch/csrc/jit/runtime/interpreter.cpp":241, please report a bug to PyTorch.
The log keeps looping between "DEBUG" and "Full traceback". It sounds like there is something wrong with PyTorch?
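A quick, torch-only sanity check to run in the same venv can at least rule out a broken GPU install (a diagnostic sketch, nothing remora-specific):

import torch

# Report the torch build and whether CUDA is usable at all.
print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
    # A trivial GPU op to confirm kernels actually launch.
    x = torch.randn(8, 8, device="cuda:0")
    print("matmul OK:", (x @ x).sum().item())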
same issue here!
Hi @Puputnik, which Linux are you running remora on? The issue occurs on my CentOS 7, and I am trying to test it on Ubuntu 20.04.
18.04.5
This may have something to do with the way multiprocessing works in the current version. We have moved the GPU processing to the main thread, which seems a bit more stable. This will be included in the next release.
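For anyone curious, the idea is roughly the pattern sketched below: worker threads only prepare batches on the CPU, and the main thread is the only place that calls the model on the GPU. This is a generic illustration, not Remora's actual code; the sigs/enc_kmers names and the batch source are placeholders.

import queue
import threading
import torch

def prep_worker(batches, out_q):
    # CPU-only work: build the input tensors and hand them to the main thread.
    for sigs, enc_kmers in batches:
        out_q.put((sigs, enc_kmers))
    out_q.put(None)  # sentinel: no more batches

def infer_in_main_thread(model, batches, device="cuda:0"):
    # Only the main thread ever touches the GPU model.
    out_q = queue.Queue(maxsize=8)
    threading.Thread(target=prep_worker, args=(batches, out_q), daemon=True).start()
    model = model.to(device).eval()
    results = []
    with torch.no_grad():
        while (item := out_q.get()) is not None:
            sigs, enc_kmers = item
            results.append(model(sigs.to(device), enc_kmers.to(device)).cpu())
    return results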
@Puputnik @marcus1487, my tests on CentOS 7 and Ubuntu 20.04 failed with the same logs when using the GPU to infer. I hope there will be an easy fix in the next release.
@marcus1487 @Puputnik Just fixed the error by rolling back to torch 1.13.1.
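For anyone hitting the same thing, the rollback is just pinning torch inside the same environment (the default Linux wheel for 1.13.1 bundles CUDA 11.7; pick a different build if your driver needs it):

pip install "torch==1.13.1"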
Version 2.1.0 has some changes which should help address these issues without rolling torch back. Please let me know if you continue to experience issues with remora inference.
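Upgrading in place and confirming the installed version should be enough to pick this up:

pip install --upgrade ont-remora
remora --version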
Hi @marcus1487, the GPU inference can start now, but it terminates after calling one read. The initial RuntimeError ("shape '[2, 0, 1]' is invalid for input of size xxx") persists after updating to remora 2.1.0:
>remora infer from_pod5_and_bam negative.pod5 negative.bam --model train_results/model_best.pt --out-bam negative_infer.bam --device 0
******************** WARNING [08:33:08:MainProcess:MainThread:model_util.py:253]: reverse signal attribute not found in model. Assuming False ********************
******************** WARNING [08:33:08:MainProcess:MainThread:refine_signal_map.py:258]: K-mer table provided, but not used. See rough rescaling options. ********************
Indexing BAM by read id: 29043 Reads [00:02, 13081.72 Reads/s]
[08:33:11] Extracting read IDs from POD5
[08:33:11] Found 12945 BAM records, 26970 POD5 reads, and 12945 in common
Inferring mods: 0%| | 1/12945 [00:08<30:55:50, 8.60s/ Reads, 0.17 Msamps/s]
Traceback (most recent call last):
File "/mnt/raid0/yijun_tian/ont_remora/venv/bin/remora", line 8, in <module>
sys.exit(run())
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/main.py", line 71, in run
cmd_func(args)
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/parsers.py", line 1115, in run_infer_from_pod5_and_bam
infer_from_pod5_and_bam(
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/inference.py", line 360, in infer_from_pod5_and_bam
read_errs = run_model(
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/inference.py", line 243, in run_model
nn_out, labels, pos = remora_read.run_model(model)
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/data_chunks.py", line 513, in run_model
output = model(sigs, enc_kmers).detach().cpu().numpy()
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: shape '[2, 0, 1]' is invalid for input of size 1572864
Here is my venv package list for the above error run:
>pip list
Package Version
------------------------ --------------------
attrs 22.2.0
certifi 2022.12.7
charset-normalizer 3.1.0
cmake 3.26.3
contourpy 1.0.7
cycler 0.11.0
Cython 0.29.34
filelock 3.11.0
fonttools 4.39.3
h5py 3.8.0
idna 3.4
importlib-resources 5.12.0
iso8601 1.1.0
Jinja2 3.1.2
joblib 1.2.0
jsonschema 4.17.3
kiwisolver 1.4.4
lib-pod5 0.1.16
lit 16.0.1
MarkupSafe 2.1.2
matplotlib 3.7.1
more-itertools 9.1.0
mpmath 1.3.0
networkx 3.1
numpy 1.24.2
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
ont-remora 2.1.0
packaging 23.1
pandas 2.0.0
parasail 1.3.4
Pillow 9.5.0
pip 20.0.2
pkg-resources 0.0.0
pkgutil-resolve-name 1.3.10
pod5 0.1.16
pyarrow 11.0.0
pyparsing 3.0.9
pyrsistent 0.19.3
pysam 0.21.0
python-dateutil 2.8.2
pytz 2023.3
requests 2.28.2
scikit-learn 1.2.2
scipy 1.10.1
seaborn 0.12.2
setuptools 44.0.0
six 1.16.0
sympy 1.11.1
tabulate 0.9.0
thop 0.1.1.post2209072238
threadpoolctl 3.1.0
toml 0.10.2
torch 2.0.0
tqdm 4.65.0
triton 2.0.0
typing-extensions 4.5.0
tzdata 2023.3
urllib3 1.26.15
vbz-h5py-plugin 1.0.1
wheel 0.40.0
zipp 3.15.0
@Yijun-Tian Was this model trained with the same version of torch? It seems like there is a model structure issue. I'm wondering if a model trained/saved with torch<2 is not compatible with inference using torch>=2. Could you try training the model with the same version of torch? It does not need to be trained to high accuracy for now, just enough to test that the model will run.
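A quick way to test just the load-and-run step, without going through the full pipeline, is something like the sketch below. It assumes model_best.pt is a TorchScript archive (the "Using torchscript model" line in your log suggests it is) and uses the input shapes implied by the printed model structure (signal of shape (batch, 1, 100), k-mer encoding of shape (batch, 36, 100)); adjust if your chunk/k-mer settings differ.

import torch

# Load the checkpoint directly onto the GPU.
model = torch.jit.load("train_results/model_best.pt", map_location="cuda:0")
model.eval()

# Dummy inputs matching chunk_len=100 and a 4-base x 9-mer encoding (36 channels),
# as implied by sig_conv1/seq_conv1 in the printed model structure.
sigs = torch.randn(64, 1, 100, device="cuda:0")
enc_kmers = torch.randn(64, 36, 100, device="cuda:0")

with torch.no_grad():
    out = model(sigs, enc_kmers)
print(torch.__version__, out.shape)  # expect torch.Size([64, 2]) for this two-class model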
@marcus1487 I just trained the model with remora 2.1.0 and torch 2.0.0, repeating everything from the data preparation steps. The error still persists, and inference stops after calling 1 or 2 reads:
(remora) poe@poe$ remora model train chunks.npz --model remora/models/ConvLSTM_w_ref.py --device 0 --output-path train_results
[10:21:14] Seed selected is 3812293298
[10:21:14] Loading dataset from Remora file
[10:21:15] Dataset loaded with labels: Counter({0: 190908, 1: 5069})
[10:21:15] Dataset summary:
num chunks : 195977
label distribution : Counter({0: 190908, 1: 5069})
base_pred : False
mod_bases : m
mod_long_names : ('5mC',)
kmer_context_bases : (4, 4)
chunk_context : (50, 50)
motifs : [('GC', 1)]
reverse_signal : False
chunk_extract_base_start : False
chunk_extract_offset : 0
sig_map_refiner : No Remora signal refine/map settings loaded
[10:21:15] Loading model
[10:21:15] Model structure:
network(
(sig_conv1): Conv1d(1, 4, kernel_size=(5,), stride=(1,))
(sig_conv2): Conv1d(4, 16, kernel_size=(5,), stride=(1,))
(sig_conv3): Conv1d(16, 64, kernel_size=(9,), stride=(3,))
(seq_conv1): Conv1d(36, 16, kernel_size=(5,), stride=(1,))
(seq_conv2): Conv1d(16, 64, kernel_size=(13,), stride=(3,))
(merge_conv1): Conv1d(128, 64, kernel_size=(5,), stride=(1,))
(lstm1): LSTM(64, 64)
(lstm2): LSTM(64, 64)
(fc): Linear(in_features=64, out_features=2, bias=True)
(dropout): Dropout(p=0.3, inplace=False)
(sig_bn1): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(sig_bn2): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(sig_bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(seq_bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(seq_bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(merge_bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
[10:21:15] Params (k) 134.08 | MACs (M) 3663.72
[10:21:15] Preparing training settings
[10:21:18] Label distribution: Counter({0: 190908, 1: 5069})
[10:21:19] Train label distribution: Counter({0: 189000, 1: 5019})
[10:21:19] Held-out validation label distribution: Counter({0: 1908, 1: 50})
[10:21:19] Training set validation label distribution: Counter({0: 1889, 1: 50})
[10:21:19] Running initial validation
Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.38it/s]
Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 72.00it/s]
[10:21:20] Start training
Epochs: 32%| 16/50 [01:50<03:54, 6.90s/it, acc_train=0.9902, acc_val=0.9780, loss_train=0.040839, loss_val=0.079555]
[10:23:10] No validation accuracy improvement after 5 epoch(s). Stopping training early.
[10:23:10] Saving final model checkpoint
[10:23:10] Training complete
(remora) poe@poe$ remora infer from_pod5_and_bam negative.pod5 negative.bam --model train_results/model_best.pt --out-bam negative_infer.bam --device 0
******************** WARNING [10:23:29:MainProcess:MainThread:refine_signal_map.py:258]: K-mer table provided, but not used. See rough rescaling options. ********************
Indexing BAM by read id: 29043 Reads [00:02, 13322.97 Reads/s]
[10:23:31] Extracting read IDs from POD5
[10:23:32] Found 12945 BAM records, 26970 POD5 reads, and 12945 in common
Inferring mods: 0%| | 1/12945 [00:08<29:33:57, 8.22s/ Reads, 0.18 Msamps/s]
Traceback (most recent call last):
File "/mnt/raid0/yijun_tian/ont_remora/venv/bin/remora", line 8, in <module>
sys.exit(run())
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/main.py", line 71, in run
cmd_func(args)
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/parsers.py", line 1115, in run_infer_from_pod5_and_bam
infer_from_pod5_and_bam(
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/inference.py", line 360, in infer_from_pod5_and_bam
read_errs = run_model(
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/inference.py", line 243, in run_model
nn_out, labels, pos = remora_read.run_model(model)
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/data_chunks.py", line 513, in run_model
output = model(sigs, enc_kmers).detach().cpu().numpy()
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: shape '[2, 0, 1]' is invalid for input of size 1572864
(remora) poe@poe$
(remora) poe@poe$ remora infer from_pod5_and_bam positive.pod5 positive.bam --model train_results/model_best.pt --out-bam positive_infer.bam --device 0
******************** WARNING [10:23:47:MainProcess:MainThread:refine_signal_map.py:258]: K-mer table provided, but not used. See rough rescaling options. ********************
Indexing BAM by read id: 673 Reads [00:00, 11223.72 Reads/s]
[10:23:47] Extracting read IDs from POD5
[10:23:47] Found 468 BAM records, 2152 POD5 reads, and 468 in common
Inferring mods: 0%|▌ | 2/468 [00:05<21:48, 2.81s/ Reads, 0.10 Msamps/s]
Traceback (most recent call last):
File "/mnt/raid0/yijun_tian/ont_remora/venv/bin/remora", line 8, in <module>
sys.exit(run())
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/main.py", line 71, in run
cmd_func(args)
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/parsers.py", line 1115, in run_infer_from_pod5_and_bam
infer_from_pod5_and_bam(
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/inference.py", line 360, in infer_from_pod5_and_bam
read_errs = run_model(
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/inference.py", line 243, in run_model
nn_out, labels, pos = remora_read.run_model(model)
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/remora/data_chunks.py", line 513, in run_model
output = model(sigs, enc_kmers).detach().cpu().numpy()
File "/mnt/raid0/yijun_tian/ont_remora/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: shape '[2, 0, 1]' is invalid for input of size 279552
(remora) poe@poe$ remora --version
Remora version: 2.1.0
(remora) poe@poe$
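For context on the recurring message rather than a fix: "shape '[2, 0, 1]' is invalid for input of size N" is PyTorch's generic reshape/view error, raised when a requested target shape cannot hold the tensor's elements. A shape containing a 0 can never match a non-empty tensor, which is why N changes between runs (1536000, 1572864, 279552) while the error text does not. One guess, and only a guess, is that a (2, 0, 1) dimension order meant for a permute ends up where a shape is expected inside the scripted graph. A minimal, remora-independent reproduction of the error class:

import torch

x = torch.randn(2, 100, 64)    # any non-empty 3-D tensor
y = x.permute(2, 0, 1)         # fine: (2, 0, 1) used as a dim order
print(y.shape)                 # torch.Size([64, 2, 100])
try:
    x.reshape(2, 0, 1)         # not fine: (2, 0, 1) used as a target shape
except RuntimeError as e:
    print(e)                   # shape '[2, 0, 1]' is invalid for input of size 12800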
Same here.
same here