k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

Error in back propagation #969

Closed. HalflingWizard closed this issue 2 years ago.

HalflingWizard commented 2 years ago

The FSA I'm working on has these properties: `Valid|Nonempty|TopSorted|TopSortedAndAcyclic|ArcSorted|ArcSortedAndDeterministic|MaybeAccessible|MaybeCoaccessible`.

When I use `get_tot_scores` to get the scores of the model, I can call the `backward` method with no problems. But when I call `get_arc_post`, `get_backward_scores`, or `get_forward_scores` (to get all paths, not only the best one), an error occurs when I want to backpropagate.
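To make the call pattern concrete, here is a minimal sketch with a toy FSA (not my actual graph; the exact arguments are just illustrative):

```python
import k2

# Toy acceptor just to illustrate the call pattern; my real graph is
# built elsewhere and has the properties listed above.
s = '''
0 1 1 0.1
0 1 2 0.2
1 2 -1 0.3
2
'''
fsa = k2.Fsa.from_str(s)
fsa.requires_grad_(True)
fsa_vec = k2.create_fsa_vec([fsa])
print(fsa_vec.properties_str)

# Backpropagating through get_tot_scores works fine:
#   tot = fsa_vec.get_tot_scores(use_double_scores=False, log_semiring=True)
#   (-tot.sum()).backward()

# But doing the same through get_arc_post (or get_forward_scores /
# get_backward_scores) fails:
ap = fsa_vec.get_arc_post(use_double_scores=False, log_semiring=True)
score = ap.sum()
(-score).backward()  # <-- error is raised here
```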

For example, the following is the stack trace when I use `get_arc_post`:

/usr/local/lib/python3.7/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219                 retain_graph=retain_graph,
    220                 create_graph=create_graph)
--> 221         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222 
    223     def register_hook(self, hook):

/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    130     Variable._execution_engine.run_backward(
    131         tensors, grad_tensors_, retain_graph, create_graph,
--> 132         allow_unreachable=True)  # allow_unreachable flag
    133 
    134 

/usr/local/lib/python3.7/dist-packages/torch/autograd/function.py in apply(self, *args)
     87     def apply(self, *args):
     88         # _forward_cls is defined by derived class
---> 89         return self._forward_cls.backward(self, *args)  # type: ignore
     90 
     91 

/usr/local/lib/python3.7/dist-packages/k2/autograd.py in backward(ctx, arc_post_grad)
    334         incoming_arcs = fsas._get_incoming_arcs()
    335         forward_scores_grad, backward_scores_grad = bprop_func(
--> 336             fsas.arcs, incoming_arcs, arc_post_grad)
    337         arc_scores_grad = arc_post_grad.detach().clone()
    338 

I would appreciate it if anyone could suggest a solution to this problem.

csukuangfj commented 2 years ago

> an error occurs when I want to backpropagate.

Do you have detailed error logs?

HalflingWizard commented 2 years ago

> Do you have detailed error logs?

Here are the errors:

Detailed error log when I use `get_forward_scores`:

```python
RuntimeError                              Traceback (most recent call last)
<ipython-input-...> in <module>()
      1 score = fs.sum()
----> 2 (-score).backward()

3 frames
/usr/local/lib/python3.7/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219                 retain_graph=retain_graph,
    220                 create_graph=create_graph)
--> 221         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222 
    223     def register_hook(self, hook):

/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    130     Variable._execution_engine.run_backward(
    131         tensors, grad_tensors_, retain_graph, create_graph,
--> 132         allow_unreachable=True)  # allow_unreachable flag
    133 
    134 

/usr/local/lib/python3.7/dist-packages/torch/autograd/function.py in apply(self, *args)
     87     def apply(self, *args):
     88         # _forward_cls is defined by derived class
---> 89         return self._forward_cls.backward(self, *args)  # type: ignore
     90 
     91 

/usr/local/lib/python3.7/dist-packages/k2/autograd.py in backward(ctx, forward_scores_grad)
    187             entering_arcs=entering_arcs,
    188             forward_scores=forward_scores,
--> 189             forward_scores_deriv=forward_scores_grad)
    190 
    191         return (

RuntimeError:
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new
```

Detailed error log when I use `get_backward_scores`:

```python
RuntimeError                              Traceback (most recent call last)
<ipython-input-...> in <module>()
      1 score = bs.sum()
----> 2 (-score).backward()

3 frames
/usr/local/lib/python3.7/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219                 retain_graph=retain_graph,
    220                 create_graph=create_graph)
--> 221         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222 
    223     def register_hook(self, hook):

/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    130     Variable._execution_engine.run_backward(
    131         tensors, grad_tensors_, retain_graph, create_graph,
--> 132         allow_unreachable=True)  # allow_unreachable flag
    133 
    134 

/usr/local/lib/python3.7/dist-packages/torch/autograd/function.py in apply(self, *args)
     87     def apply(self, *args):
     88         # _forward_cls is defined by derived class
---> 89         return self._forward_cls.backward(self, *args)  # type: ignore
     90 
     91 

/usr/local/lib/python3.7/dist-packages/k2/autograd.py in backward(ctx, backward_scores_grad)
    258             log_semiring=log_semiring,
    259             backward_scores=backward_scores,
--> 260             backward_scores_deriv=backward_scores_grad)
    261 
    262         return (

RuntimeError:
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new
```

Detailed error log when I use `get_arc_post`:

```python
RuntimeError                              Traceback (most recent call last)
<ipython-input-...> in <module>()
      1 score = ap.sum()
----> 2 (-score).backward()

3 frames
/usr/local/lib/python3.7/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219                 retain_graph=retain_graph,
    220                 create_graph=create_graph)
--> 221         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222 
    223     def register_hook(self, hook):

/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    130     Variable._execution_engine.run_backward(
    131         tensors, grad_tensors_, retain_graph, create_graph,
--> 132         allow_unreachable=True)  # allow_unreachable flag
    133 
    134 

/usr/local/lib/python3.7/dist-packages/torch/autograd/function.py in apply(self, *args)
     87     def apply(self, *args):
     88         # _forward_cls is defined by derived class
---> 89         return self._forward_cls.backward(self, *args)  # type: ignore
     90 
     91 

/usr/local/lib/python3.7/dist-packages/k2/autograd.py in backward(ctx, arc_post_grad)
    334         incoming_arcs = fsas._get_incoming_arcs()
    335         forward_scores_grad, backward_scores_grad = bprop_func(
--> 336             fsas.arcs, incoming_arcs, arc_post_grad)
    337         arc_scores_grad = arc_post_grad.detach().clone()
    338 

RuntimeError:
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new
```

But I guess they are not what you were asking for... I should try again using debug mode, right?

danpovey commented 2 years ago

Is there any way to get the IPython stuff out of the way and just use the regular Python prompt, and/or run it inside gdb, as in `gdb --args python args` and then `r` at the `(gdb)` prompt? I feel like that might be a confusing factor. Who knows what that is adding into the mix.

HalflingWizard commented 2 years ago

I used the regular Python prompt as you suggested, and here is the stack trace for `get_arc_post`:

[F] /home/runner/work/k2/k2/k2/python/csrc/torch/torch_util.h:124:k2::Array1<U> k2::FromTorch(at::Tensor) [with T = float] Check failed: tensor.strides()[0] == 1 (0 vs. 1) Expected stride: 1. Given: 0

[ Stack-Trace: ]
/usr/local/lib/python3.7/dist-packages/libk2_log.so(k2::internal::GetStackTrace()+0x4c) [0x7f7aeb1fe45c]
/usr/local/lib/python3.7/dist-packages/_k2.cpython-37m-x86_64-linux-gnu.so(+0x28f8a) [0x7f7aef2b7f8a]
/usr/local/lib/python3.7/dist-packages/_k2.cpython-37m-x86_64-linux-gnu.so(+0x3e3db) [0x7f7aef2cd3db]
/usr/local/lib/python3.7/dist-packages/_k2.cpython-37m-x86_64-linux-gnu.so(+0x5a967) [0x7f7aef2e9967]
/usr/local/lib/python3.7/dist-packages/_k2.cpython-37m-x86_64-linux-gnu.so(+0x206dc) [0x7f7aef2af6dc]
python3(_PyMethodDef_RawFastCallKeywords+0x264) [0x593784]
python3(_PyEval_EvalFrameDefault+0x3cf4) [0x515244]
python3(_PyFunction_FastCallDict+0x15a) [0x4bc98a]
python3(_PyEval_EvalFrameDefault+0x1f56) [0x5134a6]
python3(_PyEval_EvalCodeWithName+0x346) [0x549576]
python3(_PyFunction_FastCallDict+0x2e9) [0x4bcb19]
python3() [0x59c019]
python3(PyObject_Call+0x66) [0x595ef6]
/usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so(torch::autograd::PyNode::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&)+0x183) [0x7f7b4778faa3]
/usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so(+0x2bfed70) [0x7f7b39080d70]
/usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so(torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)+0x14e0) [0x7f7b3907ca60]
/usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so(torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)+0x4a0) [0x7f7b3907d670]
/usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so(torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>)+0x490) [0x7f7b3907b310]
/usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so(torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>)+0x3c) [0x7f7b4778723c]
/usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so(torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&)+0xacd) [0x7f7b3907a36d]
/usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so(torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&)+0x4e) [0x7f7b4778703e]
/usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so(THPEngine_run_backward(THPEngine*, _object*, _object*)+0xe3f) [0x7f7b4778810f]
python3(_PyMethodDef_RawFastCallKeywords+0x315) [0x593835]
python3() [0x548c51]
python3(_PyEval_EvalFrameDefault+0x12a1) [0x5127f1]
python3(_PyEval_EvalCodeWithName+0x346) [0x549576]
python3(_PyFunction_FastCallKeywords+0x37e) [0x593fce]
python3() [0x548ae9]
python3(_PyEval_EvalFrameDefault+0x411f) [0x51566f]
python3(_PyEval_EvalCodeWithName+0x346) [0x549576]
python3(_PyFunction_FastCallKeywords+0x37e) [0x593fce]
python3(_PyEval_EvalFrameDefault+0x8dc) [0x511e2c]
python3(_PyEval_EvalCodeWithName+0x346) [0x549576]
python3(PyEval_EvalCode+0x23) [0x604173]
python3() [0x5f5506]
python3(PyRun_FileExFlags+0x9c) [0x5f8c6c]
python3(PyRun_SimpleFileExFlags+0x196) [0x5f9206]
python3() [0x64faf2]
python3(_Py_UnixMain+0x2e) [0x64fc4e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f7b4bdfac87]
python3(_start+0x2a) [0x5b621a]

Traceback (most recent call last):
  File "/content/lm_wfst/code.py", line 211, in <module>
    (-score).backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/usr/local/lib/python3.7/dist-packages/k2/autograd.py", line 336, in backward
    fsas.arcs, incoming_arcs, arc_post_grad)
RuntimeError: 
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new

I hope this is what you wanted.

csukuangfj commented 2 years ago

I see what the issue is.

RuntimeError                              Traceback (most recent call last)
<ipython-input-40-78c710214bf8> in <module>()
      1 score = ap.sum()
----> 2 (-score).backward()
[F] /home/runner/work/k2/k2/k2/python/csrc/torch/torch_util.h:124:k2::Array1<U> k2::FromTorch(at::Tensor) [with T = float] Check failed: tensor.strides()[0] == 1 (0 vs. 1) Expected stride: 1. Given: 0

PyTorch sets the stride of the gradient of `ap` to 0, i.e. it uses a single scalar to represent the whole tensor, and such an expanded tensor cannot be handled by k2's 1-D array, which expects a stride of 1.
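Here is a small standalone PyTorch snippet (not k2 code, just illustrating the diagnosis) showing how `sum()` hands a stride-0 gradient to a custom `Function`'s `backward`, and that `.contiguous()` restores stride 1:

```python
import torch


class Identity(torch.autograd.Function):
    # Trivial pass-through used only to inspect the incoming gradient.
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        # sum() expands its scalar upstream gradient to the shape of the
        # output, so `grad` arrives with shape (5,) but stride (0,).
        print(grad.shape, grad.stride(), grad.is_contiguous())
        # .contiguous() (or .clone()) materializes it with stride 1,
        # which is what k2's 1-D array conversion expects.
        return grad.contiguous()


x = torch.randn(5, requires_grad=True)
score = Identity.apply(x).sum()
(-score).backward()  # prints: torch.Size([5]) (0,) False
```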

Will make a PR to fix it.

csukuangfj commented 2 years ago

@HalflingWizard Please try #970.

You can replace /usr/local/lib/python3.7/dist-packages/k2/autograd.py with the file from #970
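If patching the installed file is inconvenient, an untested stop-gap on the user side (just a sketch of the same idea, not part of k2) is to force the gradient to be contiguous before it reaches k2's backward:

```python
import torch


class MakeGradContiguous(torch.autograd.Function):
    # Hypothetical helper, not part of k2: passes values through unchanged
    # and forces the gradient to stride 1 on the way back.
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return grad.contiguous()


# Assuming `fsa_vec` is the FsaVec from the example above:
ap = MakeGradContiguous.apply(
    fsa_vec.get_arc_post(use_double_scores=False, log_semiring=True))
(-ap.sum()).backward()
```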

HalflingWizard commented 2 years ago

> You can replace /usr/local/lib/python3.7/dist-packages/k2/autograd.py with the file from #970

Thanks, it works now.