pytorch / functorch

functorch provides JAX-like composable function transforms for PyTorch.
https://pytorch.org/functorch/
BSD 3-Clause "New" or "Revised" License

Exception handling in Linux binaries seems off #916

Open zou3519 opened 2 years ago

zou3519 commented 2 years ago

This only happens on one of my machines. It does not happen in our CI machines. Could just be a me-problem.

Repro:

import torch
from functorch import vmap
x = torch.randn(2, 3, 5)
vmap(lambda x: x, out_dims=3)(x)

Produces:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_src/vmap.py", line 366, in wrapped
    return _unwrap_batched(batched_outputs, out_dims, vmap_level, batch_size, func)
  File "/private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_src/vmap.py", line 165, in _unwrap_batched
    flat_outputs = [
  File "/private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_src/vmap.py", line 166, in <listcomp>
    _remove_batch_dim(batched_output, vmap_level, batch_size, out_dim)
RuntimeError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
Exception raised from maybe_wrap_dim_slow at ../c10/core/WrapDimMinimal.cpp:29 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f10a018e612 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packa
ges/torch/lib/libc10.so)
frame #1: c10::detail::maybe_wrap_dim_slow(long, long, bool) + 0x3d3 (0x7f10a017c023 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packa
ges/torch/lib/libc10.so)
frame #2: at::functorch::_remove_batch_dim(at::Tensor const&, long, long, long) + 0x5e8 (0x7f0ff6088678 in /private/home/rzou/local/miniconda3/envs/py39/lib/p
ython3.9/site-packages/functorch/_C.so)
frame #3: <unknown function> + 0x23b502 (0x7f0ff608c502 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_C.so)
frame #4: <unknown function> + 0x1ff6e2 (0x7f0ff60506e2 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_C.so)
<omitting python frames>
frame #27: __libc_start_main + 0xf3 (0x7f10f1ae70b3 in /lib/x86_64-linux-gnu/libc.so.6)

I would expect the error message to look like the following:

>>> vmap(lambda x: x, out_dims=3)(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 361, in wrapped
    return _flat_vmap(
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 488, in _flat_vmap
    return _unwrap_batched(batched_outputs, out_dims, vmap_level, batch_size, func)
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 165, in _unwrap_batched
    flat_outputs = [
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 166, in <listcomp>
    _remove_batch_dim(batched_output, vmap_level, batch_size, out_dim)
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
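The check that fires here is ordinary dimension wrapping. A minimal pure-Python sketch of what c10's maybe_wrap_dim_slow enforces (the function name below is hypothetical; the [-ndim, ndim-1] range and message mirror the traceback above):

```python
def maybe_wrap_dim(dim: int, ndim: int) -> int:
    # Valid dims lie in [-ndim, ndim - 1]; negative dims wrap around.
    # A sketch of the range check, not the actual c10 implementation.
    if not (-ndim <= dim <= ndim - 1):
        raise IndexError(
            f"Dimension out of range (expected to be in range of "
            f"[{-ndim}, {ndim - 1}], but got {dim})"
        )
    return dim + ndim if dim < 0 else dim
```

For a rank-3 tensor, out_dims=3 is out of range, so the error itself is correct; the bug report is about it surfacing as RuntimeError (plus a C++ stack dump) instead of the expected IndexError.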
malfet commented 2 years ago

Looks like a linker/runtime compatibility problem (i.e., one C++ runtime does not know how to talk to another one, or how to parse its unwind instructions).
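A loose Python analogy (not torch's actual binding code) for why that degrades the error class: c10 throws a derived error type, and the binding layer translates it to a specific Python exception. If cross-runtime unwinding can't match the derived type, only the base-class handler fires, so Python sees RuntimeError instead of IndexError:

```python
class C10Error(Exception):
    """Stand-in for the base c10::Error (hypothetical names)."""

class C10IndexError(C10Error):
    """Stand-in for the derived index-error type."""

def translate(exc: C10Error, type_matching_works: bool) -> Exception:
    # When RTTI/type matching works across the library boundary, the
    # specific handler runs; otherwise the exception falls through to
    # the base handler and is reported as a generic RuntimeError.
    if type_matching_works and isinstance(exc, C10IndexError):
        return IndexError(str(exc))
    return RuntimeError(str(exc))
```

This matches the symptom above: the same out-of-range dimension surfaces as IndexError on a healthy setup and as RuntimeError on the broken one.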

malfet commented 2 years ago

Though it works for me on Ubuntu 18.04, running the following commands:

$ conda create -n py38-torch112-cpu python=3.8
$ conda activate py38-torch112-cpu
$ python3 -mpip install --pre torch==1.12 -f https://download.pytorch.org/whl/test/cpu/torch_test.html
$ pip install functorch-0.2.0-cp38-cp38-linux_x86_64.whl 
$ python
Python 3.8.13 (default, Mar 28 2022, 11:38:47) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from functorch import vmap
>>> x=torch.rand(2, 3, 5)
<stdin>:1: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:68.)
>>> vmap(lambda x: x, out_dims=3)(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fsx/users/nshulga/conda/envs/py38-torch112-cpu/lib/python3.8/site-packages/functorch/_src/vmap.py", line 366, in wrapped
    return _unwrap_batched(batched_outputs, out_dims, vmap_level, batch_size, func)
  File "/fsx/users/nshulga/conda/envs/py38-torch112-cpu/lib/python3.8/site-packages/functorch/_src/vmap.py", line 165, in _unwrap_batched
    flat_outputs = [
  File "/fsx/users/nshulga/conda/envs/py38-torch112-cpu/lib/python3.8/site-packages/functorch/_src/vmap.py", line 166, in <listcomp>
    _remove_batch_dim(batched_output, vmap_level, batch_size, out_dim)
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
zou3519 commented 2 years ago

There's also a related problem (which was discussed offline): the PyTorch cu102 binaries don't include the _ZNSt19basic_ostringstreamIcSt11char_traitsIcESaIcEEC1Ev symbol, but the PyTorch cpu/cu113/cu116 binaries do. On some systems, libstdc++.so.6 doesn't include it either, which leads to a "symbol missing" error on import functorch.
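One way to check whether the system's C++ runtime actually exports that symbol, without running nm by hand (a sketch; ctypes.util.find_library may return None on non-Linux or stripped environments):

```python
import ctypes
import ctypes.util

# Locate the system libstdc++ and probe for the mangled symbol the
# functorch extension needs at load time.
SYMBOL = "_ZNSt19basic_ostringstreamIcSt11char_traitsIcESaIcEEC1Ev"

path = ctypes.util.find_library("stdc++")
if path is not None:
    lib = ctypes.CDLL(path)
    # hasattr triggers a dlsym lookup; AttributeError means the symbol
    # is absent, which would reproduce the import failure.
    print(SYMBOL, "present:", hasattr(lib, SYMBOL))
else:
    print("libstdc++ not found via ctypes.util")
```

The equivalent manual check is nm -D on libstdc++.so.6 piped through grep for the symbol name.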

malfet commented 2 years ago

Just to clarify:

$ c++filt _ZNSt19basic_ostringstreamIcSt11char_traitsIcESaIcEEC1Ev
std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream()