torch.inverse and torch.lu_solve give wrong results for singular matrices

pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

https://pytorch.org

Other

84.48k stars 22.76k forks source link

torch.inverse and torch.lu_solve give wrong results for singular matrices #48572

Closed NJdevPro closed 2 years ago

NJdevPro commented 3 years ago

🐛 Bug

torch.lu_solve and torch.inverse give (completely wrong) results for singular matrices instead of returning an error.

To Reproduce

t1 = tensor([[1.,2,3],[-4.,.5,6]])
t2 = tensor([[1.,-2],[3.,4],[5.,-6]])
A = torch.matmul(t2,t1)
# det is zero or near zero: A is singular
A.det()

b = torch.normal(mean=0.0, std=1.0, size=(3,1))

# solve Ax = b
LU_A = torch.lu(A)
x = torch.lu_solve(b, *LU_A) # this should give an error message that A is singular instead of spitting out garbage

# Verification: should return a vector of zeros, but doesn't
torch.matmul(A, x) - b

Expected behavior

For singular/non-invertible matrices, lu_solve and inverse should return an error/exception instead of silently giving completely incorrect results. This can lead to dangerous calculations if the user doesn't have the mathematical background to realize that the result is simply false. Raising an error is the behavior of numpy.linalg.solve, btw, so for consistency, Torch should do the same.

Environment

PyTorch Version (e.g., 1.0): 1.7.0+cu101
OS (e.g., Linux): cloud install
How you installed PyTorch (conda, pip, source): pip
Build command you used (if compiling from source):
Python version: 3.6.9 (default, Oct 8 2020, 12:12:24) [GCC 8.4.0]
CUDA/cuDNN version:
GPU models and configuration:
Any other relevant information:

Additional context

cc @ezyang @gchanan @zou3519 @bdhirsh @vishwakftw @jianyuh @nikitaved @pearu @mruberry @heitorschueroff @walterddr

xwang233 commented 3 years ago

If you are working on GPU,

torch.inverse for singular matrix is a known issue https://github.com/pytorch/pytorch/issues/46557, and was fixed in https://github.com/pytorch/pytorch/pull/46625. You may use a recent nightly build to get the correct torch.inverse behavior.

mruberry commented 3 years ago

Thank you for reporting this issue, @NJdevPro, we'll take a look soon!

nikitaved commented 3 years ago

There are common issues with most PyTorch routines which are inherited from LAPACK/MAGMA - not all of them handle rank-deficient cases.

I think if the user intentionally uses a LAPACK routine and is aware of its limitations, it is fine to just return a LAPACK error. Sufficient for torch.inverse, not necessarily for torch.lu_solve. If we are to introduce a flag to handle rank-deficient cases, then we need to support solve for general matrices. The algorithm could be something like this:

We want to solve Ax = b.
1. find r = rank(A)
2. find r linearly independent rows at indices r_rows and r linearly independent columns at indices r_cols.
    Could be done via the QR factorization (with pivoting, currently not in PyTorch) run on A and A^T.
3. solve A[r_rows, r_cols] y = b[r_rows, :] # can still fail if certain rows/columns are almost linearly-dependent.
4. set x[r_cols] = y, x[ [n] / r_cols] = 0.
5. test whether A x == b, otherwise report the system does not have a solution (b is not in range(A)).
6. the user should be aware that the solution is not always unique as any vector z = x + d with Ad = 0 is also a solution.

The algorithm does require synchronization with the CPU to read the rank if the input tensors are CUDA tensors. In fact, any rank-deficient algorithm that handles degeneracies will suffer from this issue.

@mruberry , what do you think? Shall we extend the functionality to rank-deficient cases with an additional flag? We will only need the tensor of ranks to always live on CPU. I think this price is worth paying given that ML is going in the direction of larger but sparser/lower dimensionality models...

rgommers commented 3 years ago

Thanks for the explanation @nikitaved. So the options seem to be:

Clearly document current behavior, and let the user deal with the problem themselves. The docs at https://pytorch.org/docs/master/linalg.html do not mention rank deficiency at all yet - seems important.
Add some safeguards like @nikitaved proposes unconditionally.
Add some safeguards that can be enabled/disabled by the user (e.g. via a keyword), to trade performance for reliability.

(1) isn't very user-friendly, (2) is going to make some people unhappy because it costs performance, the problem with (3) is that it's hard to predict when rank deficiency will occur (it may depend on values in the input to a model).

mruberry commented 3 years ago

Thanks for the write-up, @nikitaved. I think we should start with (1) since that can happen now, and we should try to accurately document the state of PyTorch.

One thing I don't understand yet is how our planned future is different from NumPy and SciPy. In the future we'll have

torch.linalg.solve analogous to np.linalg.solve
torch.linalg.inv analogous to np.linalg.inv
torch.linalg.lu_solve analogous to scipy.linalg.lu_solve

None of these three functions exist in PyTorch yet. @nikitaved, should we solve this by implementing those functions to be consistent with the mathematics of their NumPy and SciPy counterparts instead of trying to update the behavior of torch.inverse and torch.lu_solve?

rgommers commented 3 years ago

None of these three functions exist in PyTorch yet

The docs for torch.linalg.solve in PR gh-48456 do contain a note about behaviour for singular input, will throw a RuntimeError.

The docs for torch.linalg.inv in PR gh-48261 mention a similar thing (RuntimeError) for non-invertible matrices. Cc @IvanYashchuk.

I'll note that both NumPy and SciPy raise a custom error (LinalgError), which may be more useful than RuntimeError because it's easier to catch and more descriptive.

should we solve this by implementing those functions to be consistent with the mathematics of their NumPy and SciPy counterparts instead of trying to update the behavior of torch.inverse and torch.lu_solve?

It seems to me that torch.inverse and torch.lu_solve should still either be fixed or deprecated?

mruberry commented 3 years ago

Thanks @rgommers, that's helpful. I agree we should deprecate torch.inverse when we have torch.linalg.inv, and deprecate torch.lu_solve when we implement torch.linalg.lu_solve.

cc @heitorschueroff who's going to review linalg deprecations for 1.8 and 1.9.

ybagdasa commented 2 years ago

EDIT: see edit below

Has this been resolved? In the docs for torch.inverse, it says "Alias for torch.linalg.inv()" https://pytorch.org/docs/stable/generated/torch.inverse.html

I just ran into this problem when trying to fit linear regression directly by computing the inverse. It was amazing, because for one dataset it gave results consistent with using numpy's linalg library, but for a subset of the same dataset (90+% of it) the results were wildly different. I had to go on a goose chase to track down where the differences were occurring and figured out that they were in the inversion step.

Numpy

Screenshot from 2021-12-01 13-15-11

Torch 1.10

Screenshot from 2021-12-01 13-15-18

This is pretty dangerous if you don't know about it.

EDIT: The difference between numpy and torch implementation actually has to do with float32 vs float64 precision. When casting the tensors/arrays to float64 prior to making the computation, the numpy and torch linear algebra operations agree.

lezcano commented 2 years ago

A few points are in order:

inverse and lu_solve just work with invertible matrices. If the matrix is close to singular, they may return the wrong results. You may want to look at linalg.lstsq to do that in the singular case, although it just works on CPU for singular inputs, so you'd need to cast your inputs to cpu and back to use it if you want to use it on GPU.
All the linalg functions have this precision problems with some inputs on float32. If precision is important, cast the inputs to float64. There's not much that we can do about that. Numpy has float64 as its default dtype, while PyTorch has float32 (due to float64 being much slower on GPUs.
The invertibility problems are already documented in linalg.inv
Once the PR https://github.com/pytorch/pytorch/pull/66933 is merged, using linalg.lu_factor will result in a similar error as that of linalg.inv. Now, as always, if you use close-to-singular matrices on float32 precision you are in for a rough time.

I will add linalg.lu_solve soon™, where I'll clean the API and so on, but this one will have the same caveats as torch.lu_solve when it comes to close-to-singular inputs. Even more, this expects as input the output of linalg.lu_factor, so the check for singular inputs will be made in linalg.lu_factor, and linalg.lu_solve will assume that the given factored matrix is sufficiently well-conditioned.

Alas, I don't think that there's much we can do to address this issue as it boils down to the usual caveats of using floating point numbers. I'd say we close it.

lezcano commented 2 years ago

Closing for now.