escoolioinglesias opened 1 year ago
I believe there was a feature to do that automatically (in lightning?)
@34j Oh, okay, with AMP. You're right. Why do you think this is bugging out? Maybe the issue happens when executing fft_r2c_backward.
Perhaps it's an issue to be raised with PyTorch.
It seems that the environment variable PYTORCH_ENABLE_MPS_FALLBACK should be set to 1; please try that: https://stackoverflow.com/a/72416727
Thank you, @34j
$ conda env config vars list
PYTORCH_ENABLE_MPS_FALLBACK = 1
But when I run svc train -t
I'm still getting:
Training: 0it [00:00, ?it/s] INFO [15:24:11] Setting current epoch to 0 train.py:198
INFO [15:24:11] Setting total batch idx to 0 train.py:213
INFO [15:24:11] Setting global step to 0 train.py:203
Epoch 0: 0%| | 0/10 [00:00<?, ?it/s]libc++abi: terminating with uncaught exception of type c10::Error: Unsupported type byte size: ComplexFloat
Exception raised from getGatherScatterScalarType at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/View.mm:758 (most recent call first):
frame #0: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 92 (0x1054e92b8 in libc10.dylib)
frame #1: at::native::mps::getGatherScatterScalarType(at::Tensor const&) + 304 (0x15e92266c in libtorch_cpu.dylib)
frame #2: invocation function for block in at::native::mps::gatherViewTensor(at::Tensor const&, at::Tensor&) + 128 (0x15e9241bc in libtorch_cpu.dylib)
frame #3: _dispatch_client_callout + 20 (0x1bee501b4 in libdispatch.dylib)
frame #4: _dispatch_lane_barrier_sync_invoke_and_complete + 56 (0x1bee5f414 in libdispatch.dylib)
frame #5: at::native::mps::gatherViewTensor(at::Tensor const&, at::Tensor&) + 888 (0x15e922d54 in libtorch_cpu.dylib)
frame #6: at::native::mps::mps_copy_(at::Tensor&, at::Tensor const&, bool) + 3096 (0x15e87a47c in libtorch_cpu.dylib)
frame #7: at::native::copy_impl(at::Tensor&, at::Tensor const&, bool) + 1944 (0x15a5f6fe0 in libtorch_cpu.dylib)
frame #8: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 100 (0x15a5f6788 in libtorch_cpu.dylib)
frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool), &(torch::ADInplaceOrView::copy_(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool))>, at::Tensor&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool> >, at::Tensor& (c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) + 76 (0x15e5521a8 in libtorch_cpu.dylib)
frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool), &(torch::autograd::VariableType::(anonymous namespace)::copy_(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool))>, at::Tensor&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool> >, at::Tensor& (c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) + 772 (0x15e54f880 in libtorch_cpu.dylib)
frame #11: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 288 (0x15b32d0f4 in libtorch_cpu.dylib)
frame #12: torch::autograd::generated::details::fft_r2c_backward(at::Tensor const&, c10::ArrayRef<long long>, long long, bool, c10::SymInt) + 788 (0x15e4f7c74 in libtorch_cpu.dylib)
frame #13: torch::autograd::generated::FftR2CBackward0::apply(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 312 (0x15cb31144 in libtorch_cpu.dylib)
frame #14: torch::autograd::Node::operator()(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 120 (0x15da92008 in libtorch_cpu.dylib)
frame #15: torch::autograd::Engine::evaluate_function(std::__1::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::__1::shared_ptr<torch::autograd::ReadyQueue> const&) + 2932 (0x15da88df4 in libtorch_cpu.dylib)
frame #16: torch::autograd::Engine::thread_main(std::__1::shared_ptr<torch::autograd::GraphTask> const&) + 640 (0x15da87c98 in libtorch_cpu.dylib)
frame #17: torch::autograd::Engine::thread_init(int, std::__1::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 336 (0x15da8697c in libtorch_cpu.dylib)
frame #18: torch::autograd::python::PythonEngine::thread_init(int, std::__1::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 112 (0x106f51898 in libtorch_python.dylib)
frame #19: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (torch::autograd::Engine::*)(int, std::__1::shared_ptr<torch::autograd::ReadyQueue> const&, bool), torch::autograd::Engine*, signed char, std::__1::shared_ptr<torch::autograd::ReadyQueue>, bool> >(void*) + 76 (0x15da95168 in libtorch_cpu.dylib)
frame #20: _pthread_start + 148 (0x1bf01426c in libsystem_pthread.dylib)
frame #21: thread_start + 8 (0x1bf00f08c in libsystem_pthread.dylib)
zsh: abort svc train -t
Try replacing if isinstance(self.trainer.accelerator, TPUAccelerator): with if True: here, to patch stft: https://github.com/34j/so-vits-svc-fork/blob/main/src/so_vits_svc_fork/train.py#L179-L179
I did that and now my code looks like this:
# check if using tpu
if True: to patch stft
if isinstance(self.trainer.accelerator, TPUAccelerator):
# patch torch.stft to use cpu
LOG.warning("Using TPU. Patching torch.stft to use cpu.")
I get an error saying "TabError: inconsistent use of tabs and spaces in indentation"
I thought it wasn't possible to train on a Mac?
I mean
if True:
    # patch torch.stft to use cpu
    LOG.warning("Using TPU. Patching torch.stft to use cpu.")
@34j
It looks like that worked!
It's running through the epochs. I will dedicate tomorrow to training a model and I'll let you know if I was successful.
Thank you so much for your help 🙏
Does it work well? Is it fast?
@spicymango73
I'm going to do a better test tomorrow and I'll follow up. I only got 3 epochs in, and each was taking roughly 40 seconds. I'm not sure what that means in terms of speed and performance. I also don't know whether the resulting model will be usable. Will update with more info when I can.
@allcontributors add escoolioinglesias bug, userTesting
@34j
I've put up a pull request to add @escoolioinglesias! :tada:
So far I have been unsuccessful with training. I still want to conduct a few more tests and then will try to put together a quick report.
I have an M2. It's not running on MPS.
I've read some PyTorch issues about MPS, and it turns out that MPS currently doesn't support complex types (like 1+2j), but I think svc requires complex types. One current workaround is adding a.to("cpu") before the operations that are not supported and a.to("mps") after them (see the sketch below).
Could this be a temporary workaround for an M1 version that can train? Once PyTorch supports all these operations, though, the added code should be removed.
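For illustration, here is a hedged sketch of that device-shuffling pattern, using torch.stft as an example of a complex-producing op (this is not code from this repository, just the general idea):

```python
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(1, 16000, device=device)

# Run the unsupported (complex-typed) op on the CPU...
spec = torch.stft(
    x.to("cpu"),
    n_fft=512,
    window=torch.hann_window(512),
    return_complex=True,
)

# ...then move a real-valued view back to the original device, since complex
# tensors are not supported on MPS.
spec = torch.view_as_real(spec).to(device)
```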