pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

torchaudio mobile? #408

Open dkashkin opened 4 years ago

dkashkin commented 4 years ago

🚀 Feature

torchaudio should work on Android and iOS platforms

Motivation

Mobile apps need a fast way to preprocess audio (e.g. to generate spectrograms).

Pitch

I would love to see a lightweight interface that provides access to all torchaudio functions from Java and Objective-C. Heavy logic like FFT should be optimized for performance (e.g. pushed to the GPU whenever one is available).

Alternatives

TensorFlow Lite is moving in this direction (MFCC and most other signal processing ops are already whitelisted).

Additional context

vincentqb commented 4 years ago

Transforms and functionals are now jittable. Have you tried exporting your model using jit and then importing it on mobile? See the mobile page. Is that what you are trying to do?

dkashkin commented 4 years ago

Wow this sounds great! I have not noticed this in the docs and assumed that heavy stuff like FFT is not supported yet. I'll test it out over the next couple of days and share my results here.

dkashkin commented 4 years ago

@vincentqb I just traced the simplest possible module (with just a single torchaudio.transforms.MelSpectrogram) and attempted to run it in the demo app (github.com/pytorch/android-demo-app with org.pytorch:pytorch_android:1.4.0.1). Unfortunately, the forward method crashes on Android:

fft: ATen not compiled with MKL support

File "code/__torch__/torch/nn/modules/module/___torch_mangle_5.py", line 37
        _16 = ops.prim.NumToTensor(torch.size(input1, 2))
        input2 = torch.view(input1, [int(_15), int(_16)])
        spec_f = torch.stft(input2, 1024, 143, 1024, window, False, True)
                 ~~~~~~~~~~ <--- HERE
        _17 = ops.prim.NumToTensor(torch.size(spec_f, 1))
        _18 = ops.prim.NumToTensor(torch.size(spec_f, 2))    
        at org.pytorch.NativePeer.forward(Native Method)
        at org.pytorch.Module.forward(Module.java:37)
        at org.pytorch.helloworld.MainActivity.onCreate(MainActivity.java:74)

Can you please confirm whether pytorch mobile runtime is indeed supposed to support the STFT transform?

vincentqb commented 4 years ago

Indeed, fft is not currently supported on pytorch mobile, as mentioned here.

dkashkin commented 4 years ago

Ouch :( Do you think this will be resolved soon? If not, I would highly recommend improving the PyTorch Mobile documentation to specify which transforms are supported versus not. Also, I hope this feature request will stay open until all torchaudio features start working on mobile.

vincentqb commented 4 years ago

@dreiss @supriyar -- do you have some information to share about the functions that are supported within mobile, or the ones that would welcome contributions from the community?

himajin2045 commented 4 years ago

I'm working on an iOS project and hitting this error:

fft: ATen not compiled with MKL support

Although Apple's vDSP framework could do this kind of processing, I would like to stick to the torch solution for its simplicity and consistency.

I hope we could have fft support in mobile since it's a very basic building block in audio processing and would be needed in many speech related models.

dreiss commented 4 years ago

We enable every forward CPU op on mobile. I think the issue here is that this particular op doesn't have a portable implementation. We would be interested in a PR adding one. Another option would be to do the fft before feeding the data to your model.

vincentqb commented 4 years ago

Is there a PR we could link to indicating which operations we would welcome contributions for?

dreiss commented 4 years ago

I don't have a list because we haven't tested out every niche op, but if there is any op that doesn't have a non-Intel implementation, I'd be open to seeing a PR.

PCerles commented 4 years ago

Don't have the time to personally PR, but you can do a naive implementation of rfft with matmuls, e.g.

(This code is translated from TensorFlow Magenta at https://github.com/tensorflow/magenta/blob/cf80d19fc0c2e935821f284ebb64a8885f793717/magenta/music/melspec_input.py#L64-L90; I removed some padding code I didn't need.)

import math
from typing import Tuple

import torch
from torch import Tensor


def _dft_matrix(dft_length):
    # type: (int) -> Tuple[Tensor, Tensor]
    # Angular step of the DFT; the same scale is used for the cosine (real)
    # and sine (imaginary) parts.
    real = 2 * math.pi / float(dft_length)
    imag = 2 * math.pi / float(dft_length)

    # Outer product of two index vectors: entry [i, j] is i * j.
    sum_components = torch.ger(torch.arange(dft_length), torch.arange(dft_length))

    # A real-input DFT only needs the first dft_length // 2 + 1 frequencies.
    keep_values = dft_length // 2 + 1
    real_part = torch.cos(real * sum_components)[:keep_values, :].transpose(0, 1)
    imag_part = -torch.sin(imag * sum_components)[:keep_values, :].transpose(0, 1)

    return real_part, imag_part

def _naive_rdft(signal_tensor, fft_length):
    # type: (Tensor, int) -> Tensor
    """Implement real-input Fourier Transform by matmul."""
    # We are right-multiplying by the DFT matrix, and we are keeping
    # only the first half ("positive frequencies").
    # So discard the second half of rows, but transpose the array for
    # right-multiplication.
    # The DFT matrix is symmetric, so we could have done it more
    # directly, but this reflects our intention better.
    real_dft_tensor, imag_dft_tensor = _dft_matrix(fft_length)
    result_real_part = torch.bmm(signal_tensor, real_dft_tensor.unsqueeze(0))
    result_imag_part = torch.bmm(signal_tensor, imag_dft_tensor.unsqueeze(0))

    # The last dimension holds (real, imag), matching the layout of torch.rfft.
    return torch.stack([result_real_part, result_imag_part], dim=3)

# windowed_frames [1, T, 512]
out_check = _naive_rdft(windowed_frames, 512)
# torch.rfft was removed in PyTorch 1.8; on newer versions use
# torch.view_as_real(torch.fft.rfft(windowed_frames, dim=-1)) instead.
out = torch.rfft(windowed_frames, 1)
assert torch.allclose(out_check, out, atol=1e-5)
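To make the arithmetic behind the matmul approach concrete, here is a small sketch in plain Python with no torch dependency. The names `dft_matrix` and `naive_rdft` are hypothetical and are not part of torch or torchaudio. Unlike the snippet above, it reduces the index product modulo the DFT length before scaling; that is an added assumption meant to keep the cosine and sine arguments small, not something the original code does.

```python
import math
from typing import List, Tuple


def dft_matrix(n: int) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (real, imag) matrices of shape [n][n // 2 + 1]."""
    keep = n // 2 + 1  # a real-input DFT has n // 2 + 1 unique bins
    real = [[0.0] * keep for _ in range(n)]
    imag = [[0.0] * keep for _ in range(n)]
    for t in range(n):
        for k in range(keep):
            # The phase is periodic in t * k with period n, so reduce first.
            angle = 2.0 * math.pi * ((t * k) % n) / n
            real[t][k] = math.cos(angle)
            imag[t][k] = -math.sin(angle)
    return real, imag


def naive_rdft(frame: List[float], n: int) -> List[Tuple[float, float]]:
    """Multiply one frame (a row vector) by the DFT matrices."""
    real, imag = dft_matrix(n)
    keep = n // 2 + 1
    spectrum = []
    for k in range(keep):
        re = sum(frame[t] * real[t][k] for t in range(n))
        im = sum(frame[t] * imag[t][k] for t in range(n))
        spectrum.append((re, im))
    return spectrum


# A sine wave that completes exactly 3 cycles over 16 samples.
frame = [math.sin(2.0 * math.pi * 3 * t / 16) for t in range(16)]
spectrum = naive_rdft(frame, 16)
```

For this input all the energy lands in bin 3 with a purely imaginary value of -n/2, which is a convenient hand-checkable case. The cost is O(n²) per frame, so this is a fallback for when no FFT is available rather than a replacement for one.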

dstrube1 commented 3 years ago

When I try to import torchaudio into my Android project's app build.gradle file like so: implementation 'org.pytorch:pytorch_android_torchaudio:1.5.0'

I get this error: Could not find org.pytorch:pytorch_android_torchaudio:1.5.0.

Is this the same issue as what's reported above?

mthrok commented 3 years ago

@dstrube1

There is no Android package dedicated to torchaudio. You build your model or pipeline in Python, dump it as a TorchScript file, then load it from your app and run it with the TorchScript runtime. Please refer to the following:

https://pytorch.org/mobile/home/
https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html

algat commented 2 years ago

Hi @jeffxtang, it seems that we can run fft directly with pytorch from version 1.10, due to the PocketFFT support from this commit: https://github.com/pytorch/pytorch/commit/4036820506693b71a96b9e20989bfe286acccd89

Can you confirm? thanks

jeffxtang commented 2 years ago

Thanks for your question. Yes it's confirmed. We're working on an update of the demos to simplify the iOS and Android code.

henryleemr commented 1 year ago

Hi, just following up: is torchaudio available on React Native now? @mthrok @vincentqb @dkashkin I'm trying to do something like this:

import librosa
import torch
import torchaudio

audio, sample_rate = librosa.load(audio_filepath, sr=None)
# audio is a numpy array of floats 1xN, and sample_rate is an int

# Convert audio numpy array to spec image which is a numpy array or pytorch tensor
clip = torch.Tensor(audio)
# clip is a pytorch tensor datatype, which has the same dimensions as audio, 1xN tensor

# n_fft, window_length, hop_length and n_mels are assumed to be defined elsewhere
spec = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_fft=n_fft, win_length=window_length, hop_length=hop_length, n_mels=n_mels)(clip)
# spec is a spectrogram, which is an image, data type is a pytorch tensor, and the dimensions are 3x128x15000

All on a React Native app :)

jeffxtang commented 1 year ago

@raedle may know if torchaudio is available on React Native.

Btw, both the iOS and Android streaming ASR demo apps using PyTorch 1.12 and torchaudio 0.12 were updated in July.

raedle commented 1 year ago

@jeffxtang it depends. Generally, if it is possible in native Android and iOS, then it is theoretically possible in React Native.

I did a quick check on the PyTorch Android/iOS demo apps but couldn't see any dependency on torchaudio in the app. How is the audio converted to tensors?

henryleemr commented 1 year ago

Yeah, I couldn't find the part of the code where we take a stream of audio from React Native and convert it to tensors, or the part that converts the raw audio streams into Mel spectrograms. Anybody got a clue?

raedle commented 1 year ago

@henryleemr PlayTorch doesn't support audio streaming. Do you have an end-to-end example for either iOS or Android w/o React Native?

I might be open to looking into this for PlayTorch, which would enable others to use audio streaming in the future.

jeffxtang commented 1 year ago

@raedle This script uses torchaudio to convert the audio and generates the model used in the iOS & Android apps. So internally the model uses torchaudio's transforms.MelSpectrogram to do the audio conversion. @henryleemr

henryleemr commented 1 year ago

Ah cool! So if I were to, say, use ResNet for the spectrogram (which is just an image), I can do something like this?

import torch
import torch.nn as nn
import torchaudio
import torchvision.models as models

class ResNet(nn.Module):
    def __init__(self, dataset, pretrained=True):
        super(ResNet, self).__init__()
        num_classes = 4

        self.transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400, n_mels=80, hop_length=160)
        self.model = models.resnet50(pretrained=pretrained)
        self.model.fc = nn.Linear(2048, num_classes)

    def forward(self, raw_audio_tensor):
        spectrogram = self.transform(raw_audio_tensor)
        output = self.model(spectrogram)
        return output

Correct me if I am wrong: once we've exported the model trained using this class into a .ptl model file, we can just use the usual React Native methods to load the model and pass raw audio streams into it, and the Mel spectrogram transformation would be applied to the raw audio stream without needing any React Native or JavaScript Mel spectrogram functions?

jeffxtang commented 1 year ago

Based on how the model that implements the melspec transform is used in the iOS & Android apps, yes that's correct.

@hwangjeff @mthrok can you please double confirm?

mthrok commented 1 year ago

yeah, the script looks good to me.

Consulting4J commented 5 months ago

@jeffxtang Hello, any update in 2024? ;-) We think consistency is critical: we trained on PC with torchaudio's feature extractors (MFCC, etc.), so we need the same library for feature extraction on other platforms such as Android and iOS. Because of the performance gap, we will most likely need to convert our PyTorch models to ONNX models during the deployment phase, so keeping feature extraction consistent is critical for the model to run consistently across all platforms. Thanks a lot!