Hi @dlaplagne, I'm glad you're using AVA. It looks like the model is running into a numerical instability that puts NaNs in the parameters. My best guess is that the approximate posteriors are becoming singular. If that's the case, then modifying this line in ava/models/vae.py may fix the issue. Try changing this:
d = torch.exp(self.fc43(d))
to this:
d = torch.exp(self.fc43(d)) + 1e-3
Adding the small constant could stabilize things. Once this line is changed, reinstall AVA with pip install . and let me know if the model runs into the same problem. If it doesn't, then I'll update the code with this fix.
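To see why the floor helps, here is a toy demonstration (not AVA code; the tensor shapes just mirror the ones in your error message):

import torch
from torch.distributions import LowRankMultivariateNormal

# Toy demonstration: if fc43's output saturates to a large negative value,
# exp() underflows to exactly zero in float32, so the diagonal covariance of
# the approximate posterior becomes singular and training can produce NaNs.
raw = torch.full((64, 32), -200.0)  # stand-in for self.fc43(d)
mu = torch.zeros(64, 32)
u = torch.zeros(64, 32, 1)

d_bad = torch.exp(raw)              # underflows to 0 -> singular covariance
d_ok = torch.exp(raw) + 1e-3        # floor keeps every diagonal entry positive

print(d_bad.min().item())           # 0.0
dist = LowRankMultivariateNormal(mu, u, d_ok)  # constructs fine
# LowRankMultivariateNormal(mu, u, d_bad) fails: cov_diag must be positive.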
One other thing I noticed in the reconstructions you attached: there is high contrast between the pre- and post-syllable times (dark blue) and the background noise during the syllable times (blue/green). The VAE tries to explain variance in these images, and a large portion of that variance reflects just the duration of the syllable segment, not the syllables themselves. I would recommend increasing the spec_min_val parameter in the preprocessing step so that the colors within and outside the syllable are less distinct.
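For intuition, here is a hedged sketch of what a higher noise floor does (I'm assuming spec_min_val acts as a clipping floor before the spectrogram is scaled to [0, 1]; the helper name is just illustrative):

import numpy as np

# Sketch, assuming spec_min_val is a noise floor: dB values below it are
# clipped before the spectrogram is scaled to [0, 1]. Raising the floor pushes
# quiet background and silent padding toward the same value.
def apply_floor(spec_db, spec_min_val=-60.0, spec_max_val=0.0):
    spec_db = np.clip(spec_db, spec_min_val, spec_max_val)
    return (spec_db - spec_min_val) / (spec_max_val - spec_min_val)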
Hello @jackgoffinet
Thanks, I'll try adding 1e-3 and get back to you.
I'm actually producing the h5 files with the specs in MATLAB from pre-extracted audio clips. I'll try padding each one with its mean intensity value (instead of my current fixed low value), which should reduce the contrast between the actual spectrogram and the padding. I'll attach my MATLAB code for exporting the h5s in a later reply, in case it helps anyone in a similar situation.
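The padding idea in rough Python form (just a sketch; my actual export code is in MATLAB, and pad_to_width is a made-up helper):

import numpy as np

# Sketch: pad each spectrogram to the target width with its own mean intensity
# instead of a fixed low value, so the padding blends into the background.
def pad_to_width(spec, target_width=128):
    pad = target_width - spec.shape[1]
    assert pad >= 0, "spectrogram wider than target"
    left = pad // 2
    return np.pad(spec, ((0, 0), (left, pad - left)),
                  mode='constant', constant_values=float(spec.mean()))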
cheers
Hi @dlaplagne,
It would also be helpful to check the range of intensity values in the spectrograms: neural networks have a tough time with large values, so if the spectrograms have values ranging from 0 to 255 (like most images), you might run into this error (I have in the past). If the spectrograms have a large range of intensities, it would be useful to normalize them to lie in a smaller range (0 to 1, for example) prior to training.
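For example (just a sketch with fake data):

import numpy as np

# Bring 8-bit-style spectrogram values (0-255) into [0, 1] before training so
# the network sees small, well-scaled inputs.
spec = np.random.randint(0, 256, size=(128, 128)).astype(np.float32)  # fake spec
spec01 = spec / 255.0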
Thanks, Miles
Chiming in to say that we've observed similar issues: https://github.com/yardencsGitHub/tweetynet/pull/76
tl;dr: you might do pre-processing that's good for visualizing, e.g. a log transform plus thresholding, which gives you the high contrast @jackgoffinet observed, but feeding those arrays into a net then makes training really unstable.
So not only is it good to keep values as small floats between 0 and 1, but you probably also want the spectrogram values to vary smoothly if possible.
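A toy example of what I mean (fake data):

import numpy as np

# Hard thresholding after a log transform gives bimodal pixel values (nice to
# look at, rough on gradient descent); a plain compressive transform keeps the
# values smoothly distributed.
power = np.abs(np.random.randn(128, 128)) ** 2            # fake spectrogram power
for_viewing = (np.log1p(power) > 1.0).astype(np.float32)  # values in {0, 1}
for_training = np.log1p(power) / np.log1p(power).max()    # smooth values in [0, 1]
print(np.unique(for_viewing).size, np.unique(for_training).size)  # 2 vs. thousands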
Ok @jackgoffinet, I think the advice on the padding was spot-on. I padded the spectrograms with their mean instead, the training proceeded without errors, and the model now indeed focuses more on encoding the USVs themselves: more_reconstructions.pdf
After about 100 epochs the training stabilized at a loss of around 78. As can be seen in the reconstructions, some features of rat USVs seem harder to encode, particularly the trills. I will keep exploring this, maybe focusing on the USVs with the highest signal-to-noise ratio.
I did try the 1e-3 addition on the old specs and the training didn't crash (the crashes were unpredictable, so I can't know for sure that they will never happen). The model clearly focuses on getting the duration right: oldpadding_1e-3_reconstruction.pdf
Thank you @mdmarti and @NickleDave for your input. My values were already between 0 and 1 to mimic the AVA specs. Here's a histogram of pixel values for some 1000 specs (before and after padding and reshaping to 128x128):
I guess the issue is closed? Perhaps there's another forum to freely discuss using the autoencoder on vocalizations?
@dlaplagne, that looks much better. I won't add the 1e-3 for now unless it seems to be a common problem. Here are a couple more things to try that may improve the results:
1) It looks like the mean value padding you added uses a mean computed within each syllable. Padding with a mean value computed across all the syllables will make the images more consistent. I would also raise the noise floor to this mean value, with everything at or below the mean set to zero, the maximum image value set to 1, and everything in between linearly interpolated between these two values (see the sketch after this list). This will remove some of the noise-related variability in the images that we don't care about modeling.
2) There's a parameter you can pass to the VAE called model_precision that controls the reconstruction/regularization tradeoff the VAE makes. If you increase it from its default value of 10 to, say, 20, it should improve the quality of the reconstructions at the price of a less well-behaved latent space. Adding more training data, if you have any more, will also improve the reconstructions.
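Both suggestions in rough code form (a hedged sketch: rescale and its arguments are names I'm making up here, while model_precision is the actual VAE keyword argument, assumed here to be passed at construction):

import numpy as np
from ava.models.vae import VAE

# (1) Illustrative only: use one mean computed across *all* syllables as the
# noise floor, map it (and anything below) to 0 and the global max to 1, with
# linear interpolation in between.
def rescale(spec, global_mean, global_max):
    spec = np.clip(spec, global_mean, global_max)
    return (spec - global_mean) / (global_max - global_mean)

# (2) Raise model_precision from its default of 10 to weight reconstruction
# more heavily relative to the latent-space regularization.
model = VAE(save_dir='/home/dalaplagne/ava', model_precision=20.0)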
I'll close this issue now, but feel free to re-open it or open another issue if you run into more problems. I think this Issues tab is the best place for these sorts of questions.
Best,
Jack
Dear Pearson Lab,
I am running into an issue while training the VAE with 100k specs (code and output follow): training can fail with a ValueError. I am training on specs I made myself from previously segmented rat USVs, but they seem to work properly, as training runs and the training spectrograms look ok in reconstruction.pdf. Attachments: syllables_0000.zip, reconstruction.pdf
Here is my Python environment: environment.txt
Thank you for making the code available; I'm looking forward to using it in my research.
from ava.data.data_container import DataContainer
from ava.models.vae import X_SHAPE, VAE
from ava.models.vae_dataset import get_syllable_partition, get_syllable_data_loaders
import os

root = '/home/dalaplagne/ava'
spec_dirs = ['/scratch/local/dalaplagne/specs']

model = VAE(save_dir=root)
# split=1 puts all files in the training partition, so the test loader is just
# the training loader reused.
partition = get_syllable_partition(spec_dirs, split=1, max_num_files=None)
num_workers = 1
loaders = get_syllable_data_loaders(partition, num_workers=num_workers)
loaders['test'] = loaders['train']
model.train_loop(loaders, epochs=256, test_freq=None, save_freq=10)
========================================
Training: epochs 0 to 255
Training set: 100000
Test set: 100000

Epoch: 0 Average loss: 1474.0030
Epoch: 1 Average loss: 182.2966
Epoch: 2 Average loss: 153.2063
Epoch: 3 Average loss: 134.9274
Traceback (most recent call last):
  File "/home/dalaplagne/ava/train_ava.py", line 16, in <module>
    model.train_loop(loaders, epochs=256, test_freq=None, save_freq=10)
  File "/home/dalaplagne/.conda/envs/vocal/lib/python3.9/site-packages/ava/models/vae.py", line 417, in train_loop
    loss = self.train_epoch(loaders['train'])
  File "/home/dalaplagne/.conda/envs/vocal/lib/python3.9/site-packages/ava/models/vae.py", line 350, in train_epoch
    loss = self.forward(data)
  File "/home/dalaplagne/.conda/envs/vocal/lib/python3.9/site-packages/ava/models/vae.py", line 312, in forward
    latent_dist = LowRankMultivariateNormal(mu, u, d)
  File "/home/dalaplagne/.conda/envs/vocal/lib/python3.9/site-packages/torch/distributions/lowrank_multivariate_normal.py", line 109, in __init__
    super(LowRankMultivariateNormal, self).__init__(batch_shape, event_shape,
  File "/home/dalaplagne/.conda/envs/vocal/lib/python3.9/site-packages/torch/distributions/distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (64, 32)) of distribution LowRankMultivariateNormal(loc: torch.Size([64, 32]), cov_factor: torch.Size([64, 32, 1]), cov_diag: torch.Size([64, 32])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<...>)