scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data
http://scvi-tools.org/
BSD 3-Clause "New" or "Revised" License

NaN returned from latent if more than default n_latent used #516

Closed davemcg closed 4 years ago

davemcg commented 4 years ago

I'm not certain how to debug this, but I'm getting NaN returned for the latent dimensions whenever I ask for more than the default number. What should I be doing in this situation?

My dataset is 1999 genes by 452,991 cells.

I'd be happy to share the loom file if that helps.

I'm using the latest version (0.5) of scVI and pytorch installed from conda.

```python
import sys
import matplotlib; matplotlib.use('agg')
import torch
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scvi.dataset import GeneExpressionDataset
from scvi.models import VAE, LDVAE
from scvi.inference import UnsupervisedTrainer
from scvi.inference.posterior import Posterior
from scvi.dataset import LoomDataset

args = sys.argv

loom_dataset = LoomDataset('data.loom', save_path='')

vae = VAE(loom_dataset.nb_genes, n_batch=loom_dataset.n_batches * True, n_latent=50, dispersion='gene-batch')
trainer = UnsupervisedTrainer(vae, loom_dataset, use_cuda=True)

n_epochs = 5
trainer.train(n_epochs=n_epochs)
```

```
training: 100%|████████████████| 5/5 [04:45<00:00, 57.00s/it]
```

```python
# extract the latent representation
full = trainer.create_posterior(trainer.model, loom_dataset, indices=np.arange(len(loom_dataset)))
latent, batch_indices, labels = full.sequential().get_latent()
latent[1:10, 1:10]
```

```
[[nan nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan nan]]
```

adamgayoso commented 4 years ago

Do you only have this issue when you use 50 latent dimensions? If not, I'd check to make sure the dataset was loaded properly.

How many batches do you have? I'd also try with dispersion='gene'.

davemcg commented 4 years ago

I've tried 20 latent dims and that worked, but 50 and 75 failed, both with and without switching dispersion to 'gene'.

I have 14 batches.

I'm not certain how to check whether the data was loaded correctly - I can't figure out how to inspect the counts data.

```
In [3]: loom_dataset
Out[3]:
GeneExpressionDataset object with n_cells x nb_genes = 452991 x 1999
    gene_attribute_names: 'gene_names'
    cell_attribute_names: 'organism', 'batch_indices', 'CellID', 'local_vars', 'batch', 'local_means', 'labels', 'cell_type'
    cell_categorical_attribute_names: 'labels', 'batch_indices'

In [4]: loom_dataset[1:10,1:10]
Out[4]: (slice(1, 10, None), slice(1, 10, None))
```

Or does this mean I've borked my loom dataset? I'm starting with a SparseMatrix in R and using the loomR package to write the loom file.

You can download the loom file yourself here: http://hpc.nih.gov/~mcgaugheyd/counts_hv_labelled.loom

adamgayoso commented 4 years ago

You can use loom_dataset.X to inspect the counts and loom_dataset.batch_indices to make sure it is a 2-d array (column vector) of integer values.

You might also try reducing the learning rate. For example, trainer.train(n_epochs=n_epochs, lr=5e-4).
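
Something like this (untested sketch; attribute names as printed by the GeneExpressionDataset above):

```python
import numpy as np

# Raw counts: should be non-negative, non-normalized UMI counts
print(loom_dataset.X[:5, :5])

# Batch encoding: expect an (n_cells, 1) column vector of integers 0..13 for 14 batches
print(loom_dataset.batch_indices.shape)
print(np.unique(loom_dataset.batch_indices))

# If the data look fine, retry training with a smaller learning rate
trainer.train(n_epochs=n_epochs, lr=5e-4)
```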

May I ask why you want to use so many latent dimensions? To some extent our hyperparameters were tuned so that values close to 10 work really well. When you increase to something like 50, you might also increase the chance that the latent space learns the batch, which apparently you do not want based on your test code.

Finally, when you call the UnsupervisedTrainer you may want to add n_epochs_kl_warmup=None. This is a subtle point, but the default is 400 epochs, meaning scVI behaves more like an autoencoder than a variational autoencoder.
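
For concreteness, that option goes in the trainer constructor, e.g. (a sketch based on your code above):

```python
from scvi.inference import UnsupervisedTrainer

trainer = UnsupervisedTrainer(
    vae,
    loom_dataset,
    use_cuda=True,
    n_epochs_kl_warmup=None,  # disable the 400-epoch KL warmup so the model trains as a full VAE
)
```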

davemcg commented 4 years ago

Weeeeeell, I'm trying lots of different integration methods and that parameter can make a big difference... but if you think that's a silly thing to tweak for your method, then I'll listen. You should know best!

So for a large, disparate dataset of ~450,000 cells (all retina cells, but from 14 independent studies across three species, many time points, and with different selection methods, which will generate highly different proportions of retinal cell types), what would you suggest (roughly)?

Thanks for your help (and if you want to take a crack at it, the loom file is available with ~250,000 pre-labelled cells across 15 or so cell types).

davemcg commented 4 years ago

Also to make certain I'm not being a total idiot, the idea is that the 10 latent dims (by 450k cells) are being fed into the UMAP to turn it into a two column (UMAP1/2) by 450k cell matrix for plotting, right?

adamgayoso commented 4 years ago

> Also to make certain I'm not being a total idiot, the idea is that the 10 latent dims (by 450k cells) are being fed into the UMAP to turn it into a two column (UMAP1/2) by 450k cell matrix for plotting, right?

This is correct.
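
Roughly (a sketch using the umap-learn package, assuming `latent` is the (n_cells, n_latent) array returned by get_latent()):

```python
import umap
import matplotlib.pyplot as plt

# latent: (n_cells, 10) scVI latent space -> (n_cells, 2) embedding for plotting
embedding = umap.UMAP(n_neighbors=15, min_dist=0.3).fit_transform(latent)

plt.scatter(embedding[:, 0], embedding[:, 1], s=1)
plt.xlabel("UMAP1")
plt.ylabel("UMAP2")
plt.savefig("umap.png", dpi=200)
```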

I would try: n_hidden=256, n_layers=2, lr=1e-3, n_epochs_kl_warmup=None (option in UnsupervisedTrainer)

The lack of warmup and the additional layers can help mix the batches better.

I've never run with that many cells before, but I imagine 10 epochs should be sufficient. You can plot the loss history (there's an example in the basic tutorial) and check for convergence. Alternatively, you could use early stopping (I'm not sure we have an example in the tutorials, but if you're interested I can post some starting-point args here).
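
A rough sketch of that convergence check (assumes the trainer was created with a monitoring frequency, e.g. frequency=1, and that the history keys match the old basic tutorial):

```python
import matplotlib.pyplot as plt

# ELBO recorded every `frequency` epochs during training
elbo_train = trainer.history["elbo_train_set"]
elbo_test = trainer.history["elbo_test_set"]

plt.plot(elbo_train, label="train")
plt.plot(elbo_test, label="test")
plt.xlabel("recorded epoch")
plt.ylabel("ELBO")
plt.legend()
plt.savefig("elbo_history.png", dpi=200)
```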

And just to make sure - dataset.X should contain non-normalized raw UMI counts.

Not sure I'll have time to look at it as I'm traveling, unfortunately.

Edit: You might also want to try increasing the mini-batch size by adding data_loader_kwargs={"batch_size": 256} to the call to the trainer. This will not only make things go faster but will also increase the number of biological batches seen in a single mini-batch.
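
Putting the suggestions together, roughly (a sketch; exact signatures may vary a bit between scVI versions):

```python
from scvi.models import VAE
from scvi.inference import UnsupervisedTrainer

vae = VAE(
    loom_dataset.nb_genes,
    n_batch=loom_dataset.n_batches,  # 14 biological batches
    n_latent=10,                     # stay close to the tuned default
    n_hidden=256,
    n_layers=2,
)
trainer = UnsupervisedTrainer(
    vae,
    loom_dataset,
    use_cuda=True,
    n_epochs_kl_warmup=None,                 # no KL warmup
    data_loader_kwargs={"batch_size": 256},  # larger mini-batches: faster, more batches per step
)
trainer.train(n_epochs=10, lr=1e-3)
```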

davemcg commented 4 years ago

Thanks for the tips - now it seems that if I increase the hidden layer size (n_hidden) beyond 128 I get NaN returned. I've tried removing the cells with the fewest and the most counts, but that didn't change the NaN results.

Anyways, from previous testing I don't recall going from 128 to 256 making a substantial difference in performance.

Thanks so much for your help.

(Oh, and setting data_loader_kwargs={"batch_size": 256} doubles the speed per epoch on my dataset.)

I'll probably try tweaking the number of HVGs (highly variable genes) to see how that influences the results.

adamgayoso commented 4 years ago

More genes is probably a good idea - and given how many cells you have you can likely use as many genes as possible.
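
If it helps, something like the following sketch could keep a larger gene set before training (subsample_genes is the dataset helper from this scVI era; double-check the argument name against your version):

```python
# Keep a larger set of variable genes on the loaded dataset, then inspect the result
loom_dataset.subsample_genes(new_n_genes=4000)
print(loom_dataset.nb_genes)
```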

adamgayoso commented 4 years ago

Closing due to inactivity.

mortonjt commented 4 years ago

FYI, I'm also running into numerical issues (see the error below).

```
File "/mnt/home/jmorton/venvs/mmvec-torch/bin/mavi", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
File "/mnt/home/jmorton/software/mavi/scripts/mavi", line 74, in <module>
    mavi()
File "/mnt/home/jmorton/venvs/mmvec-torch/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
File "/mnt/home/jmorton/venvs/mmvec-torch/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
File "/mnt/home/jmorton/venvs/mmvec-torch/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/mnt/home/jmorton/venvs/mmvec-torch/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
File "/mnt/home/jmorton/venvs/mmvec-torch/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
File "/mnt/home/jmorton/software/mavi/scripts/mavi", line 53, in unsupervised
    trainer.train(n_epochs=n_epochs)
File "/mnt/home/jmorton/software/scVI/scvi/inference/trainer.py", line 177, in train
    self.on_training_loop(tensors_list)
File "/mnt/home/jmorton/software/scVI/scvi/inference/trainer.py", line 201, in on_training_loop
    self.current_loss = loss = self.loss(*tensors_list)
File "/mnt/home/jmorton/software/scVI/scvi/inference/inference.py", line 109, in loss
    sample_batch, local_l_mean, local_l_var, batch_index, y
File "/mnt/home/jmorton/venvs/mmvec-torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
File "/mnt/home/jmorton/software/scVI/scvi/models/vae.py", line 316, in forward
    reconst_loss = self.get_reconstruction_loss(x, px_rate, px_r, px_dropout)
File "/mnt/home/jmorton/software/scVI/scvi/models/vae.py", line 219, in get_reconstruction_loss
    -NegativeBinomial(mu=px_rate, theta=px_r).log_prob(x).sum(dim=-1)
File "/mnt/home/jmorton/software/scVI/scvi/models/distributions.py", line 95, in __init__
    super().__init__(validate_args=validate_args)
File "/mnt/home/jmorton/venvs/mmvec-torch/lib/python3.6/site-packages/torch/distributions/distribution.py", line 36, in __init__
    raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter mu has invalid values
```

I'm guessing that these are numerical issues due to the exponentiation before evaluating the NegativeBinomial.

These lines are very suspicious:
https://github.com/YosefLab/scVI/blob/master/scvi/models/modules.py#L270
https://github.com/YosefLab/scVI/blob/master/scvi/models/distributions.py#L46

I don't think any exponentiation should be required to evaluate the log probability of the NB/ZINB - it should be possible to pass only logits to these distributions.
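
To illustrate the idea (not the actual scVI code): a sketch of the NB log-likelihood evaluated directly from log-parameters, so the decoder output never has to be exponentiated into mu before the likelihood; it assumes torch.logaddexp is available (PyTorch >= 1.5).

```python
import torch

def nb_log_prob_from_logs(x, log_mu, log_theta):
    """NB log-likelihood (mean/dispersion parameterization) computed from log-parameters."""
    theta = torch.exp(log_theta)
    # log(theta + mu) computed stably without ever forming mu explicitly
    log_theta_plus_mu = torch.logaddexp(log_theta, log_mu)
    return (
        torch.lgamma(x + theta)
        - torch.lgamma(theta)
        - torch.lgamma(x + 1.0)
        + theta * (log_theta - log_theta_plus_mu)
        + x * (log_mu - log_theta_plus_mu)
    )

# Example evaluation
x = torch.tensor([0.0, 3.0, 10.0])
log_mu = torch.log(torch.tensor([1.5, 2.0, 8.0]))
log_theta = torch.zeros(3)  # theta = 1
print(nb_log_prob_from_logs(x, log_mu, log_theta))
```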

I'll probably tinker with this over the next few days - will submit PR if I can fix these problems.

mortonjt commented 4 years ago

Ok - there are actually quite a few points that have numerical stability issues. My working branch is here: https://github.com/YosefLab/scVI/pull/663. I've been able to run through one of my datasets without numerical issues (though I still haven't verified the outputs).

Hrovatin commented 4 years ago

I have been experiencing the same error when using autotune. Is there a recommended solution? If I understand correctly, the fix in PR https://github.com/YosefLab/scVI/pull/663 is not working yet, right?