scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data
http://scvi-tools.org/
BSD 3-Clause "New" or "Revised" License

training fails for certain dataset size #426

Closed coh-racng closed 5 years ago

coh-racng commented 5 years ago

Issue similar to #221. My dataset has 57345 cells, and it so happens that 57345 % 128 == 1, so I end up with a batch of size 1 during training, which raises an error:

Traceback (most recent call last):
  File "../code/CAR_T.py", line 37, in <module>
    correct.run_scvi()
  File "/net/isi-dcnl/ifs/user_data/vjonsson/racng/singlecell/code/scvi_analysis/batch_correct.py", line 80, in run_scvi
    self.trainer.train(n_epochs=self.n_epochs, lr=lr)
  File "/net/isi-dcnl/ifs/user_data/vjonsson/racng/git/scVI/scvi/inference/trainer.py", line 140, in train
    loss = self.loss(*tensors_list)
  File "/net/isi-dcnl/ifs/user_data/vjonsson/racng/git/scVI/scvi/inference/inference.py", line 48, in loss
    reconst_loss, kl_divergence = self.model(sample_batch, local_l_mean, local_l_var, batch_index)
  File "/opt/Python/3.6.5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/net/isi-dcnl/ifs/user_data/vjonsson/racng/git/scVI/scvi/models/vae.py", line 203, in forward
    px_scale, px_r, px_rate, px_dropout, qz_m, qz_v, z, ql_m, ql_v, library = self.inference(x, batch_index, y)
  File "/net/isi-dcnl/ifs/user_data/vjonsson/racng/git/scVI/scvi/models/vae.py", line 166, in inference
    qz_m, qz_v, z = self.z_encoder(x_, y)
  File "/opt/Python/3.6.5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/net/isi-dcnl/ifs/user_data/vjonsson/racng/git/scVI/scvi/models/modules.py", line 125, in forward
    q = self.encoder(x, *cat_list)
  File "/opt/Python/3.6.5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/net/isi-dcnl/ifs/user_data/vjonsson/racng/git/scVI/scvi/models/modules.py", line 72, in forward
    x = layer(x)
  File "/opt/Python/3.6.5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/Python/3.6.5/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 66, in forward
    exponential_average_factor, self.eps)
  File "/opt/Python/3.6.5/lib/python3.6/site-packages/torch/nn/functional.py", line 1251, in batch_norm
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size [1, 128]
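The root cause can be reproduced outside scvi entirely. This is a minimal sketch (not scvi code, just plain PyTorch) showing that `BatchNorm1d` in training mode cannot compute batch statistics from a single sample:

```python
# Minimal reproduction: BatchNorm1d needs more than one value per channel
# to estimate batch statistics while in training mode.
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(128)
bn.train()

try:
    bn(torch.randn(1, 128))  # a batch of size 1, as in the traceback above
except ValueError as e:
    print(e)  # "Expected more than 1 value per channel when training, ..."

out = bn(torch.randn(2, 128))  # any batch of size >= 2 is fine
print(out.shape)
```

In eval mode the layer uses its running statistics instead, which is why the error only appears during training.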

A potential fix is to adjust the batch size so that no batch ends up with a single cell:

batch_size = 128
while self.gene_dataset.nb_cells % batch_size == 1:
    batch_size += 1  # adjust batch size so that no batch has only one cell
trainer = UnsupervisedTrainer(vae, self.gene_dataset, train_size=train_size,
                              use_cuda=self.use_cuda, frequency=1,
                              data_loader_kwargs={'batch_size': batch_size})
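An alternative workaround, sketched below with plain PyTorch rather than the scvi trainer, is to let the `DataLoader` drop the trailing undersized batch via `drop_last=True`, which avoids the size-1 batch without changing `batch_size` (at the cost of never training on those leftover cells within an epoch):

```python
# Sketch: drop_last=True discards the final incomplete batch,
# so a dataset with n % batch_size == 1 never yields a size-1 batch.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(257, 8))  # 257 % 128 == 1, same situation
loader = DataLoader(dataset, batch_size=128, drop_last=True)

sizes = [batch[0].shape[0] for batch in loader]
print(sizes)  # the trailing size-1 batch is dropped: [128, 128]
```

With a shuffling sampler, a different cell is left out each epoch, so over many epochs every cell still contributes to training.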
ANazaret commented 5 years ago

Excellent find! Great error. Yes, your workaround is fine for personal use. However, I don't really want to change the batch size in the codebase just because of that.

I guess we should just ignore batches of size < 3 in the training. If we train for more than 1 epoch, the RandomSampler will likely sample the held-out cells in a bigger batch later, so it's not a big deal. What do you think @romain-lopez, @gabmis?
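A hypothetical sketch of that suggestion (the threshold name and helper are mine, not scvi API): filter out undersized minibatches before they reach the model, leaving `batch_size` untouched:

```python
# Sketch: skip minibatches that are too small for BatchNorm statistics.
import torch
from torch.utils.data import DataLoader, TensorDataset

MIN_BATCH = 3  # assumed threshold, matching the "size < 3" suggestion above

def filtered_batches(loader):
    """Yield only batches large enough to train on."""
    for (x,) in loader:
        if x.shape[0] >= MIN_BATCH:
            yield x  # a shuffling sampler revisits skipped cells next epoch

loader = DataLoader(TensorDataset(torch.randn(257, 8)), batch_size=128)
sizes = [x.shape[0] for x in filtered_batches(loader)]
print(sizes)  # the trailing size-1 batch is skipped: [128, 128]
```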

romain-lopez commented 5 years ago

That makes sense to me!