ludvb / xfuse

Super-resolved spatial transcriptomics by deep data fusion
MIT License

TensorBoard-related error at epoch 500 #14

Open LucaMarconato opened 4 years ago

LucaMarconato commented 4 years ago

Hello, I am trying to run XFuse on 8 Visium slides using GPUs. Training proceeds normally through epoch 499. Then, at epoch 500, the session panics and I get the error below. It seems to involve TensorBoard. Any idea how to fix it?

[2020-09-30 12:28:39,561] INFO : Epoch 00480 | ELBO -7.875e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.087
[2020-09-30 12:28:48,967] INFO : Epoch 00481 | ELBO -7.904e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.086
[2020-09-30 12:28:58,310] INFO : Epoch 00482 | ELBO -8.582e+07 | Running ELBO -7.8119e+07 | Running RMSE 2.086
[2020-09-30 12:29:06,470] INFO : Epoch 00483 | ELBO -7.125e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.085
[2020-09-30 12:29:15,118] INFO : Epoch 00484 | ELBO -7.666e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.085
[2020-09-30 12:29:23,711] INFO : Epoch 00485 | ELBO -8.059e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.084
[2020-09-30 12:29:31,970] INFO : Epoch 00486 | ELBO -7.903e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.084
[2020-09-30 12:29:40,730] INFO : Epoch 00487 | ELBO -7.779e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.083
[2020-09-30 12:29:49,476] INFO : Epoch 00488 | ELBO -7.583e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.083
[2020-09-30 12:29:57,758] INFO : Epoch 00489 | ELBO -7.945e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.082
[2020-09-30 12:30:06,495] INFO : Epoch 00490 | ELBO -7.764e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.082
[2020-09-30 12:30:15,647] INFO : Epoch 00491 | ELBO -7.852e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.082
[2020-09-30 12:30:23,895] INFO : Epoch 00492 | ELBO -7.722e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.081
[2020-09-30 12:30:32,097] INFO : Epoch 00493 | ELBO -7.302e+07 | Running ELBO -7.8117e+07 | Running RMSE 2.081
[2020-09-30 12:30:40,771] INFO : Epoch 00494 | ELBO -7.402e+07 | Running ELBO -7.8116e+07 | Running RMSE 2.080
[2020-09-30 12:30:49,114] INFO : Epoch 00495 | ELBO -8.163e+07 | Running ELBO -7.8117e+07 | Running RMSE 2.080
[2020-09-30 12:30:58,408] INFO : Epoch 00496 | ELBO -7.960e+07 | Running ELBO -7.8117e+07 | Running RMSE 2.079
[2020-09-30 12:31:06,642] INFO : Epoch 00497 | ELBO -7.547e+07 | Running ELBO -7.8117e+07 | Running RMSE 2.079
[2020-09-30 12:31:15,746] INFO : Epoch 00498 | ELBO -8.405e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.078
[2020-09-30 12:31:24,526] INFO : Epoch 00499 | ELBO -8.042e+07 | Running ELBO -7.8118e+07 | Running RMSE 2.078
[2020-09-30 12:31:33,190] ERROR : session panic!
[2020-09-30 12:31:33,483] INFO : saving session to my-run/exception.session
Traceback (most recent call last):
  File "/nfs/users/nfs_l/lm17/.local/bin/xfuse", line 8, in <module>
    sys.exit(cli())
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/utility/utility.py", line 281, in _wrapped
    return f(*args, **kwargs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/__main__.py", line 51, in _wrapped
    return f(*args, **kwargs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/__main__.py", line 312, in run
    _run(
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/run.py", line 144, in run
    train(epochs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/train.py", line 110, in train
    elbo = _epoch(epoch=epoch)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/poutine/runtime.py", line 263, in _fn
    apply_stack(msg)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/poutine/runtime.py", line 198, in apply_stack
    default_process_message(msg)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/poutine/runtime.py", line 159, in default_process_message
    msg["value"] = msg["fn"](*msg["args"], **msg["kwargs"])
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/train.py", line 90, in _epoch
    elbo.append(_step(x=to_device(x)))
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/poutine/runtime.py", line 263, in _fn
    apply_stack(msg)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/poutine/runtime.py", line 198, in apply_stack
    default_process_message(msg)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/poutine/runtime.py", line 159, in default_process_message
    msg["value"] = msg["fn"](*msg["args"], **msg["kwargs"])
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/train.py", line 78, in _step
    return -pyro.infer.SVI(model.model, model.guide, optim, loss).step(x)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/infer/svi.py", line 128, in step
    loss = self.loss_and_grads(self.model, self.guide, *args, **kwargs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/infer/trace_elbo.py", line 126, in loss_and_grads
    for model_trace, guide_trace in self._get_traces(model, guide, args, kwargs):
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/infer/elbo.py", line 170, in _get_traces
    yield self._get_trace(model, guide, args, kwargs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/infer/trace_elbo.py", line 52, in _get_trace
    model_trace, guide_trace = get_importance_trace(
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/infer/enum.py", line 47, in get_importance_trace
    model_trace = poutine.trace(poutine.replay(model, trace=guide_trace),
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/poutine/trace_messenger.py", line 187, in get_trace
    self(*args, **kwargs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/poutine/trace_messenger.py", line 165, in __call__
    ret = self.fn(*args, **kwargs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/poutine/messenger.py", line 11, in _context_wrap
    return fn(*args, **kwargs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/poutine/messenger.py", line 11, in _context_wrap
    return fn(*args, **kwargs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/model/xfuse.py", line 90, in model
    return {e: _go(self.get_experiment(e), x) for e, x in xs.items()}
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/model/xfuse.py", line 90, in <dictcomp>
    return {e: _go(self.get_experiment(e), x) for e, x in xs.items()}
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/model/xfuse.py", line 88, in _go
    return experiment.model(x, zs)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/model/experiment/st/st.py", line 389, in model
    image_distr = self._sample_image(x, decoded)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/model/experiment/image.py", line 184, in _sample_image
    pyro.sample(
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/primitives.py", line 113, in sample
    apply_stack(msg)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/pyro/poutine/runtime.py", line 201, in apply_stack
    frame._postprocess_message(msg)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/handlers/stats/stats_handler.py", line 76, in _postprocess_message
    self._handle(**msg)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/handlers/stats/image.py", line 15, in _handle
    self.add_images("image/ground_truth", (1 + value) / 2)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/xfuse/handlers/stats/stats_handler.py", line 37, in <lambda>
    lambda *args, **kwargs: method(
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 583, in add_images
    image(tag, img_tensor, dataformats=dataformats), global_step, walltime)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/torch/utils/tensorboard/summary.py", line 310, in image
    tensor = convert_to_HWC(tensor, dataformats)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/torch/utils/tensorboard/_utils.py", line 107, in convert_to_HWC
    tensor_CHW = make_grid(tensor_NCHW)
  File "/nfs/users/nfs_l/lm17/.local/lib/python3.8/site-packages/torch/utils/tensorboard/_utils.py", line 76, in make_grid
    assert I.ndim == 4 and I.shape[1] == 3
AssertionError

ludvb commented 4 years ago

Hi, my guess is that this error is caused by an alpha channel in your image file. You can check this with ImageMagick by running

magick identify -format "%[channels]\n" /path/to/image

If the image has an alpha channel, the command will output something like "srgba". You can strip the alpha channel by running

magick convert /path/to/image -alpha off /path/to/new_image

Verify by running magick identify on the new image. It should now output something like "srgb", without an "a" at the end.
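
If ImageMagick is not at hand, a similar check can be sketched in Python using only the standard library. This is a rough sketch that assumes the image is a PNG file: it reads the color type byte from the IHDR header, where types 4 (grayscale + alpha) and 6 (RGBA) declare an alpha channel. The function names `png_has_alpha` and `file_has_alpha` are made up for illustration and are not part of xfuse. Palette PNGs that carry transparency in a separate tRNS chunk are not detected, and other formats (TIFF, JPEG) need a different approach.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# PNG color types that include an alpha channel: 4 = gray + alpha, 6 = RGBA.
ALPHA_COLOR_TYPES = {4, 6}


def png_has_alpha(header: bytes) -> bool:
    """Report whether the first 26 bytes of a PNG declare an alpha channel."""
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR data starts at byte 16: width (4), height (4), bit depth (1), color type (1).
    _width, _height, _bit_depth, color_type = struct.unpack(">IIBB", header[16:26])
    return color_type in ALPHA_COLOR_TYPES


def file_has_alpha(path: str) -> bool:
    """Hypothetical convenience wrapper: check a PNG file on disk."""
    with open(path, "rb") as f:
        return png_has_alpha(f.read(26))
```

Calling `file_has_alpha("/path/to/image")` on a PNG would then play the role of the `magick identify` check above.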

The xfuse convert command currently doesn't check the image mode. We should probably remove the alpha channel if it's present. I will label this as a bug and implement a fix in a future release. Thanks for reporting!
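
For context on the traceback: the failing line, `assert I.ndim == 4 and I.shape[1] == 3`, requires a batch of images in NCHW layout with exactly three channels, so a batch of RGBA images (four channels) cannot pass it. The snippet below mirrors that condition on plain shape tuples. It is only an illustration: the function name and the example shapes are made up here and are not taken from xfuse or PyTorch.

```python
from typing import Tuple


def passes_make_grid_check(shape: Tuple[int, ...]) -> bool:
    """Mirror the condition `I.ndim == 4 and I.shape[1] == 3` on a shape tuple."""
    return len(shape) == 4 and shape[1] == 3


# Illustrative shapes in (N, C, H, W) order; the sizes are arbitrary.
rgb_batch = (8, 3, 256, 256)   # three channels: condition holds
rgba_batch = (8, 4, 256, 256)  # extra alpha channel: condition fails

print(passes_make_grid_check(rgb_batch))
print(passes_make_grid_check(rgba_batch))
```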

LucaMarconato commented 4 years ago

Thanks @ludvb, the image had an alpha channel indeed! Removing it as you suggested fixed the issue.