stegmaierj / Cellpose3D

3D Extension of the Cellpose Algorithm by Stringer et al., 2021.
Apache License 2.0

Object 'flow_x' doesn't exist #2

Closed: saskra closed this issue 1 year ago

saskra commented 2 years ago

Continued from here: https://github.com/stegmaierj/XPIWITPipelines/issues/1#issuecomment-959006007

In models/UNet3D_cellpose.py there seem to be some settings required that I don't understand right away. What do I have to enter here, for example: https://github.com/stegmaierj/Cellpose3D/blob/0ebdfd8090eb4b19a57b20c29bd3b91c4cfec7b9/models/UNet3D_cellpose.py#L235

At least I suspect that this line is behind the following problem:

train_network.py --output_path own_model/results/ --log_path own_model/logs/ --gpus 3 --model Cellpose3D
Connected to pydev debugger (build 212.5457.59)
Saving 20 data samples for sanity checks...
Getting statistics from images...
Only -1/26 files are used for training! Increase the samples per epoch.
Only -1/26 files are used for training! Increase the samples per epoch.
Only -1/26 files are used for training! Increase the samples per epoch.
python-BaseException
Traceback (most recent call last):
  File "/home/saskra/PycharmProjects/Cellpose3D/dataloader/h5_dataloader.py", line 237, in __getitem__
    mask_tmp = f_handle[group_name]           
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/h5py/_hl/group.py", line 264, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'flow_x' doesn't exist)"
Only -1/26 files are used for training! Increase the samples per epoch.

Is there perhaps a sample dataset with default settings to try out?
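
For reference, which groups a converted file actually contains can be checked with plain h5py (the file path below is just a placeholder for one of the converted mask files):

import h5py

# List the top-level groups/datasets of a converted HDF5 file.
# If the conversion produced what the dataloader expects, 'flow_x' should appear here.
with h5py.File('own_model/data/converted_mask.h5', 'r') as f_handle:
    print(list(f_handle.keys()))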

DEschweiler commented 2 years ago

For the training and application scripts, the data is expected to be in HDF5 format. There are two helper functions ("prepare_images" and "prepare_masks") in "utils/h5_converter.py" that can be used to convert your own data. They convert tif files into HDF5 files, which then contain different groups; those groups are created automatically and can be specified, e.g., in the line mentioned above.

I just noticed that the default parameters for "prepare_masks" were not correct: the flag "get_flows" (and not the flag "get_boundary") needs to be set to True. Could this be the reason for the missing "flow_x" group, or did you already change the parameters before converting your data? We corrected this right away, sorry for the confusion.
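
A rough conversion sketch, assuming the two helpers take lists of tif paths plus the flags mentioned above (the exact signatures, file lists, and any group names other than "flow_x" are assumptions, so please check utils/h5_converter.py):

from utils.h5_converter import prepare_images, prepare_masks

# Hypothetical file lists; replace with your own raw images and instance masks.
image_files = ['data/raw/image_01.tif', 'data/raw/image_02.tif']
mask_files = ['data/masks/mask_01.tif', 'data/masks/mask_02.tif']

# Convert the raw images to HDF5.
prepare_images(image_files)

# Convert the instance masks to HDF5; get_flows=True creates the flow groups
# (e.g. 'flow_x') that training expects, get_boundary is not needed here.
prepare_masks(mask_files, get_flows=True)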

saskra commented 2 years ago

Thanks, that seems to have helped here! I had converted my files according to the instructions, but I guess these parameters make the difference.

After that I came across two problems that I seem to have been able to solve myself and list here just for the sake of completeness:

  File "/home/saskra/PycharmProjects/Cellpose3D/train_network.py", line 82, in main
    period=5
TypeError: __init__() got an unexpected keyword argument 'save_top_k'
  File "/home/saskra/PycharmProjects/Cellpose3D/train_network.py", line 98, in main
    resume_from_checkpoint=resume_ckpt
TypeError: __init__() got an unexpected keyword argument 'resume_from_checkpoint'

But now I'm stuck here:

Traceback (most recent call last):
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/pt_overrides/override_data_parallel.py", line 165, in _worker
    output = module.validation_step(*input, **kwargs)
TypeError: validation_step() takes 3 positional arguments but 4 were given

Also, I always get error messages like this:

Only -1/26 files are used for training! Increase the samples per epoch.
Only -1/2 files are used for training! Increase the samples per epoch.

And it made no difference at all what number I put in that line:

https://github.com/stegmaierj/Cellpose3D/blob/646b7ac57cbb42ddfb5627d1e4336feb7b71ed45/models/UNet3D_cellpose.py#L239

saskra commented 2 years ago

Regarding the two TypeErrors above ('save_top_k' and 'resume_from_checkpoint'): I only got rid of them by simply deleting the corresponding parameters. Maybe they depend on a certain version of pytorch_lightning. I created my environment using the YML file from the repository; is it still up to date?
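
A quick way to check which pytorch-lightning version actually ended up in that environment (standard Python, nothing project-specific):

# Print the installed pytorch-lightning version to compare against the one the repository was tested with.
import pytorch_lightning
print(pytorch_lightning.__version__)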

DEschweiler commented 2 years ago

The first three errors could be related to using a different version of pytorch-lightning than the one we used at the time. The pipeline was tested up to version 0.7.1; if you used another version, could you please try to run it with 0.7.1 instead? I hope that solves the issue. However, we are working on updating all elements to be compatible with an up-to-date version of pytorch-lightning.
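
If you want to keep the existing conda environment, downgrading inside it would look like this (pip shown here; the package name on PyPI is pytorch-lightning):

pip install pytorch-lightning==0.7.1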

The warnings thrown when setting "samples_per_epoch" to -1 (which indicates that ALL available images should be used once per epoch) are indeed misleading and wrong. The condition for throwing this warning was missing one argument, which we have now fixed. Thanks for pointing this out!
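
For illustration only, the intended behaviour could be sketched like this (a hypothetical helper, not the actual repository code):

# Hypothetical sketch: -1 means "use every available file exactly once per epoch",
# so the warning should only fire for positive values smaller than the file count.
def resolve_samples_per_epoch(samples_per_epoch, num_files):
    if samples_per_epoch < 0:
        return num_files
    if samples_per_epoch < num_files:
        print('Only %i/%i files are used for training! Increase the samples per epoch.'
              % (samples_per_epoch, num_files))
    return samples_per_epoch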

saskra commented 2 years ago

Yes, it was due to the version of pytorch-lightning; the one pinned in environment.yml is then probably wrong.

The next error message follows:

Epoch 1:   0%|          | 0/28 [00:00<?, ?it/s]python-BaseException
Traceback (most recent call last):
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
    self.run_training_epoch()
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 413, in run_training_epoch
    output = self.run_training_batch(batch, batch_idx)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_batch
    loss = optimizer_closure()
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 529, in optimizer_closure
    split_batch, batch_idx, opt_idx, self.hiddens)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 663, in training_forward
    output = self.model(*args)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
    return self.gather(outputs, self.output_device)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

DEschweiler commented 2 years ago

Unfortunately, I could not reproduce the error on my machine, and I would need some more information about your setup or details about the data that you use. However, to eliminate possible environment-related issues, I updated the environment file and uploaded a Windows and an Ubuntu version. Sorry for the inconvenience, but trying again with the new environment files might help to narrow down the problem.

saskra commented 2 years ago

Unfortunately, the new environment did not change the last error message. In fact, the reason seems to be that I wanted to use all three graphics cards. With only one GPU I get further, though of course more slowly.
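
For completeness, a single-GPU run along the lines of the original command would look like this (CUDA_VISIBLE_DEVICES is the standard way to restrict PyTorch to one card; the paths are the same placeholders as above):

CUDA_VISIBLE_DEVICES=0 python train_network.py --output_path own_model/results/ --log_path own_model/logs/ --gpus 1 --model Cellpose3D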

DEschweiler commented 2 years ago

Okay, sorry, I can't really help here; so far we don't have much experience with multi-GPU systems. Nevertheless, it's good to hear that it at least works with a single GPU.