Continued from here: https://github.com/stegmaierj/XPIWITPipelines/issues/1#issuecomment-959006007
In models/UNet3D_cellpose.py there seem to be some settings necessary that I don't understand right away. What do I have to enter here, for example: https://github.com/stegmaierj/Cellpose3D/blob/0ebdfd8090eb4b19a57b20c29bd3b91c4cfec7b9/models/UNet3D_cellpose.py#L235
At least I suspect this line is behind the problem. Is there perhaps a sample dataset with default settings to try out?
For the training and application scripts, the data is expected to be in hdf5 format. There are two helper functions ("prepare_images" and "prepare_masks") in "utils/h5_converter.py" that can be used to convert your own data. Those functions convert tif files into hdf5 files, which then contain different groups. The groups are created automatically and can be specified, e.g., in the line mentioned above. I just noticed that the default parameters for "prepare_masks" were not correct: the flag "get_flows", not the flag "get_boundary", needs to be set to True. Could this be the reason for the missing "flow_x" group, or did you already change the parameters before converting your data? We corrected this right away, sorry for the confusion.
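For reference, here is a minimal sketch of the conversion step. The exact call patterns are assumptions (please check the actual function signatures in "utils/h5_converter.py"); only the get_flows/get_boundary flags are taken from the discussion above:

```python
from utils.h5_converter import prepare_images, prepare_masks

# Convert raw tif stacks into hdf5 files; the groups are created automatically.
prepare_images(['raw/stack_01.tif'])  # call pattern assumed, not verified

# The flow groups (e.g. "flow_x") are only written when get_flows=True;
# setting get_boundary=True alone is not sufficient.
prepare_masks(['masks/stack_01.tif'], get_flows=True)
```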
Thanks, that seems to have helped here! I had converted my files according to the instructions, but I guess these parameters make the difference.
After that I came across two problems that I was able to solve myself; I list them here just for the sake of completeness:
File "/home/saskra/PycharmProjects/Cellpose3D/train_network.py", line 82, in main
period=5
TypeError: __init__() got an unexpected keyword argument 'save_top_k'
File "/home/saskra/PycharmProjects/Cellpose3D/train_network.py", line 98, in main
resume_from_checkpoint=resume_ckpt
TypeError: __init__() got an unexpected keyword argument 'resume_from_checkpoint'
But now I'm stuck here:
```
Traceback (most recent call last):
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/pt_overrides/override_data_parallel.py", line 165, in _worker
    output = module.validation_step(*input, **kwargs)
TypeError: validation_step() takes 3 positional arguments but 4 were given
```
Also, I always get error messages like this:
```
Only -1/26 files are used for training! Increase the samples per epoch.
Only -1/2 files are used for training! Increase the samples per epoch.
```
And it made no difference at all what number I put in the corresponding line.
However, I only got rid of the two TypeErrors by simply deleting the offending parameters. Maybe they depend on a certain version of pytorch_lightning. I created my environment using the YML file from the repository; is it still up to date?
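Roughly, this is the work-around I used (a sketch only; the checkpoint path is a placeholder, and the argument style follows the older pytorch-lightning API my environment happened to have installed):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Construct the callback and trainer without the keyword arguments the
# installed pytorch-lightning version does not know about.
checkpoint_callback = ModelCheckpoint(filepath='checkpoints/')  # no save_top_k / period
trainer = Trainer(checkpoint_callback=checkpoint_callback)      # no resume_from_checkpoint
```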
The first three errors could be related to using a different version of pytorch-lightning than we did at the time. The pipeline was tested up to version 0.7.1. If you used another version, could you please try to run it with 0.7.1 instead? I hope that solves the issue. However, we are working on updating all elements to be compatible with an up-to-date version of pytorch-lightning.
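If you want to fail fast on a mismatched environment, a simple guard like this could be placed at the top of train_network.py (the version string is the one mentioned above):

```python
import pytorch_lightning as pl

# The pipeline was tested up to pytorch-lightning 0.7.1; other versions
# changed the ModelCheckpoint/Trainer keyword arguments.
if pl.__version__ != "0.7.1":
    raise RuntimeError(
        f"Found pytorch-lightning {pl.__version__}, but the pipeline was "
        "tested with 0.7.1 (pip install pytorch-lightning==0.7.1)."
    )
```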
The warnings thrown for setting "samples_per_epoch" to -1 (which indicates that ALL available images should be used once per epoch) are indeed misleading and wrong. The condition for throwing this warning was missing one argument, which we have now fixed. Thanks for pointing this out!
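For anyone reading along, here is a hypothetical sketch of what the corrected check has to do; the names are assumptions, not the repository's actual code:

```python
def effective_samples(samples_per_epoch, file_list):
    # -1 means "use every available image once per epoch" and must not warn.
    if samples_per_epoch < 0:
        return len(file_list)
    if samples_per_epoch < len(file_list):
        print(f'Only {samples_per_epoch}/{len(file_list)} files are used for '
              'training! Increase the samples per epoch.')
    return samples_per_epoch
```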
Yes, it was due to the version of pytorch-lightning; the environment.yml probably lists a wrong one.
The next error message follows:
```
Epoch 1: 0%| | 0/28 [00:00<?, ?it/s]python-BaseException
Traceback (most recent call last):
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
    self.run_training_epoch()
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 413, in run_training_epoch
    output = self.run_training_batch(batch, batch_idx)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_batch
    loss = optimizer_closure()
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 529, in optimizer_closure
    split_batch, batch_idx, opt_idx, self.hiddens)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 663, in training_forward
    output = self.model(*args)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
    return self.gather(outputs, self.output_device)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/home/saskra/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration
```
Unfortunately, I could not reproduce the error on my machine, and I would need some more information about your setup or details about the data that you use. However, to eliminate possible environment-related issues, I updated the environment file and uploaded a Windows and an Ubuntu version. Sorry for the inconvenience, but trying again with the new environment files might help to narrow down the problem.
Unfortunately, the new environment did not change the last error message. In fact, the reason seems to be that I wanted to use all three graphics cards. With only one I get further, though of course more slowly.
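For completeness, the single-GPU setting that worked is essentially this (0.7.x-style Trainer argument; the rest of the trainer configuration is omitted):

```python
from pytorch_lightning import Trainer

# Restricting training to a single device avoids the gather error that only
# appeared with multi-GPU data-parallel training.
trainer = Trainer(gpus=1)  # instead of gpus=3 or a list of device ids
```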
Okay, sorry, I can't really help here; so far we don't have much experience with multi-GPU systems. Nevertheless, it's good to hear that it at least works with a single GPU.