juglab / cryoCARE_pip

PIP package of cryoCARE
BSD 3-Clause "New" or "Revised" License

Results from cryoCARE_pip looking worse than cryoCARE_T2T #30

Closed rdrighetto closed 1 year ago

rdrighetto commented 1 year ago

Hi everyone,

I've been using the latest version of cryo-CARE from this repo for a while now; however, I've noticed that I don't quite get the same quality of results I used to get with the old version from https://github.com/juglab/cryoCARE_T2T. I have experimented a lot with the training parameters, but never managed to achieve the same denoising quality (judging visually). For comparison, I ran the same odd/even tomos through both versions using exactly the same training parameters (see below), and here's what I get:

cryoCARE_T2T: HDCR_cryocare_t2t

cryoCARE_pip: HDCR_cryocare_pip

The contrast obtained with the old version is much stronger, and the new version also shows lots of background artifacts near the bacterial S-layer, i.e. surrounding the bacterium. In short, the result from cryoCARE_T2T looks much cleaner (as was always the case), and I'd expect cryoCARE_pip to yield similar results.

The convergence behavior is also remarkably different in each case:

cryoCARE_T2T: cryocare_t2t_convergence

cryoCARE_pip: cryocare_pip_convergence

While for cryoCARE_T2T both the training and the validation loss (MSE) decrease ~monotonically, cryoCARE_pip clearly overfits (and that may be exactly the problem). Here are the model parameters, which were identical in both cases:

config.json
{
"n_dim": 3, 
"axes": "ZYXC", 
"n_channel_in": 1, 
"n_channel_out": 1, 
"train_checkpoint": "weights_best.h5", 
"train_checkpoint_last": "weights_last.h5", 
"train_checkpoint_epoch": "weights_now.h5", 
"probabilistic": false, 
"unet_residual": true, 
"unet_n_depth": 3, 
"unet_kern_size": 3, 
"unet_n_first": 32, 
"unet_last_activation": "linear", 
"unet_input_shape": [null, null, null, 1], 
"train_loss": "mse", 
"train_epochs": 125, 
"train_steps_per_epoch": 75, 
"train_learning_rate": 0.0004, 
"train_batch_size": 16, 
"train_tensorboard": true, 
"train_reduce_lr": {"factor": 0.5, "patience": 10, "min_delta": 0}
}
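For reference, the "train_reduce_lr" entry configures a reduce-on-plateau learning rate schedule (in Keras, the ReduceLROnPlateau callback). A simplified sketch of what {"factor": 0.5, "patience": 10, "min_delta": 0} does, ignoring details such as cooldown and a minimum learning rate:

```python
# Simplified reduce-on-plateau schedule: halve the learning rate after
# `patience` epochs without improvement of the monitored (validation) loss.
def reduce_lr(losses, lr=4e-4, factor=0.5, patience=10, min_delta=0.0):
    best, wait = float("inf"), 0
    for loss in losses:
        if loss < best - min_delta:
            best, wait = loss, 0   # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                lr *= factor       # plateau: reduce the learning rate
                wait = 0
    return lr

# One improvement followed by 10 stagnant epochs triggers one halving:
print(reduce_lr([1.0] + [1.0] * 10))  # 0.0002
```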

The only caveat is that the training data is not exactly the same between the two runs: each run extracted 1200 subvolumes randomly, using 10% of those for validation (box size is 72 and the tomogram size is 928 x 928 x 464). However, results are consistent across different runs using different random extractions, as well as when extracting a higher number of subvolumes, etc.
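As a back-of-the-envelope sketch using the numbers above (whether this alone explains the overfitting gap is speculative), the training schedule revisits each patch many times:

```python
# With 1200 subvolumes and a 10% validation split, training sees ~1080
# patches, while 125 epochs x 75 steps x batch size 16 draws far more
# samples, so each training patch is revisited over a hundred times.
n_train = int(1200 * 0.9)       # 1080 training patches
draws = 125 * 75 * 16           # 150000 patch draws over the whole run
revisits = draws / n_train
print(n_train, draws, round(revisits))  # 1080 150000 139
```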

So, what changed between https://github.com/juglab/cryoCARE_T2T and https://github.com/juglab/cryoCARE_pip that would cause this behavior? In my virtual environments I currently have tensorflow==2.9.0 for cryoCARE_T2T and tensorflow==2.4.0 for cryoCARE_pip, so that could also be the culprit, but it's strange that one overfits while the other doesn't. I tried to update cryoCARE_pip to tensorflow==2.9.0, but it didn't work in my environment with CUDA 11.7 and only used the CPU. I'm happy to provide more details, as well as the data and models I used for these tests, if anyone wants to have a look.

Thanks!

EuanPyle commented 1 year ago

Hey, I'm not part of the cryoCARE team, but I'm interested to hear about this! I'm running a test now to compare CC with tensorflow==2.4.0 and tensorflow==2.9.0; I'll let you know the results.

Re: the GPU not being found with tf 2.9.0, I had the same issue. Updating LD_LIBRARY_PATH so it can find libcudnn.so.8 fixed it for me, e.g. export LD_LIBRARY_PATH="/d/emr207/u/pe002/miniforge/envs/cryocare/lib:$LD_LIBRARY_PATH". In my case, libcudnn.so.8 was in the lib directory of the environment installation. You can check whether the GPU device is visible in Python via:

import tensorflow as tf
tf.config.list_physical_devices('GPU')

EDIT: Oh no, even with this line it still crashes due to libraries not being available! Will try to fix...

EuanPyle commented 1 year ago

Ok, got it working with tensorflow 2.9.0: when installing the package, don't pin cudnn=8.0; just install cudnn, which should pull in the most recent version (8.4.1). Then apply the LD_LIBRARY_PATH trick from the comment above.
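Putting the LD_LIBRARY_PATH step into a reusable form (the environment prefix below is hypothetical; substitute your own conda/miniforge env path):

```shell
# Hypothetical conda/miniforge environment prefix -- adjust to your setup.
ENV_PREFIX="$HOME/miniforge/envs/cryocare"

# Prepend the env's lib directory so TensorFlow can locate libcudnn.so.8;
# the ${VAR:+...} expansion avoids a trailing colon if the variable is unset.
export LD_LIBRARY_PATH="$ENV_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```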

EuanPyle commented 1 year ago

I can't tell the difference between models trained with v2.4.0 and v2.9.0, so I don't think it's that.

rdrighetto commented 1 year ago

OK, thanks for testing! This is important info. Did you compare with cryoCARE_T2T by any chance?

EuanPyle commented 1 year ago

I did the test on cryoCARE_pip. I haven't actually used T2T at all.

rdrighetto commented 1 year ago

Sorry, I meant T2T (I've updated the post). Thanks for the reply :-)

FWIW, a colleague at the MPI in Martinsried (@sagarbiophysics) confirms the same issue between cryoCARE_pip and cryoCARE_T2T.

EuanPyle commented 1 year ago

Ah interesting! Are the _pip tomograms an improvement on the non-denoised tomograms?

rdrighetto commented 1 year ago

Yes, they definitely improve on the non-denoised tomograms, just not as much as the old version.

tibuch commented 1 year ago

Hi @rdrighetto,

This is an interesting observation. Thank you for reporting.

Could you provide the list of installed packages for both versions? It could either be that something in the data loaders is different (this would be on the cryoCARE side) or that something changed in the underlying CSBDeep, which is used for the training.

Would it be possible to send the even/odd tomograms to me for testing?

rdrighetto commented 1 year ago

Hi @tibuch, thanks for the reply!

You can download the odd+even tomograms shown here as well as the list of installed packages in the respective virtual environments from here: https://filesender.switch.ch/filesender2/?s=download&token=665490b9-4049-4c94-ad89-e230ca8701ee Please let me know if you need any additional info.

Greetings from the #Tomo2022 conference with @EuanPyle :-)

tibuch commented 1 year ago

Hi everyone,

I might have found some issues related to the quality drop. If you want to try out the fixes please check out the code from the branch fix-quality-issues.

Looking forward to feedback :slightly_smiling_face:

rdrighetto commented 1 year ago

Thanks for the update @tibuch! I just tested the new CryoCAREDataModule.py (by replacing the old version in my current installation -- hope that is fine) and got this error in cryoCARE_extract_train_data.py:

2022-10-04 19:22:42.468472: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
  0%|          | 0/500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/scicore/home/engel0006/GROUP/pool-engel/soft/cryo-care/cryoCARE_pip/cryocare_11/bin/cryoCARE_extract_train_data.py", line 45, in <module>
    main()
  File "/scicore/home/engel0006/GROUP/pool-engel/soft/cryo-care/cryoCARE_pip/cryocare_11/bin/cryoCARE_extract_train_data.py", line 27, in main
    dm.setup(config['odd'], config['even'], n_samples_per_tomo=config['num_slices'],
  File "/scicore/home/engel0006/GROUP/pool-engel/soft/cryo-care/cryoCARE_pip/cryocare_11/lib/python3.8/site-packages/cryocare/internals/CryoCAREDataModule.py", line 187, in setup
    self.train_dataset = CryoCARE_Dataset(tomo_paths_odd=tomo_paths_odd,
  File "/scicore/home/engel0006/GROUP/pool-engel/soft/cryo-care/cryoCARE_pip/cryocare_11/lib/python3.8/site-packages/cryocare/internals/CryoCAREDataModule.py", line 41, in __init__
    self.compute_mean_std(n_samples=n_normalization_samples)
  File "/scicore/home/engel0006/GROUP/pool-engel/soft/cryo-care/cryoCARE_pip/cryocare_11/lib/python3.8/site-packages/cryocare/internals/CryoCAREDataModule.py", line 83, in compute_mean_std
    x, _ = self.__getitem__(i)
  File "/scicore/home/engel0006/GROUP/pool-engel/soft/cryo-care/cryoCARE_pip/cryocare_11/lib/python3.8/site-packages/cryocare/internals/CryoCAREDataModule.py", line 154, in __getitem__
    return self.augment(np.array(even_subvolume)[..., np.newaxis], np.array(odd_subvolume)[..., np.newaxis])
  File "/scicore/home/engel0006/GROUP/pool-engel/soft/cryo-care/cryoCARE_pip/cryocare_11/lib/python3.8/site-packages/cryocare/internals/CryoCAREDataModule.py", line 127, in augment
    rot_axes = tuple([0,1,2].remove(["Z", "Y", "X"].index(self.tilt_axis)))
TypeError: 'NoneType' object is not iterable
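For context, that TypeError is the classic list.remove() pitfall: remove() mutates the list in place and returns None, so its result cannot be passed to tuple(). A minimal reproduction (assuming tilt_axis is "Y"):

```python
axes = [0, 1, 2]
tilt_index = ["Z", "Y", "X"].index("Y")   # -> 1

# list.remove() mutates the list in place and returns None...
result = axes.remove(tilt_index)
assert result is None
assert axes == [0, 2]

# ...so tuple(axes.remove(...)) raises TypeError.  Building the tuple
# from the already-mutated list works as intended:
rot_axes = tuple(axes)
print(rot_axes)  # (0, 2)
```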
tibuch commented 1 year ago

@rdrighetto thanks for trying! Now it should work :crossed_fingers:

rdrighetto commented 1 year ago

The latest fix from https://github.com/juglab/cryoCARE_pip/commit/414bd5d4300f5028291d9f5bb10ff31668a98f3a seems to resolve the quality issues. I ran the full pipeline (train data generation -> training -> prediction) twice. Results are below:

fix-run1 cryocare_pip-fix-run1

fix-run2 cryocare_pip-fix-run2

Previous results from old version cryoCARE_T2T for comparison: cryocare_t2t_new

I'd say the quality is pretty much the same in all cases. Of course, results differ a little between runs and from the old version due to the randomness in training data generation, but they are all quite good IMO.

Interestingly, both runs of the new version still overfit (all training parameters are the same as before): cryocare_pip-fix-run1-MSE cryocare_pip-fix-run2-MSE. This is not necessarily a problem; judging from the results, the validation set is doing its job. I just find it strange that the MSE is always lower on the validation set than on the training set. Does that make sense?

Thanks a lot @tibuch for taking the time to fix this! If there are no other concerns, the issue can be closed as fixed.

tibuch commented 1 year ago

Thank you for running it and showing the results!

One reason for the validation MSE being lower than the training MSE could be that we always sample validation patches closer to the border, where there is likely more background. This makes the validation set less complex than the training data, which would explain the overall gap between the two curves.

It also hints that our validation set does not cover the same image feature distribution as the training data. I would imagine the validation loss would increase if the validation fraction were increased.
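A toy illustration of that hypothesis (this is not the actual cryoCARE sampler; the extent, patch size, and border width are made up): if validation patch coordinates are drawn only from a border slab while training patches come from the interior, the two sets see systematically different content.

```python
import random

def sample_z(rng, extent=464, patch=72, border=100, validation=False):
    """Draw a patch start coordinate along Z.

    Validation patches come only from a border slab; training patches
    come from the interior, so the two sets differ in image content.
    """
    if validation:
        return rng.randrange(0, border)
    return rng.randrange(border, extent - patch)

rng = random.Random(0)
val_zs = [sample_z(rng, validation=True) for _ in range(100)]
train_zs = [sample_z(rng) for _ in range(100)]
print(max(val_zs) < 100 <= min(train_zs))  # True: the two sets never overlap
```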

I would go ahead and merge this fix and go for a new release.

Thank you everyone for the discussion and testing!

tibuch commented 1 year ago

Just released v0.2.1 and uploaded it to pip.

Happy denoising!