juglab / n2v

This is the implementation of Noise2Void training.

N2V for large datasets #151

Open GregorSkoul opened 3 months ago

GregorSkoul commented 3 months ago

Hello everyone,

I would like to thank you for creating such a nice module! I feel like it has great potential for all sorts of image-processing tasks.

I tried n2v myself on the datasets we usually deal with and found it a little confusing for large quantities of data. I followed the SEM example to reproduce training and prediction - it worked well for a single image. However, when I tried to put more than one frame into training and prediction, it became complicated.

My first issue was understanding how to shape the data properly for creating patches and subsequent training. The dataset I was using was a single-channel numpy array of 256x256 frames, so the shape after loading is (N_frames, 256, 256). As I understood from the example, the shape for patching and training should be (N_frames, 256, 256, 1). But when I try to feed that to the patching function I get errors about wrong dimensions. By trial and error I figured out that the shape (1, N_frames, 256, 256, 1) is compatible with patching - at least it no longer throws errors - but the console is littered with repeated `Generated patches: (16, 64, 64, 1)` messages, which seems quite odd.

Afterwards I was finally able to train the model, but another issue popped up when predicting. I was unable to make model.predict() work on an array of frames at all. No matter how I shaped the data I always received errors like `axes(YX) must be of length 3` or `image with shape (100,256,256) and axes SYX not compatible with target axes YXC`. The only way it worked for me was a crude for loop that went through the dataset, fed the images one by one, and collected the results into a new image array manually.

So the questions I have are the following:

* how to properly shape the data for training if there are large numbers of single channel images as numpy arrays?

* how to properly and time-efficiently denoise such datasets?

* can n2v or examples be improved to be more straightforward for large datasets?

I tested these things in a python 3.7 environment with CUDA 11.2. I attach a zip with a jupyter notebook, a chunk of data as an example, and a text file listing the modules installed in the environment. N2V for large datasets issue.zip

I will much appreciate your help!

jdeschamps commented 3 months ago

Hi,

The underlying library (TF/Keras) always requires a channel dimension (dimensions are ordered S(Z)YXC), so that's why in the example notebooks you will often see something like:

```python
X = X[..., np.newaxis]
```

This adds a singleton channel dimension (if you look at X.shape, it will have a dimension of size 1 at the end).
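As a minimal illustration (the array shape here is just an example, matching the 256x256 frames discussed above):

```python
import numpy as np

# A stack of 100 single-channel 256x256 frames, as loaded from disk.
X = np.zeros((100, 256, 256))

# Append a singleton channel axis so the dimensions read SYXC.
X = X[..., np.newaxis]

print(X.shape)  # (100, 256, 256, 1)
```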

> how to properly shape the data for training if there are large numbers of single channel images as numpy arrays?

The patching function is meant to create patches from single images, or from a list of images. So if you have N_frames, you should put the numpy arrays in a list: [frame_1, frame_2, ...]. Then you can use `N2V_DataGenerator.generate_patches_from_list` to create patches of the correct shape for your training.
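A sketch of that list-building step, assuming the data was loaded as one stack. The shaping with numpy is shown directly; the patching call itself is left commented because it requires n2v installed, and the exact import path may differ between versions:

```python
import numpy as np

# Suppose the data was loaded as a single array of shape (N_frames, 256, 256).
stack = np.zeros((100, 256, 256))

# Split it into a list of per-frame arrays; each frame gets a leading
# sample axis and a trailing channel axis, i.e. shape (1, 256, 256, 1),
# matching the SYXC ordering the 2D patch generator expects.
frames = [frame[np.newaxis, ..., np.newaxis] for frame in stack]

# The patching call itself (requires n2v; import path as of recent versions):
# from n2v.internals.N2V_DataGenerator import N2V_DataGenerator
# datagen = N2V_DataGenerator()
# patches = datagen.generate_patches_from_list(frames, shape=(64, 64))
```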

Regarding the prediction, I would then use the same original list and loop over the elements to call the prediction, making sure that each one is "YXC", which means with a singleton dimension at the end.
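Such a loop could look like the following sketch. The `predict` function here is a hypothetical placeholder standing in for a trained model's `model.predict(img, axes='YXC')`, so the shaping logic can be shown on its own:

```python
import numpy as np

stack = np.zeros((100, 256, 256))  # frames to denoise

# Hypothetical placeholder; with the real library this would be the
# trained model's method: model.predict(img, axes='YXC')
def predict(img, axes):
    return img  # identity stand-in for the denoised result

denoised = []
for frame in stack:
    img = frame[..., np.newaxis]  # (256, 256, 1), axes YXC
    denoised.append(predict(img, axes='YXC')[..., 0])
denoised = np.stack(denoised)

print(denoised.shape)  # (100, 256, 256)
```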

> how to properly and time-efficiently denoise such datasets?

You can also try https://github.com/juglab/napari-n2v, which has additional checks on the data shape for training and might make this more straightforward.

> can n2v or examples be improved to be more straightforward for large datasets?

Not really, because we no longer develop this library. While this is bad news, there is better news: we are reimplementing everything (including additional algorithms) in a new library that we hope will be easier to maintain and use in the future.

So for now you can try to solve your issue with this library; I don't think there is anything preventing you from doing what you want to achieve here.

We will link to the new library here in the coming weeks, and we will be happy to receive feedback there when things don't work!

hsuominen commented 1 month ago

In case someone else finds this thread searching for a pytorch implementation of n2v, the "new library" can be found here: https://github.com/CAREamics/careamics

jdeschamps commented 1 month ago

Yes, we are pushing for the first PyPI release, and I have been a bit careful not to advertise it too early, as we are currently making breaking changes.

As soon as we feel more comfortable (in the coming weeks), we will put a link in the readme here.