teamtomo / membrain-seg

membrane segmentation in 3D for cryo-ET

on-the-fly preprocessing #55

Open alisterburt opened 4 months ago

alisterburt commented 4 months ago

@LorenzLamm what, if any, are the blockers to doing the preprocessing on-the-fly as part of the inference code? This would simplify usage quite a lot and really enable `<cli_tool> --input tomogram.mrc --output membranes.mrc`, which feels like a good goal

LorenzLamm commented 4 months ago

Yes, good point. Would be nice to have that also in the membrane segmentation CLI.

Depending on which preprocessing tools we would integrate there, the CLI could get a bit messy I guess, but there should be ways around that.

The biggest blockers so far are implementation issues with the current preprocessing tools.

Regarding the implementation issues: we discussed that it may make sense to outsource the preprocessing to libtilt, which probably already offers all of these functionalities (and probably with much better implementations).

I guess it would make sense to:

  1. Make the preprocessing techniques robust
  2. Explore which ones actually make sense to include in the CLI

What do you think?

alisterburt commented 4 months ago

I actually think we can probably get to the point where you don't explicitly tell the program to do any specific preprocessing - it just looks at the data, does the resizing based on the pixel size in the header, and does the amplitude matching based on an on-the-fly measured amplitude spectrum. This shouldn't be too expensive as the tomogram is already in RAM :-)
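
In rough pseudo-Python, the flow I have in mind is something like the skeleton below - the numbered preprocessing steps are just comments and the function is a placeholder, not membrain-seg's actual API; only the mrcfile calls are real:

```python
# Skeleton of one-step inference; the numbered steps are placeholders,
# not membrain-seg's actual API.
import mrcfile

def segment(input_path: str, output_path: str, target_pixel_size: float = 10.0) -> None:
    with mrcfile.open(input_path) as mrc:
        tomogram = mrc.data.astype('float32')
        pixel_size = float(mrc.voxel_size.x)  # header value; may need safeguards

    # 1. rescale the tomogram to target_pixel_size based on the header pixel size
    # 2. estimate the amplitude spectrum on the fly and match it to a reference
    # 3. run the existing sliding-window inference
    segmentation = tomogram  # identity stand-in for the real inference result

    with mrcfile.new(output_path, overwrite=True) as mrc:
        mrc.set_data(segmentation)
        mrc.voxel_size = target_pixel_size
```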

I added some example code to libtilt for calculating the amplitude spectrum of a tomogram on the fly by averaging the amplitude spectra of cubes extracted from the volume.

For 10 cubes the whole thing takes less than a second on CPU on my MacBook Pro, and the spectrum does not change significantly when that becomes 20 cubes, implying the estimate is robust :-) This feels like a good starting point for doing the preprocessing in libtilt if you're unhappy with the spectrum matching implementation as-is
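
For illustration, a stand-alone version of the idea could look roughly like this (a sketch of the approach using random cubes, not the actual libtilt code):

```python
# Sketch: estimate a 1D amplitude spectrum by averaging the FFT amplitudes
# of random cubes, then rotationally averaging over frequency shells.
import torch

def estimate_amplitude_spectrum(
    tomogram: torch.Tensor, n_cubes: int = 10, cube_size: int = 64
) -> torch.Tensor:
    d, h, w = tomogram.shape
    spectra = []
    for _ in range(n_cubes):
        # pick a random cube that fits entirely inside the volume
        z = torch.randint(0, d - cube_size + 1, (1,)).item()
        y = torch.randint(0, h - cube_size + 1, (1,)).item()
        x = torch.randint(0, w - cube_size + 1, (1,)).item()
        cube = tomogram[z:z + cube_size, y:y + cube_size, x:x + cube_size]
        spectra.append(torch.abs(torch.fft.rfftn(cube)))
    mean_spectrum = torch.stack(spectra).mean(dim=0)

    # rotational average: bin each Fourier component by frequency magnitude
    freqs = torch.stack(torch.meshgrid(
        torch.fft.fftfreq(cube_size),
        torch.fft.fftfreq(cube_size),
        torch.fft.rfftfreq(cube_size),
        indexing='ij',
    ), dim=-1).norm(dim=-1)
    n_bins = cube_size // 2 + 1
    bins = (freqs / freqs.max() * (n_bins - 1)).long()
    totals = torch.zeros(n_bins).scatter_add_(0, bins.flatten(), mean_spectrum.flatten())
    counts = torch.zeros(n_bins).scatter_add_(0, bins.flatten(), torch.ones(bins.numel()))
    return totals / counts.clamp(min=1)
```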

[figure: estimated amplitude spectrum, 10 vs 20 cubes]

I opened an issue to add 3D rescaling at teamtomo/libtilt#59 - I can implement this, or you could use it as a chance to get to grips with libtilt? I would happily accept a PR, or we could find a time to pair if you're interested - let me know whether you're interested or if I should implement it myself.

alisterburt commented 4 months ago

To be clear - the preprocessing code would not move into libtilt completely; libtilt would be used to achieve the necessary preprocessing functionality, wherever that ends up living. I might make a 'torch-match-amplitude-spectrum' package in teamtomo for that specific functionality.

alisterburt commented 4 months ago

I added a first draft of some code for 3D rescaling at teamtomo/libtilt#61
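
The gist of the approach, sketched independently of that PR (rescaling via Fourier cropping/padding; this is illustrative, not the actual libtilt code):

```python
# Sketch: rescale a 3D volume to a new pixel size by cropping or
# zero-padding the centred Fourier transform.
import torch

def rescale_3d(volume: torch.Tensor, source_pixel_size: float, target_pixel_size: float) -> torch.Tensor:
    # the new shape preserves the physical extent of the volume
    new_shape = [max(1, round(s * source_pixel_size / target_pixel_size)) for s in volume.shape]
    dft = torch.fft.fftshift(torch.fft.fftn(volume))
    out = torch.zeros(new_shape, dtype=dft.dtype)

    # copy the central region shared by the old and new spectra
    src, dst = [], []
    for old, new in zip(volume.shape, new_shape):
        n = min(old, new)
        src.append(slice(old // 2 - n // 2, old // 2 - n // 2 + n))
        dst.append(slice(new // 2 - n // 2, new // 2 - n // 2 + n))
    out[tuple(dst)] = dft[tuple(src)]

    rescaled = torch.fft.ifftn(torch.fft.ifftshift(out)).real
    return rescaled * (out.numel() / volume.numel())  # keep mean intensity
```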

rdrighetto commented 4 months ago

My understanding is that all pre-processing is optional and users should just play with it if they are not happy with the results out of the box. So nothing prevents one from going straight to inference, in principle.

Of course, if we know that a certain pre-processing operation "always" or "usually" helps, it could be incorporated into the inference code. Right now, after chatting with Lorenz, the only promising candidate for this type of automation seems to be the pixel size matching (and we cannot always rely on the MRC header of the input for that, so we need some safeguards in place to ask the user for the true pixel size if needed).
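
A minimal sketch of the kind of safeguard I have in mind, using mrcfile (the plausibility bounds and the prompt are just illustrative choices):

```python
# Sketch: trust the MRC header only when the voxel size looks plausible,
# otherwise ask the user for the true pixel size.
import mrcfile

def get_pixel_size(path: str, lower: float = 1.0, upper: float = 30.0) -> float:
    with mrcfile.open(path, header_only=True) as mrc:
        pixel_size = float(mrc.voxel_size.x)
    if not lower <= pixel_size <= upper:
        answer = input(
            f"Header pixel size of {pixel_size:.2f} Å looks implausible. "
            "Please enter the true pixel size in Å: "
        )
        pixel_size = float(answer)
    return pixel_size
```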

Regarding the power spectrum matching: your implementation for measuring it is pretty neat, but during inference we would need to compute the FFT of every tile and back in order to apply it... would that still be faster than just matching the power spectrum of the entire tomogram once?

alisterburt commented 4 months ago

@rdrighetto inference already happens over sliding windows, so adding an FFT/match-spectrum/iFFT loop should be ~trivial? You could also apply the matching to the whole tomogram based on the estimate calculated on the fly - the estimation and the application are completely decoupled. I haven't looked at the sliding inference class Lorenz used in detail, but I have used tiler in the past to achieve the same thing and that seems more flexible.
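
For illustration, a per-tile loop with tiler could look roughly like this (`match_spectrum` is an identity stand-in for whatever matching function we settle on; the tile shape and overlap are arbitrary):

```python
# Sketch: apply spectrum matching tile-by-tile with tiler and stitch the
# result back together.
import numpy as np
from tiler import Tiler, Merger

def match_spectrum(fft: np.ndarray, reference) -> np.ndarray:
    # placeholder: would rescale Fourier amplitudes toward the reference
    return fft

def tiled_spectrum_match(tomogram: np.ndarray, reference_spectrum) -> np.ndarray:
    tiler = Tiler(
        data_shape=tomogram.shape,
        tile_shape=(160, 160, 160),
        overlap=0.25,  # fractional overlap between neighbouring tiles
    )
    merger = Merger(tiler)
    for tile_id, tile in tiler(tomogram):
        fft = np.fft.rfftn(tile)
        fft = match_spectrum(fft, reference_spectrum)
        merger.add(tile_id, np.fft.irfftn(fft, s=tile.shape))
    return merger.merge(unpad=True)
```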

The above was proposed to enable the whole process of using the package to become one step - if you don't think that's a worthy goal then that's fine; I was proposing things to simplify life for users, and all proposed features can still be gated behind CLI args with appropriate defaults

rdrighetto commented 4 months ago

I agree the implementation is trivial; my point is about what is best in terms of performance - doing it once on the whole tomogram pre-inference, or doing it for every small tile at inference? (I don't know the answer!)

I think I may have misunderstood what you proposed, sorry. The pre-processing operations would still be optional, but performed at inference time instead of in a separate step beforehand? If that's what you mean then I agree it's interesting, and very efficient in terms of avoiding intermediate MRC files that will hardly ever be cleaned up by the user 😄 I thought you meant to "hide" all the pre-processing from the user and just do it under the hood.

alisterburt commented 4 months ago

Any difference will depend more on specific FFT sizes than anything else... I guess you get more control over that with tiling. There is normally a bit of overlap between tiles, so there's a bit more work overall doing it that way? I don't think it's performance critical overall given how quick the 10 cubic FFTs were on CPU; anything will pale in comparison with the inference itself ☺️

I understood from Lorenz that both pixel size matching and spectrum matching were important, so I was thinking of having them both on by default, but based on what you said maybe that's not best. They can both be associated with Boolean CLI flags like --do-rescale/--no-do-rescale or --do-spectrum-match/--no-do-spectrum-match so they can still be played with (see the sketch below), but yes, the overall goal is to have everything in the inference code to avoid generating a bunch of extra files and to simplify usage
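
With a typer-style CLI (assuming that's what the package uses), boolean options generate exactly that --flag/--no-flag pair automatically; the command name and defaults below are just illustrative:

```python
# Sketch: boolean preprocessing flags on the inference command with typer.
import typer

app = typer.Typer()

@app.command()
def segment(
    input: str = typer.Option(..., help="Tomogram to segment (.mrc)"),
    output: str = typer.Option(..., help="Output membrane segmentation (.mrc)"),
    do_rescale: bool = typer.Option(True, help="Rescale to the training pixel size"),
    do_spectrum_match: bool = typer.Option(False, help="Match a reference amplitude spectrum"),
) -> None:
    ...  # load, optionally preprocess, run sliding-window inference, save

if __name__ == "__main__":
    app()
```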

LorenzLamm commented 4 months ago

Thanks a lot for starting the implementation @alisterburt !! I also agree that it should be easy to integrate the spectrum matching into the sliding window inference.

I do think that both rescaling and spectrum matching can be useful. As Ricardo said, I believe that rescaling is more important, considering the large variety of pixel sizes out there across different tomograms / datasets. So I think this is something that could be applied by default.

For Fourier spectrum matching, we will need to do more extensive evaluations of when it really helps and how best to apply it. But I do think that it can help in many situations, and it would be cool to have it easily activated using a Boolean CLI flag (though maybe better not activated by default).

The problem I see here is that we do not have one single tomogram to which we can match the spectra of all other tomograms. I think the choice of the tomogram to match to may also depend on the content of the respective input tomogram. Maybe, with a bit more exploration of this functionality, we will find settings that work well most of the time, but I guess that still needs a lot of testing.

alisterburt commented 4 months ago

@LorenzLamm I assumed you had rescaled the training data such that they were all similar?

LorenzLamm commented 4 months ago

@alisterburt Not exactly rescaled, but we required them to be in a similar pixel size range. Unfortunately they are not perfectly on the same scale - the range goes from roughly 10 to 14 Å - plus some random rescaling as data augmentation. So the model is robust to a range of pixel sizes, but if the pixel size is too far off, one should definitely rescale (we use 10 Å as the default value).

I guess it makes sense at some point to rescale the training data to a unified pixel size and always require input tomograms to be close to a pixel size of 10 Å (or whatever value we choose).
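
A hypothetical one-off script for that, reusing the `rescale_3d` sketch from earlier in the thread (all names here are placeholders):

```python
# Sketch: bring a training tomogram to a unified 10 Å pixel size and
# write it back out with the corrected header.
import mrcfile
import torch

TARGET_PIXEL_SIZE = 10.0  # Å

def unify_pixel_size(in_path: str, out_path: str) -> None:
    with mrcfile.open(in_path) as mrc:
        volume = torch.as_tensor(mrc.data.astype('float32'))
        pixel_size = float(mrc.voxel_size.x)
    rescaled = rescale_3d(volume, pixel_size, TARGET_PIXEL_SIZE)  # sketched earlier
    with mrcfile.new(out_path, overwrite=True) as mrc:
        mrc.set_data(rescaled.numpy())
        mrc.voxel_size = TARGET_PIXEL_SIZE
```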

alisterburt commented 4 months ago

The training data aren't at the same pixel size?

I was actually asking about the amplitude spectrum scaling, though - I guess it's safe to assume that is also not controlled for?

Both things could be controlled for in the training data, giving us a fixed target to aim for, then varied as augmentations, what do you think?

Naively, not forcing the model to learn to deal with pixel size/spectrum scaling should increase performance if we can ensure the same pixel size/amplitude scale at inference time?

LorenzLamm commented 4 months ago

That's an interesting idea! I have never thought of normalizing the amplitude spectra.

Generally, I'm more in favor of stronger data augmentations to make the model more robust, rather than normalization (even when combined with data augmentation). That's also why we randomly rescaled Fourier components during our data augmentation, and avoided imitating the styles of specific tomograms by matching to their spectra.
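
For context, the kind of augmentation I mean looks roughly like this (an illustrative sketch with a random gain per frequency shell; our actual scheme may differ in the details):

```python
# Sketch: augment a training patch by randomly rescaling its Fourier
# amplitudes, shell by shell.
import torch

def random_fourier_rescale(patch: torch.Tensor, max_scale: float = 0.3) -> torch.Tensor:
    fft = torch.fft.rfftn(patch)

    # one random gain per frequency shell, in [1 - max_scale, 1 + max_scale]
    n_shells = min(patch.shape) // 2 + 1
    gains = 1.0 + max_scale * (2 * torch.rand(n_shells) - 1)

    # assign each Fourier component to a shell by frequency magnitude
    freqs = torch.stack(torch.meshgrid(
        *[torch.fft.fftfreq(s) for s in patch.shape[:-1]],
        torch.fft.rfftfreq(patch.shape[-1]),
        indexing='ij',
    ), dim=-1).norm(dim=-1)
    shells = (freqs / freqs.max() * (n_shells - 1)).long()

    return torch.fft.irfftn(fft * gains[shells], s=patch.shape)
```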

But it definitely sounds like something we could try. It would be interesting to see how it affects model performance.