tbepler / topaz

Pipeline for particle picking in cryo-electron microscopy images using convolutional neural networks trained from positive and unlabeled examples. Also featuring micrograph and tomogram denoising with DNNs.
GNU General Public License v3.0

Question about cross-validation (choosing 'n' and number of epochs) #40

Closed 3rdfloorTom closed 4 years ago

3rdfloorTom commented 4 years ago

I'm trying to find the appropriate parameters for training using the cross-validation walk-through. On a manually picked dataset (~3,000 picks), I found a clear peak in the plot of AUPRC vs. epoch for the various values of n.

However, for a dataset that I am trying to re-pick with a more even particle distribution (the DoG picks had a severe orientation bias), I use ~90,000 picks for training with r=3 on 7.2 Å/px micrographs (bin8) of a ~120 Å diameter D2-symmetric particle, and I get:

[attached plot: test AUPRC vs. epoch for several values of n]

Is this hyperbolic shape evidence that I messed something up, or that I need to tune a particular parameter?

If it helps, I expect 300-500 particles per micrograph based on manually picking a few.
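In case it's useful, here is a minimal sketch of how a plot like this can be put together from the `topaz train` output tables. I'm assuming each run was saved with `-o` and that the table has `epoch`, `split`, and `auprc` columns as in the cross-validation walk-through; the file paths and the list of n values are placeholders.

```python
# Sketch: plot test AUPRC vs. epoch for each value of n tried in the sweep.
# Assumes one tab-separated `topaz train -o` file per n, with columns
# `epoch`, `split`, and `auprc` (check your own output header).
import pandas as pd
import matplotlib.pyplot as plt

n_values = [250, 300, 400, 500]  # expected-particles-per-micrograph settings tried

fig, ax = plt.subplots()
for n in n_values:
    df = pd.read_csv(f"saved_models/n{n}_training.txt", sep="\t")
    test = df[df["split"] == "test"]
    # average AUPRC per epoch in case there are multiple test rows per epoch
    curve = test.groupby("epoch")["auprc"].mean()
    ax.plot(curve.index, curve.values, label=f"n={n}")

ax.set_xlabel("epoch")
ax.set_ylabel("test AUPRC")
ax.legend()
fig.savefig("auprc_vs_epoch.png", dpi=150)
```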

tbepler commented 4 years ago

This behaviour seems perfectly normal, even good, to me. Given 90,000 particles, I wouldn't expect the model to overfit without significantly increasing its capacity, and even then you would need to train for a long time. What I conclude from this plot is that you could keep training past 10 epochs and continue to see improvement in the model. Also, n=250 is best. One explanation may be that you have a subset of micrographs with many fewer particles (making your 300-500 particle estimate too high). I would try even smaller values of n to find the optimal setting (see the sketch below).
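As a rough sketch of what that sweep over smaller n could look like (the flag names are the ones I recall from recent releases and the paths, n values, and epoch count are placeholders, so check `topaz train --help` against your install):

```python
# Sketch: re-run training for several smaller values of n, training longer
# than the default 10 epochs, and save one metrics table per run for plotting.
import subprocess

for n in [100, 150, 200, 250]:
    subprocess.run([
        "topaz", "train",
        "--num-particles", str(n),                 # expected particles per micrograph
        "--num-epochs", "20",                      # illustrative; longer than the default
        "--train-images", "processed/micrographs/",
        "--train-targets", "particles_train.txt",
        "--test-images", "processed/micrographs/",
        "--test-targets", "particles_test.txt",
        "--save-prefix", f"saved_models/n{n}",
        "-o", f"saved_models/n{n}_training.txt",
    ], check=True)
```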

Now, it isn't totally clear from your description, but if your 90k training particles are your orientation-biased DoG picks, that suggests another explanation. Because some of those particles/micrographs are used for validation, what we are measuring is the model's ability to discriminate those orientation-biased particles from the background and from the missing particles. Thus, the cross-validation might estimate that n should be smaller than your hand-picked estimate, because n is being chosen to capture only that subset of orientations. In that case, you may want to use a larger n despite the apparent performance improvement at smaller values.

Hope that helps.

3rdfloorTom commented 4 years ago

Thanks for the fast response! I picked with DoG initially and then sub-selected 2D classes to try to balance the under-represented views against the over-represented ones. I'll move forward with n=250 and also a larger n to see how the orientations pan out.

Is there a way to continue training from a prior model file if I want to try more than 10 epochs without restarting from scratch?

tbepler commented 4 years ago

That functionality exists on the 'dev' branch, but is not in an official release yet.