scipion-em / scipion-em-sphire

Plugin to use Sphire programs within the Scipion framework
GNU General Public License v3.0

Different results 1 GPU vs 4 GPUs #46

Closed: delarosatrevin closed this issue 3 years ago

delarosatrevin commented 3 years ago

I found a very strange behavior when using cryolo picking.

pyworkflow: 3.0.5 plugin: sphire v: 3.0.4

I was running relion - motioncor2, and its streaming output was used as input for cryolo picking.

The first issue was that when running with 5 threads, 4 GPUs (0 1 2 3) and 4 cores per process, many micrographs ended up with 0 coordinates. So, for an input of approx. 170 mics, there were only 3k output coordinates (batch size of 16 mics).

Then, if I run the same job with the same parameters, except with only 1 thread and GPU 0, all micrographs get picked, and the output set for the same number of input mics was about 7k coordinates!

@pconesa @azazellochg This is really bad for using multi-GPU machines (either local or cluster) for picking, and I'm worried about whether this affects only cryolo or also the base picking class with threads, streaming and batches.

pconesa commented 3 years ago

Thanks @delarosatrevin. In both cases, the number of coordinates seems too low? Or are they big ones?

delarosatrevin commented 3 years ago

Sorry, I meant 170 mics (removed extra 0 now)

pconesa commented 3 years ago

I think we recently added the option to take the box size from cryolo. For that we need a first batch... maybe the parallelization does not work with this approach now.

pconesa commented 3 years ago

Ok, so 7k seems reasonable, and 3k seems to be a bug?

delarosatrevin commented 3 years ago

I would say 7k is better, but independently of that, the number of threads/GPUs should not change the results, only speed it up. Output coordinates should be the same for both cases.
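One way to check this expectation is to compare per-micrograph coordinate counts between the two runs. The following is only a minimal sketch: the function names `count_coords` and `compare_runs`, the `.box` suffix, and the assumption that each non-empty line of a coordinate file holds one coordinate are all hypothetical and would need adapting to the files the protocol actually writes.

```python
from pathlib import Path


def count_coords(folder, suffix=".box"):
    """Map each coordinate file's stem to its number of non-empty lines.

    Assumption: one coordinate per non-empty line. Adapt the parsing
    to the real coordinate file format.
    """
    counts = {}
    for path in sorted(Path(folder).glob(f"*{suffix}")):
        with open(path) as handle:
            counts[path.stem] = sum(1 for line in handle if line.strip())
    return counts


def compare_runs(counts_a, counts_b):
    """Return micrographs whose coordinate counts differ between two runs.

    A micrograph missing from one run is reported with a count of 0.
    """
    diffs = {}
    for mic in sorted(set(counts_a) | set(counts_b)):
        a, b = counts_a.get(mic, 0), counts_b.get(mic, 0)
        if a != b:
            diffs[mic] = (a, b)
    return diffs
```

If the number of threads/GPUs only affects speed, `compare_runs` should return an empty dict for the 1-GPU and 4-GPU runs; any entry with a 0 on one side points at a micrograph that lost its coordinates.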

pconesa commented 3 years ago

I agree, just asking to understand the case.

delarosatrevin commented 3 years ago

@pconesa @JorMaister This was reported almost a month ago. Any progress on this? cryolo picking is one of the most popular pickers right now, and in Scipion it is impossible to use it with more than one GPU, which is a big limitation for a real project. I have found that the protocol also fails sometimes. My guess is that when several threads are used, they share the same output folder and there are race conditions with the files. I guess that in some cases coordinate files go missing, and in others the protocol just fails.

pconesa commented 3 years ago

There hasn't been any progress on this. We are focused on tomography. We will try to prioritize this.

azazellochg commented 3 years ago

@delarosatrevin I can confirm your findings: with 3 GPUs, 4 threads and number of CPUs = 1 or 4, I do get mics with 0 particles, even though the protocol does not report a failure. This is probably where it races:

00156:   mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_1-3/CBOX/* Runs/000215_SphireProtCRYOLOPicking/tmp/outputCBOX/
00157:   mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_1-3/DISTR/* Runs/000215_SphireProtCRYOLOPicking/extra/outputDISTR/
00158:   mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_4-6/CBOX/* Runs/000215_SphireProtCRYOLOPicking/tmp/outputCBOX/
00159:   mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_4-6/DISTR/* Runs/000215_SphireProtCRYOLOPicking/extra/outputDISTR/
00160:   mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_7-9/CBOX/* Runs/000215_SphireProtCRYOLOPicking/tmp/outputCBOX/
00161:   mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_7-9/DISTR/* Runs/000215_SphireProtCRYOLOPicking/extra/outputDISTR/
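If the race really is in these moves into the shared output folders (this is only a guess in this thread, not a confirmed diagnosis), one possible mitigation is to serialize the merge step with a lock. The sketch below is not the plugin's code: `merge_batch_output` and `_merge_lock` are hypothetical names, and it assumes all picking threads live in the same Python process.

```python
import shutil
import threading
from pathlib import Path

# Hypothetical: a single lock shared by all picking threads of one run.
_merge_lock = threading.Lock()


def merge_batch_output(batch_dir, shared_dir):
    """Move every file from a per-batch folder into the shared folder.

    Only one thread at a time performs the move. Returns the names of
    the moved files so the caller can check that none is missing.
    """
    batch_dir, shared_dir = Path(batch_dir), Path(shared_dir)
    moved = []
    with _merge_lock:
        shared_dir.mkdir(parents=True, exist_ok=True)
        for src in sorted(batch_dir.iterdir()):
            if src.is_file():
                shutil.move(str(src), str(shared_dir / src.name))
                moved.append(src.name)
    return moved
```

Returning the list of moved files lets the caller compare it against the micrographs in the batch, so an empty or partial batch can be flagged instead of silently producing 0 coordinates.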