Closed delarosatrevin closed 3 years ago
Thanks @delarosatrevin. In both cases the number of coordinates seems too low, or are they big particles?
Sorry, I meant 170 mics (removed extra 0 now)
I think we recently added the option to take the box size from cryolo. For that we need a first batch... maybe the parallelization does not work with this approach now.
Ok, so 7k seems reasonable? and 3k seems a bug?
I would say 7k is better, but independently of that, the number of threads/GPUs should not change the results, only speed it up. Output coordinates should be the same for both cases.
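One way to check that expectation is to compare per-micrograph coordinate counts from the two runs. The sketch below assumes the counts have already been collected into plain dictionaries (micrograph name to number of coordinates); `compare_runs` and the dictionary layout are hypothetical and not part of Scipion or the sphire plugin.

```python
def compare_runs(counts_a, counts_b):
    """Return {micrograph: (count_a, count_b)} for every mismatch.

    counts_a / counts_b: hypothetical dicts mapping a micrograph name to
    the number of picked coordinates in each run. A micrograph missing
    from one run is treated as having 0 coordinates there.
    """
    diffs = {}
    for mic in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(mic, 0)
        b = counts_b.get(mic, 0)
        if a != b:
            diffs[mic] = (a, b)
    return diffs
```

If threads and GPUs only affect speed, `compare_runs(single_gpu_counts, multi_gpu_counts)` should return an empty dictionary; any entry of the form `(n, 0)` points at a micrograph whose coordinates were lost in the second run.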
I agree, just asking to understand the case.
@pconesa @JorMaister This was reported almost one month ago. Any progress on this? cryolo picking is one of the most popular pickers right now, and in Scipion it is impossible to use more than one GPU, which is a big limitation for a real project. I have found that the protocol also fails sometimes. My guess is that when using several threads, they all use the same folder for the output and there are race conditions with the files. I guess that in some cases coordinate files go missing, and in others the protocol just fails.
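To illustrate the guess above (it is only a guess; where the race actually happens is not confirmed), here is a minimal sketch of the general pattern that avoids this kind of problem: each batch writes to its own directory, and the merge into the shared folder is serialized with a lock. `merge_batch_output` and `_merge_lock` are hypothetical names and this is not the plugin's code.

```python
import shutil
import threading
from pathlib import Path

# Hypothetical module-level lock shared by all picking threads.
_merge_lock = threading.Lock()


def merge_batch_output(batch_dir, shared_dir):
    """Move every file from a per-batch folder into the shared folder.

    Only one thread merges at a time, and a name clash raises an error
    instead of silently overwriting another batch's file.
    Returns the sorted list of file names that were moved.
    """
    batch_dir, shared_dir = Path(batch_dir), Path(shared_dir)
    moved = []
    with _merge_lock:
        shared_dir.mkdir(parents=True, exist_ok=True)
        for src in sorted(batch_dir.iterdir()):
            if not src.is_file():
                continue
            dst = shared_dir / src.name
            if dst.exists():
                raise FileExistsError(f"{dst} already written by another batch")
            shutil.move(str(src), str(dst))
            moved.append(src.name)
    return moved
```

The important part is that no two batches ever share a working directory; the lock only protects the final merge step.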
There hasn't been any progress on this. We are focused on tomography. We will try to prioritize this.
@delarosatrevin I can confirm your findings: with 3 GPUs, 4 threads and number of CPUs = 1 or 4, I do get mics with 0 particles, even though the protocol does not report a failure. This is probably where it races:
00156: mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_1-3/CBOX/* Runs/000215_SphireProtCRYOLOPicking/tmp/outputCBOX/
00157: mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_1-3/DISTR/* Runs/000215_SphireProtCRYOLOPicking/extra/outputDISTR/
00158: mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_4-6/CBOX/* Runs/000215_SphireProtCRYOLOPicking/tmp/outputCBOX/
00159: mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_4-6/DISTR/* Runs/000215_SphireProtCRYOLOPicking/extra/outputDISTR/
00160: mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_7-9/CBOX/* Runs/000215_SphireProtCRYOLOPicking/tmp/outputCBOX/
00161: mv Runs/000215_SphireProtCRYOLOPicking/tmp/micrographs_7-9/DISTR/* Runs/000215_SphireProtCRYOLOPicking/extra/outputDISTR/
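A quick way to confirm whether files go missing around these `mv` steps is to compare the list of input micrographs against what actually ends up in the shared output folder. The sketch below assumes one coordinate file per micrograph, named after the micrograph with a `.cbox` extension; that naming and the helper `find_missing_outputs` are assumptions made for illustration.

```python
from pathlib import Path


def find_missing_outputs(mic_names, out_dir, ext=".cbox"):
    """Return base names of micrographs with no coordinate file in out_dir.

    mic_names: iterable of micrograph file names or paths.
    out_dir:   folder expected to hold one '<micrograph base name><ext>'
               file per micrograph (an assumed naming convention).
    """
    present = {p.stem for p in Path(out_dir).glob(f"*{ext}")}
    stems = (Path(m).stem for m in mic_names)
    return sorted(s for s in stems if s not in present)
```

Running this against `tmp/outputCBOX/` after each batch finishes would show whether the micrographs with 0 particles are the ones whose files never arrived in the shared folder.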
I found a very strange behavior when using cryolo picking.
pyworkflow: 3.0.5 plugin: sphire v: 3.0.4
I was running relion - motioncor2, with its streaming output used as input for cryolo picking.
The first issue was that when running with 5 threads, 4 GPUs (0 1 2 3) and 4 cores per process, I had many micrographs with 0 coordinates. So, for an input of approx. 170 mics, there were only 3k output coordinates (batch size of 16 mics).
Then, if I run the same job with the same parameters, except with only 1 thread and GPU 0, all micrographs get picked and the output set of coordinates for the same number of input mics is about 7k coordinates!
@pconesa @azazellochg This is really bad for picking on multi-GPU machines (either local or cluster), and I'm worried about whether this affects only cryolo or also the base picking class with threads, streaming and batches.