quantixed / TrackMateR

Analysis of TrackMate XML outputs in R
https://quantixed.github.io/TrackMateR/

TYPO: Parallelization does not happen on MS Windows #11

Closed: HenrikBengtsson closed this issue 1 year ago

HenrikBengtsson commented 1 year ago

In:

https://github.com/quantixed/TrackMateR/blob/d6b981f00c4024d61ffae183b47aaf5a78fb8d2d/R/readTrackMateXML.r#L61

you're using %do%. That will run sequentially, regardless of your registerDoParallel() settings. You want to use %dopar%.
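For illustration, a minimal sketch of the difference (my own example, not code from TrackMateR):

library(foreach)
library(doParallel)

registerDoParallel(2)  # register a parallel backend with two workers

# %do% evaluates the iterations sequentially in the current session,
# ignoring the registered backend:
x <- foreach(i = 1:4, .combine = c) %do% { i^2 }

# %dopar% dispatches the iterations to the registered workers:
y <- foreach(i = 1:4, .combine = c) %dopar% { i^2 }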

quantixed commented 1 year ago

No parallelisation for Windows users is intentional. I will leave it like this and keep the issue open. See:

https://github.com/quantixed/TrackMateR/issues/9#issuecomment-1504858461

HenrikBengtsson commented 1 year ago

[...] on MS Windows, I assume.

Yes, foreach(...) %do% {...} works but foreach(...) %dopar% {...} doesn't. Mac is working fine. Windows gives:

Error in unserialize(socklist[[n]]) : error reading from connection
Error in serialize(data, node$con) : error writing to connection

I tried %dofuture%; it also exits with an error, which is at least more verbose (thank you @HenrikBengtsson!):

Error in unserialize(node$con) :
MultisessionFuture (doFuture2-1) failed to receive results from cluster RichSOCKnode #1 (PID 20064 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. Detected a non-exportable reference (‘externalptr’ of class ‘XMLInternalElementNode’) in one of the globals (‘subdoc’ of class ‘XMLNodeSet’) used in the future expression. The total size of the 3 globals exported is 10.28 MiB. There are three globals: ‘subdoc’ (10.28 MiB of class ‘list’), ‘attrName’ (1.79 KiB of class ‘character’) and ‘...future.x_ii’ (168 bytes of class ‘list’)

This is because your subdoc object in:

https://github.com/quantixed/TrackMateR/blob/d6b981f00c4024d61ffae183b47aaf5a78fb8d2d/R/readTrackMateXML.r#L62

is an object of class XMLNodeSet that contains XMLInternalElementNode objects, which cannot be exported to another R process. If you try to use them in the other process, they crash that other R process (which is very harsh). You can read about this in https://future.futureverse.org/articles/future-4-non-exportable-objects.html#package-xml, which also gives two workarounds.
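As an aside, future can be told to scan globals for such references up front; a one-line sketch (the option is described in the same futureverse documentation):

# Make future raise an informative error in the main session when a global
# holds a non-exportable reference (e.g. an 'externalptr'), instead of
# letting the worker crash:
options(future.globals.onReference = "error")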

I don't see how the non-exportable reference could cause the connection(s) to die because it all runs fine on Mac.

It "works fine", because forked parallel processing is used (which is supported on Linux and macOS, but not MS Windows). I put "works fine" in quotes, because I'm not 100% sure it is safe to do so, even if you don't experience crashes. This is something that the maintainer of XML would have to confirm.

From the above vignette, the first workaround would be to do:

    # serialize the XML nodes so that they can be exported to the workers
    subdoc_m <- lapply(subdoc, FUN = xmlSerializeHook)
    dtf <- as.data.frame(foreach(name = attrName, .combine = cbind) %dofuture% {
      # rebuild the nodes inside the worker before querying them
      subdoc <- lapply(subdoc_m, FUN = xmlDeserializeHook)
      sapply(subdoc, xmlGetAttr, name)
    })
    rm(subdoc_m)

That would work regardless of platform and plan().
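For instance, a sketch of picking a plan per platform (my illustration; parallelly::supportsMulticore() reports whether forked processing is available):

library(future)

# Forked workers where supported (Linux, macOS); background R sessions
# (which require exportable globals) on MS Windows:
if (parallelly::supportsMulticore()) {
  plan(multicore, workers = numCores)
} else {
  plan(multisession, workers = numCores)
}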

HenrikBengtsson commented 1 year ago

That said, subdoc is probably a much larger object than attrName. In your current implementation, you end up exporting the whole subdoc to all parallel workers, which is unnecessary. I would flip it around and do something like:

subdoc_m <- lapply(subdoc, FUN = xmlSerializeHook)
res <- foreach(el_m = subdoc_m, .combine = cbind) %dofuture% {
  # each worker rebuilds only the single node it was given
  el <- xmlDeserializeHook(el_m)
  sapply(attrName, FUN = function(name) xmlGetAttr(el, name))
}
rm(subdoc_m)

This would export exactly the subset of subdoc[_m] that each worker needs - not more, not less.
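One usage note (my addition): with .combine = cbind, each worker's named attribute vector becomes a column, so res ends up with attributes as rows and spots as columns; transpose it (or combine with rbind, as in the code below) to get one row per spot:

# one row per spot, one column per attribute
dtf <- as.data.frame(t(res))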

quantixed commented 1 year ago

I like the idea of using {future} and finally found some time to rewrite the core function to take advantage of this framework. In my testing, on macOS and on Windows, the original version is faster. The difference is significant on a 20-core Mac Studio (15 s vs 3 s) and smaller on an 8-core Windows box (51 s vs 47 s). I'm not certain it's working correctly, because I wasn't expecting such a slowdown on macOS.

Mac - original version

> ptm <- proc.time()
> # perform parallel read
> # test if we are running on windows
> for(i in 1:5) {
+ if (.Platform[["OS.type"]] == "windows") {
+   cat("Collecting spot data...\n")
+   dtf <- as.data.frame(foreach(i = 1:length(attrName), .packages = c("foreach","XML"), .combine = cbind) %do% {
+     sapply(subdoc, xmlGetAttr, attrName[i])
+   })
+ } else {
+   cat(paste0("Collecting spot data. Using ",numCores," cores\n"))
+   dtf <- as.data.frame(foreach(i = 1:length(attrName), .combine = cbind) %dopar% {
+     sapply(subdoc, xmlGetAttr, attrName[i])
+   })
+ }
+   rm(dtf)
+ }
Collecting spot data. Using 20 cores
Collecting spot data. Using 20 cores
Collecting spot data. Using 20 cores
Collecting spot data. Using 20 cores
Collecting spot data. Using 20 cores
> proc.time() - ptm
   user  system elapsed 
 27.863  16.940   3.106 

Mac - new version

> plan(multisession, workers = numCores)
> ptm <- proc.time()
> for(i in 1:5) {
+ cat(paste0("Collecting spot data. Using ",numCores," cores\n"))
+ subdoc_m <- lapply(subdoc, FUN = XML::xmlSerializeHook)
+ dtf <- as.data.frame(foreach(el_m = subdoc_m, .combine = rbind) %dofuture% {
+   el <- XML::xmlDeserializeHook(el_m)
+   sapply(attrName, FUN = function(name) xmlGetAttr(el, name))
+ })
+ rm(subdoc_m)
+ rm(dtf)
+ }
Collecting spot data. Using 20 cores
Collecting spot data. Using 20 cores
Collecting spot data. Using 20 cores
Collecting spot data. Using 20 cores
Collecting spot data. Using 20 cores
> proc.time() - ptm
   user  system elapsed 
 12.138   0.723  14.701 

on Windows - original code

> ptm <- proc.time()
> # perform parallel read
> # test if we are running on windows
> for(i in 1:5) {
+   if (.Platform[["OS.type"]] == "windows") {
+     cat("Collecting spot data...\n")
+     dtf <- as.data.frame(foreach(i = 1:length(attrName), .packages = c("foreach","XML"), .combine = cbind) %do% {
+       sapply(subdoc, xmlGetAttr, attrName[i])
+     })
+   } else {
+     cat(paste0("Collecting spot data. Using ",numCores," cores\n"))
+     dtf <- as.data.frame(foreach(i = 1:length(attrName), .combine = cbind) %dopar% {
+       sapply(subdoc, xmlGetAttr, attrName[i])
+     })
+   }
+   rm(dtf)
+ }
Collecting spot data...
Collecting spot data...
Collecting spot data...
Collecting spot data...
Collecting spot data...
> proc.time() - ptm
   user  system elapsed 
  45.72    1.47   47.20 

on Windows - new code

> plan(multisession)
> ptm <- proc.time()
> for(i in 1:5) {
+   cat(paste0("Collecting spot data. Using ",numCores," cores\n"))
+   subdoc_m <- lapply(subdoc, FUN = XML::xmlSerializeHook)
+   dtf <- as.data.frame(foreach(el_m = subdoc_m, .combine = rbind) %dofuture% {
+     el <- XML::xmlDeserializeHook(el_m)
+     sapply(attrName, FUN = function(name) xmlGetAttr(el, name))
+   })
+   rm(subdoc_m)
+   rm(dtf)
+ }
Collecting spot data. Using 8 cores
Collecting spot data. Using 8 cores
Collecting spot data. Using 8 cores
Collecting spot data. Using 8 cores
Collecting spot data. Using 8 cores
> proc.time() - ptm
   user  system elapsed 
  28.57    1.06   51.22 

HenrikBengtsson commented 1 year ago

Hello. In your original macOS code, you used:

registerDoParallel(numCores)

which launches forked parallel workers. To get forked workers when using futures, use:

plan(multicore, workers = numCores)

and compare with:

dtf <- as.data.frame(foreach(i = 1:length(attrName), .combine = cbind) %dofuture% {
   sapply(subdoc, xmlGetAttr, attrName[i])
})

That will give you a fairer comparison. (Yes, that sweeps any potential problem from using forked processing, as mentioned in https://github.com/quantixed/TrackMateR/issues/11#issuecomment-1505514406, under the rug.)

If you want to compare using non-forked parallelization, then use:

cl <- makeCluster(numCores)
on.exit(stopCluster(cl))
registerDoParallel(cl)

in your "original" macOS code and compare it to your new macOS code (as you have above). Note that this solution requires xmlSerializeHook()/xmlDeserializeHook().

Regarding the performance findings for MS Windows (and most likely also macOS): it could very well be that the overhead of xmlSerializeHook(), sending the objects over, and then xmlDeserializeHook() on the other end is greater than the gain from running in parallel.
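A quick way to gauge that overhead in isolation (a sketch using the same subdoc as above, with no parallelism involved):

# Time just the serialize/deserialize round trip; if this alone approaches
# the sequential run time, the parallel workers cannot win:
system.time({
  subdoc_m <- lapply(subdoc, FUN = XML::xmlSerializeHook)
  subdoc_r <- lapply(subdoc_m, FUN = XML::xmlDeserializeHook)
})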

quantixed commented 1 year ago

Hi Henrik, thank you for the explanation.

I have retested using your suggested modifications. I now get a fairer comparison. Also, your suspicion about xmlSerializeHook() and xmlDeserializeHook() is correct. This is definitely slowing things down.

I think my best option is to leave things as they were and close this issue. Thanks for all your help and input.