Closed statquant closed 1 year ago
Is this from CRAN or Github?
It looks like a port binding failure that returns NA instead of retrying/raising an error. I'll need to look at the exact implementation (i.e., version) to see how this is possible.
I have the same issue with an SGE scheduler, although I do not have to wait long for the session to stop. I usually resubmit the job until it works. Like @statquant, I can't reproduce it, but it tends to occur when I am sending big data.
Also thanks for the great package :)
I also just ran into this with LSF, R 4.1.0 and clustermq_0.8.95.1 off CRAN - might be a good idea to check if the port is NA and fail if it is?
Ran into this recently again... pretty hard to reproduce. @mschubert since the NA comes from casting the port to integer (as.integer(sub(".*:", "", private$master))), would it be possible to add something along the lines of
if (is.na(private$port)) {
    stop("Port is NA, aborting, address was: ", private$master)
}
after line 33 in qsys.r on master (https://github.com/mschubert/clustermq/blob/master/R/qsys.r#L33) or after line 21 on develop (https://github.com/mschubert/clustermq/blob/develop/R/qsys.r#L20). This would prevent R from hanging and hopefully give us an error message that gets closer to the root cause.
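To make the failure mode concrete, here is a minimal sketch of the cast and the proposed guard outside of clustermq. The helper name parse_port() and the example addresses are assumptions for illustration only; the package itself stores these values in private$master and private$port.

```r
# parse_port() is a hypothetical helper, not part of clustermq.
# It mirrors the cast discussed above and fails loudly instead of returning NA.
parse_port <- function(address) {
  # Strip everything up to the last colon, then cast the remainder to integer
  port <- suppressWarnings(as.integer(sub(".*:", "", address)))
  if (length(port) != 1 || is.na(port)) {
    stop("Port is NA, aborting, address was: ", address)
  }
  port
}

# A well-formed address yields the port number
ok <- parse_port("tcp://node01:6123")

# An empty address (as in a failed bind) raises an error instead of hanging later
res <- tryCatch(parse_port(""), error = function(e) conditionMessage(e))
```

With a guard like this, a failed bind surfaces immediately as an R error that includes the offending address, rather than as an NA port that is only noticed when the session hangs.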
Regarding the port selection: what are your thoughts on making this configurable (at the moment the range of 6000:9999 is hard-coded in util.r in master at https://github.com/mschubert/clustermq/blob/master/R/util.r#L35)? Sampling some currently free ports might be more robust (e.g. using parallelly::freePort, see https://parallelly.futureverse.org/reference/freePort.html) than sampling 100 ports from a fixed port range without checking whether these ports are free?
I can confirm that port NA is generated when all sampled ports are in use (e.g. test by overriding the host method in the package):
library(clustermq)
cmq <- asNamespace("clustermq")
unlockBinding("host", cmq)
cmq$host <- function(
    node=getOption("clustermq.host", Sys.info()["nodename"]),
    ports=32781, # This port is in use
    n=1
) {
    utils::head(sample(sprintf("tcp://%s:%i", node, ports)), n)
}
fx = function(x) {
    tibble(x = x)
}
Q(fx, x=1:3, n_jobs=3, pkgs = c("tidyverse") )
With the modified error handling, this yields
> Q(fx, x=1:3, n_jobs=3, pkgs = c("tidyverse") )
Error in super$initialize(..., template = template) : Port is NA, aborting, address was:
This apparently also happens when only the first port in the list is in use; the others are not checked (!):
cmq$host <- function(
    node=getOption("clustermq.host", Sys.info()["nodename"]),
    ports=32781:38000, # The first port (32781) is in use
    n=1
) {
    sprintf("tcp://%s:%i", node, ports)
}
> Q(fx, x=1:3, n_jobs=3, pkgs = c("tidyverse") )
Error in super$initialize(..., template = template) : Port is NA, aborting, address was:
@mschubert a suggestion for host() could be:
host <- function(
    node=getOption("clustermq.host", Sys.info()["nodename"]),
    ports=getOption("clustermq.portRange", 1024:65535),
    n=20
) {
    # Collect n distinct free ports; NA marks a draw where none was found
    free_ports <- rep(NA_integer_, n)
    for (i in seq_len(n)) {
        free_ports[i] <- parallelly::freePort(ports, default = NA)
        ports <- setdiff(ports, free_ports[i])
    }
    if (anyNA(free_ports)) {
        stop("Free ports must not be NA")
    }
    sprintf("tcp://%s:%i", node, free_ports)
}
Is it possible that https://github.com/mschubert/clustermq/blob/master/src/CMQMaster.cpp#L19 has a bug:
std::string listen(Rcpp::CharacterVector addrs) {
    int i;
    for (i=0; i<addrs.length(); i++) {
        auto addr = Rcpp::as<std::string>(addrs[i]);
        try {
            sock.bind(addr);
        } catch(zmq::error_t const &e) {
            if (errno != EADDRINUSE)
                Rf_error(e.what());
        }
        return sock.get(zmq::sockopt::last_endpoint);
    }
    Rf_error("Could not bind port after ", i, " tries");
}
Shouldn't this read as follows (note the return statement location):
std::string listen(Rcpp::CharacterVector addrs) {
    int i;
    for (i=0; i<addrs.length(); i++) {
        auto addr = Rcpp::as<std::string>(addrs[i]);
        try {
            sock.bind(addr);
            return sock.get(zmq::sockopt::last_endpoint);
        } catch(zmq::error_t const &e) {
            if (errno != EADDRINUSE)
                Rf_error(e.what());
        }
    }
    Rf_error("Could not bind port after ", i, " tries");
}
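The effect of the return placement can be reproduced without ZeroMQ. The following is a rough R simulation of the two control flows, not the package's code: make_socket(), listen_buggy(), listen_fixed() and port_of() are hypothetical names, and the fake socket is assumed to report an empty endpoint when nothing has been bound, which would be consistent with the empty address in the error message above.

```r
# Hypothetical stand-in for a socket: remembers the last address it bound to.
# Assumption: an unbound socket reports "" as its last endpoint.
make_socket <- function(in_use) {
  endpoint <- ""
  list(
    bind = function(addr) {
      if (addr %in% in_use) {
        # Signal a classed condition, playing the role of EADDRINUSE
        stop(structure(
          list(message = paste("Address already in use:", addr), call = NULL),
          class = c("addr_in_use", "error", "condition")
        ))
      }
      endpoint <<- addr
      invisible(TRUE)
    },
    last_endpoint = function() endpoint
  )
}

# Return placed after the try block: the loop is left on the first iteration,
# whether or not the bind succeeded.
listen_buggy <- function(sock, addrs) {
  for (addr in addrs) {
    tryCatch(sock$bind(addr), addr_in_use = function(e) NULL)
    return(sock$last_endpoint())
  }
  stop("Could not bind port after ", length(addrs), " tries")
}

# Return reached only after a successful bind: busy addresses are skipped.
listen_fixed <- function(sock, addrs) {
  for (addr in addrs) {
    bound <- tryCatch({
      sock$bind(addr)
      TRUE
    }, addr_in_use = function(e) FALSE)
    if (bound) return(sock$last_endpoint())
  }
  stop("Could not bind port after ", length(addrs), " tries")
}

# Same cast as discussed earlier in the thread
port_of <- function(addr) suppressWarnings(as.integer(sub(".*:", "", addr)))

addrs <- c("tcp://node:6000", "tcp://node:6001")
buggy <- listen_buggy(make_socket(in_use = "tcp://node:6000"), addrs)
fixed <- listen_fixed(make_socket(in_use = "tcp://node:6000"), addrs)

print(port_of(buggy))  # NA: the second address was never tried
print(port_of(fixed))  # 6001
```

Under these assumptions, the buggy variant hands back an empty endpoint as soon as the first address is busy, and the integer cast then turns it into the NA port reported in this issue.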
Great catch @luwidmer, that return statement indeed looks off!
Note that it's fixed in develop, but happy to merge a PR if you don't want to wait for that
@mschubert awesome, thanks! I patched this in the CRAN version for me, I'd be happy to wait for develop / the next version to hit CRAN. I use clustermq a lot 👍
What do you think of using parallelly in host to pre-populate the list with some ports that (should) be free (barring a race condition where something else is grabbing a bunch of ports between the R part and the C++ call), and making the port range a package option as in https://github.com/mschubert/clustermq/issues/270#issuecomment-1480935856 ?
Making the port range configurable via an option makes sense, but I'm not sure I see the advantage of using parallelly::freePort?
Indeed... I suppose one could also just pass the entire port range into the C++ without pre-scanning in R! Thanks!
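As a rough illustration of where this discussion ends up, the sketch below builds candidate addresses from a configurable range without pre-scanning and leaves it to the binding code to skip ports that are in use. The function name host_sketch() is hypothetical, the option name clustermq.portRange is only the one proposed earlier in this thread (not a confirmed package option), and the defaults of 6000:9999 and 100 candidates are taken from the numbers quoted above.

```r
# host_sketch() is a hypothetical illustration, not clustermq's host().
# "clustermq.portRange" is the option name proposed in this thread and may
# not exist in any released version of the package.
host_sketch <- function(
    node = getOption("clustermq.host", Sys.info()[["nodename"]]),
    ports = getOption("clustermq.portRange", 6000:9999),
    n = 100
) {
  ports <- as.integer(ports)
  stopifnot(length(ports) > 0, !anyNA(ports))
  # Index with sample.int() because sample(x) on a single number
  # would draw from 1:x instead of returning x
  picked <- ports[sample.int(length(ports), min(n, length(ports)))]
  # No pre-scan: the caller is expected to try these in order until one binds
  sprintf("tcp://%s:%i", node, picked)
}

# Example: three candidate addresses from a small custom range
candidates <- host_sketch(node = "localhost", ports = 7000:7004, n = 3)
```

This only helps if the binding loop really does move on to the next candidate when a port is busy, which is the behavior of the corrected listen() shown earlier.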
Hello, for reasons that elude me and that I cannot reproduce (yeah, that's not much to go on), sometimes I cannot send jobs to my slurm grid. What happens is I see
and then R just gets stuck until I send an interrupt in the terminal. Then I have to wait really long to get an error and the terminal back (which might be a bug in itself?). Given this NA, I was wondering if there is not something that can be done. Note that when I rerun the same command later on, all works well. Many thanks for the package, it's great, and sorry for this unhelpful issue.