gvegayon / parallel

PARALLEL: Stata module for parallel computing
https://rawgit.com/gvegayon/parallel/master/ado/parallel.html
MIT License
Child process exited with error 700 when using 2 nodes #93

Open mangelett opened 3 years ago

mangelett commented 3 years ago

Expected behavior and actual behavior

I'm trying to run the parallel command on two nodes of an HPC cluster using the hostnames option of parallel initialize. When I specify the hostnames, I get the error "child process 0002 Exited with error -700- while running the command/dofile (view log)...". The log file __pll[pll_id]_do0002.log is empty.

The command works fine without the hostnames option (working only on one node).

Steps to reproduce the problem

The following code is saved in the file test_parallel.do:

parallel initialize 2, f h("localhost cn07") 
sysuse auto
parallel, by(foreign) : egen maxp = max(price)

The code is launched with the command stata test_parallel.do inside a SLURM batch file (which requests the node cn07).

gvegayon commented 3 years ago

Working with Slurm can be tricky sometimes. One key issue I've seen in the past is the nodes' access to filesystems. For parallel to work, all nodes need to have I/O access to the data and tempfiles. This issue seems to be a bug. Thanks for reporting.

mangelett commented 3 years ago

Normally, the nodes have I/O access to the data and tempfiles: the data are on a file system shared among the nodes, and I set the TMPDIR variable to a folder on this shared file system (originally to avoid saturating the node's disk space).

gvegayon commented 3 years ago

Sorry for the late reply. Can you verify that Stata recognizes the TMPDIR variable as the shared path you specified when submitting the jobs?

mangelett commented 3 years ago

The command tempfile junk; display "`junk'" prints a tempfile path that is in the shared folder I specified in the TMPDIR variable. So it seems Stata recognizes the shared path. Besides, the log files __pllul97ezlin1_do0001.log and __pllul97ezlin1_do0002.log are in this folder.
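
For reference, the check described above can be extended into a slightly fuller sketch. It uses only built-in Stata features (the `environment` macro function, `c(tmpdir)`, `tempfile`, `save`, `use`) and assumes that TMPDIR was exported before Stata started. It only shows what the Stata process on the current node sees, so it would need to be run on each node separately; it does not say anything about how parallel launches its child processes.

```stata
* Minimal sketch: inspect the temporary directory as seen by this Stata process.
* Assumption: TMPDIR was exported in the shell or batch script before Stata started.

* Value of the TMPDIR environment variable as seen by Stata
local envtmp : environment TMPDIR
display "TMPDIR (environment): `envtmp'"

* Temporary directory Stata actually uses
display "c(tmpdir):            `c(tmpdir)'"

* Path Stata generates for a new temporary file
tempfile junk
display "tempfile path:        `junk'"

* Confirm the directory is writable by saving and re-reading a small dataset
sysuse auto, clear
save "`junk'", replace
use "`junk'", clear
display "observations read back: " _N
```

If the three displayed paths point to the shared folder and the dataset is read back without error, the current node can at least write to and read from that location.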