Closed jibanes closed 4 years ago
It looks like you're working on a cluster (the hostnames option was being used). Do you get the error if you keep it local to one machine? It might be something with the ssh connections timing out.
Brian,
I will do a run without hostnames(), stay tuned.
Brian,
Same error: "No dataset for instance 0002."
Same script as above, but I removed hostnames() from the parallel command and left ssh() and procexec():
parallel initialize 36, f statapath(XXX) ssh("ssh -o 'StrictHostKeyChecking no' -q") procexec(2)
where XXX is the location of the binary.
It failed after a few dozen successful runs (~30-45 minutes).
Can you remove the ssh as well? (Also, procexec is only for Windows so you can remove that). I don't think this is your issue, but on some clusters the tmp space is cleared out periodically, so I've had to start Stata with a temp directory that was local to my username. Also can you try to view the log from the failed subprocess?
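A minimal sketch of the two checks suggested above, assuming it is run in the same session right after the failing parallel call. `c(tmpdir)` reports the temporary directory the session is using; on Unix, Stata reads it from the STATATMP environment variable at startup, so it has to be set before launching Stata rather than from inside the session. `parallel printlog #` is the command the parallel output itself points to for viewing a child's log.

```stata
* Show which temporary directory this Stata session is using
display "`c(tmpdir)'"

* View the log of child process 2 (the instance named in the error)
parallel printlog 2
```

If the directory shown is a shared scratch area that is cleaned periodically, restarting Stata with STATATMP pointing at a user-owned directory is one way to rule that out.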
Brian,
Same error "No dataset for instance 0002." with: parallel initialize 36, f statapath(XXX)
It took roughly the same number of successful tries before it failed.
I have not noticed any error in the corresponding log files (which correspond to the error) see below.
__pllfaczc1j119_do0001.log __pllfaczc1j119_do0036.log __pllfaczc1j119_do0035.log __pllfaczc1j119_do0034.log __pllfaczc1j119_do0033.log __pllfaczc1j119_do0032.log __pllfaczc1j119_do0031.log __pllfaczc1j119_do0030.log __pllfaczc1j119_do0029.log __pllfaczc1j119_do0027.log __pllfaczc1j119_do0026.log __pllfaczc1j119_do0028.log __pllfaczc1j119_do0025.log __pllfaczc1j119_do0024.log __pllfaczc1j119_do0023.log __pllfaczc1j119_do0022.log __pllfaczc1j119_do0021.log __pllfaczc1j119_do0020.log __pllfaczc1j119_do0019.log __pllfaczc1j119_do0018.log __pllfaczc1j119_do0017.log __pllfaczc1j119_do0016.log __pllfaczc1j119_do0015.log __pllfaczc1j119_do0014.log __pllfaczc1j119_do0013.log __pllfaczc1j119_do0012.log __pllfaczc1j119_do0011.log __pllfaczc1j119_do0010.log __pllfaczc1j119_do0009.log __pllfaczc1j119_do0008.log __pllfaczc1j119_do0006.log __pllfaczc1j119_do0005.log __pllfaczc1j119_do0004.log __pllfaczc1j119_do0007.log __pllfaczc1j119_do0003.log __pllfaczc1j119_do0002.log
Contains all dta, do, sh files.
Note: I replaced the paths with "XXX" manually from the output below.
. parallel initialize 36, f statapath(XXX)
N Child processes: 36
Stata dir: XXX
. parallel, prog(parfor): parfor y_pll
--------------------------------------------------------------------------------
Exporting the following program(s): parfor
parfor:
1. args var
2. di "`c(hostname)'"
3. di "`=_N' obs"
4. forval i=1/`=_N' {
5. qui replace `var' = sqrt(x) in `i'
6. replace hostname = "`c(hostname)'"
7. }
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Parallel Computing with Stata
Child processes: 36
pll_id : hr8undy119
Running at : XXX
Randtype : datetime
Waiting for the child processes to finish...
child process 0001 has exited without error...
child process 0002 has exited without error...
child process 0003 has exited without error...
child process 0004 has exited without error...
child process 0005 has exited without error...
child process 0006 has exited without error...
child process 0007 has exited without error...
child process 0008 has exited without error...
child process 0009 has exited without error...
child process 0010 has exited without error...
child process 0011 has exited without error...
child process 0012 has exited without error...
child process 0013 has exited without error...
child process 0014 has exited without error...
child process 0015 has exited without error...
child process 0016 has exited without error...
child process 0017 has exited without error...
child process 0018 has exited without error...
child process 0019 has exited without error...
child process 0020 has exited without error...
child process 0021 has exited without error...
child process 0022 has exited without error...
child process 0023 has exited without error...
child process 0024 has exited without error...
child process 0025 has exited without error...
child process 0026 has exited without error...
child process 0027 has exited without error...
child process 0028 has exited without error...
child process 0029 has exited without error...
child process 0030 has exited without error...
child process 0031 has exited without error...
child process 0032 has exited without error...
child process 0033 has exited without error...
child process 0034 has exited without error...
child process 0035 has exited without error...
child process 0036 has exited without error...
--------------------------------------------------------------------------------
Enter -parallel printlog #- to checkout logfiles.
--------------------------------------------------------------------------------
No dataset for instance 0002.
r(601);
[...]
I've made an interesting discovery; and explored a few options.
First, it doesn't look like file-descriptor exhaustion; I had wondered whether that was preventing the append. After repeated tests I've noticed that it always fails at the 60th try. My guess is that this is how deep a recursion (nested do operations) can go: if you look at the parallel.do script attached above, you will see that it calls itself, and 60 levels must either exhaust a local resource or hit the maximum level of nested operations.
As such, I have repeatedly called parallel.do without the recursion 1000 times independently, not a single one failed; but they fail repeatedly after 60 tries using nested calls.
Brian, does this sound like a possibility, that the recursion in the do script would cause an append function to fail?
That sounds right. I think I've hit recursion limits before in Stata, though it's been a while. I suppose it could be tested w/o doing parallel to make sure. Seems like there might not be much we can do at our end, except note the issue.
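One way to run that test without parallel is a do-file that calls itself and prints its depth, sketched below. The file name nest_probe.do is hypothetical, and the sketch assumes the file sits in the current working directory. It runs until Stata refuses to nest further, so the last depth printed shows where the limit is on that installation; `help limits` lists the documented cap on nested do-files for comparison.

```stata
* nest_probe.do (hypothetical file name): calls itself and reports the depth
args depth
if "`depth'" == "" {
    local depth 1
}
display "nesting depth: `depth'"

* Recurse one level deeper; Stata stops with an error at its nesting limit
local next = `depth' + 1
do nest_probe.do `next'
```

Start it with `do nest_probe.do`. If it stops at roughly the same depth as the failing runs, the nesting limit is the likely cause, and driving the repeated runs from a `forvalues` loop in a separate do-file avoids the nesting entirely.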
I agree. Before I close the issue, do you see any issues having twice (or more) the same hostname in the hostnames() argument, in order to balance the payload on multiple machines of different speeds in a more efficient manner? I've done some testing and it looks fine. i.e. hostnames("a a a b c c c c") assuming here that a has 3x more cores than b, and c has 4x more cores than b for instance.
Having duplicate hostnames should be fine. That's a good use for them.
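As a sketch of that weighting idea, using only the options that appear in this thread: the host names a, b and c are the placeholders from the example above, XXX stands for the Stata binary location as elsewhere in the thread, and the child count of 8 is an assumption chosen to match the eight hostname entries.

```stata
* Repeat each host in proportion to its relative core count:
* a has 3x the cores of b, c has 4x the cores of b (placeholder names)
local hosts "a a a b c c c c"

* XXX is a placeholder for the location of the Stata binary
parallel initialize 8, f statapath(XXX) hostnames("`hosts'") ssh("ssh -o 'StrictHostKeyChecking no' -q")
```

Keeping the list in a local macro makes it easy to adjust the weights in one place when machines are added or removed.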
Thank you both! George G. Vega Yon +1 (626) 381 8171 https://ggvy.cl
On Wed, Apr 29, 2020 at 11:26 AM Brian Quistorff notifications@github.com wrote:
Closed #83 https://github.com/gvegayon/parallel/issues/83.
Preliminaries
Before submitting an issue, please check (with x in brackets) that you:
Expected behavior and actual behavior
Described what you expected to see and what you actually see
I'm running the attached Stata do file in a loop. It fails after many iterations, typically after 1-2 hours, with the error message "No dataset for instance 0002." even though I do see __pllXXX_dta0002.dta on disk (and it is use/append-able). I've repro'd this 4 times, and every time the error message pointed to instance 0002.
Important datapoint: "it fails after many iterations". All the previous iterations, sometimes dozens, sometimes in the hundreds, SUCCEED. This is a random failure, and I have found no other way to reproduce it than having the do script call itself and waiting a few hours (typically 1-2). The NFS is used by a large number of machines with no known issue/outage (it's a commercial NFS appliance from a well-known Fortune 500 company).
See attached error below: error.txt
I've emailed the _pll files leading to the failure to @gvegayon .
Steps to reproduce the problem
Attached: the script calls itself and will typically fail within 1-2 hours. parallel.txt
System information
Some relevant information
I've repro'd using stata-mp and stata binaries.
Output from creturn list:
System values
Directories and paths
System limits
Numerical and string limits
Current dataset
Memory settings
Output settings
Interface settings
Graphics settings
Network settings
Trace (program debugging) settings
Mata settings
Java settings
putdocx settings
Python settings
RNG settings
Unicode settings
Other settings
Other
.