wlandau closed this issue 6 years ago
One thing I think I can do is re-architect master/worker communication so that the master and the workers never write to the same files. See the `separate_messages` branch (`mclapply.R`), which is only a sketch so far. Essentially, the master assigns targets in one `storr` namespace, and the worker delivers them to another `storr` namespace. If the assigned target and the delivered target disagree, the worker is assumed to be running. If they agree, the worker is idle and the master can assign it another target.
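The assigned/delivered comparison described above can be sketched roughly as follows. This is only an illustration, not the code in the `separate_messages` branch: plain R environments stand in for the two `storr` namespaces, and the names `assigned`, `delivered`, `assign_target()`, `deliver_target()`, and `is_idle()` are all hypothetical.

```r
# Hypothetical stand-ins for the two storr namespaces.
assigned  <- new.env()  # written only by the master
delivered <- new.env()  # written only by the workers

# Master side: record the target handed to a worker.
assign_target <- function(worker, target) {
  assign(worker, target, envir = assigned)
}

# Worker side: report the target it has just finished.
deliver_target <- function(worker) {
  assign(worker, get(worker, envir = assigned), envir = delivered)
}

# A worker is idle when its assigned and delivered targets agree,
# including the case where nothing has been assigned yet (both NULL).
is_idle <- function(worker) {
  a <- get0(worker, envir = assigned, inherits = FALSE)
  d <- get0(worker, envir = delivered, inherits = FALSE)
  identical(a, d)
}

assign_target("worker_1", "my_target")
is_idle("worker_1")  # FALSE: assigned but not yet delivered
deliver_target("worker_1")
is_idle("worker_1")  # TRUE: the master may assign another target
```

Because each side writes to only one of the two stores, the master and a worker never write to the same record in this sketch.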
Update: the `separate_messages` branch seems to only create more of the same problems, and I am not sure I will return to it.
On the other hand, many of these quirks have to do with calls to `set_progress()`. If I suppress those warnings, as well as any potential warnings from the interprocess communication in `mclapply.R`, all the red flags go away except the "sh: 0: getcwd()" messages in the `future`-related scenarios.
I am almost certain that most of this is due to https://github.com/richfitz/storr/issues/80. Running the long tests on Windows to confirm...
If I am right, we don't need to worry so much: the data objects are small, they get written anyway, and the actual output from `drake` is just fine.
But regardless, I think we should still fix https://github.com/richfitz/storr/issues/80.
After 4590c186bec2e73049e1fefced654f46648b4095, I have seen consistently clean tests, enough to close this issue. We can probably start ramping up to the CRAN release of version 5.2.0.
TL;DR
I think this is all due to https://github.com/richfitz/storr/issues/80.
Overview
I am seeing strange results from `drake`'s long tests. I am struggling to find the root causes, and I have not been able to create a reproducible example for any of the elusive problems below. If anyone else has encountered similar issues, I would be super grateful for advice. I am delaying the next CRAN submission to see if we can resolve some of these lurking mysteries.
The good news
The long tests seem to be passing. Except for the odd edge case on Windows (such as broken parallel socket connections), `drake` is delivering the output that the unit tests expect. And with the exception of https://github.com/mllg/batchtools/issues/197, my own real-world projects run just fine with development `drake`.
A bit about drake's testing workflow
The long tests
All of `drake`'s officially supported modes of parallel computing need to function properly, and not just in the quick tests that run on CRAN. That's why I implemented the internal `test_scenarios()` function, which runs all the unit tests under each of these testing scenarios. This file loops through all those scenarios, and I run it on Linux and Windows before each CRAN submission. I also test all the modes of parallel computing on some real-world projects I keep in-house.
Temporary directories for testing
Every test runs inside a call to `test_with_dir()` rather than just `test_that()`. That way, the tests run in their own temporary directories, so we do not need to worry about cleaning up file output.
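For readers unfamiliar with the pattern, a wrapper of this kind might look roughly like the sketch below. It is not `drake`'s actual `test_with_dir()`; the helper name `with_temp_dir()` is hypothetical, and only base R is used.

```r
# Hypothetical helper: evaluate `code` inside a fresh temporary directory,
# then restore the old working directory and delete the temporary one.
with_temp_dir <- function(code) {
  dir <- tempfile(pattern = "test-")
  dir.create(dir)
  old <- setwd(dir)
  on.exit({
    setwd(old)                     # go back first ...
    unlink(dir, recursive = TRUE)  # ... then remove the scratch directory
  })
  force(code)  # lazy evaluation: `code` runs here, inside `dir`
}

with_temp_dir({
  writeLines("scratch", "output.txt")
  file.exists("output.txt")  # TRUE while inside the temporary directory
})
```

The working directory is restored before the temporary directory is removed, so nothing is left running from a directory that no longer exists.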
Strange behavior
sh: 0: getcwd() failed: No such file or directory
This one gets me on Linux systems. Apparently, it happens when one tries to execute a command from a non-existent directory.
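The general mechanism can be illustrated in isolation, although this is not a reproduction of the failures in the long tests. The sketch assumes a Unix-like system that lets a process delete its own working directory. The helper name `run_from_deleted_dir()` is hypothetical, and the exact wording of the shell's complaint (if any) depends on which shell `sh` is.

```r
# Hypothetical helper: run a shell command from a working directory
# that has just been deleted. Assumes a Unix-like system.
run_from_deleted_dir <- function(command = "true") {
  dir <- tempfile(pattern = "gone-")
  dir.create(dir)
  old <- setwd(dir)
  on.exit(setwd(old))            # always return to a directory that exists
  unlink(dir, recursive = TRUE)  # the working directory no longer exists
  # Some shells report a getcwd() failure on stderr at this point.
  system(command)
}

run_from_deleted_dir()
```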
Mystery 1's in `testthat` output
For the scenarios with up to 9 jobs/workers, several 1's appear in the `testthat` log, but no errors are actually reported.
Permissions warnings on Windows
From what I saw, some internal `storr` files are created with permissions 666.
Warnings about renaming `storr` files
These warnings are extremely rare and are not associated with any errors. All I know is that they are generated inside an internal `storr` operation.