ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0
1.34k stars 129 forks source link

Race condition in storr causes strange red flags (not acual errors) in the long testing workflow #404

Closed wlandau closed 6 years ago

wlandau commented 6 years ago

TL;DR

I think this is all due to https://github.com/richfitz/storr/issues/80.

Overview

I am seeing strange results from drake's long tests. I am struggling to find the root causes, and I have not been able to create a reproducible example for any of the elusive problems below. If anyone else has encountered similar issues, I would be super grateful for advice. I am delaying the next CRAN submission to see if we can resolve some of these lurking mysteries.

The good news

The long tests seem to be passing. Except for the odd edge case on Windows (such as broken parallel socket connections) drake is delivering the output that the unit tests expect. And with the exception of https://github.com/mllg/batchtools/issues/197, my own real-world projects run just fine with development drake.

A bit about drake's testing workflow

The long tests

All of drake's officially supported modes of parallel computing need to function properly, and not just in the quick tests that run on CRAN. That's why I implemented the internal test_scenarios() function, which runs all the unit tests under each of these testing scenarios. This file loops through all those scenarios, and I run it on Linux and Windows before each CRAN submission. I also test all the modes of parallel computing on some real-world projects I keep in-house.

Temporary directories for testing

Every test runs inside a call to test_with_dir() rather than just test_that(). That way, the tests run in their own temporary directories so we do not need to worry about cleaning up file output.

The strange behavior

Strange behavior

sh: 0: getcwd() failed: No such file or directory

This one gets me on Linux systems. Apparently, it happens when one tries to execute a command from a non-existent directory.

> withr::with_options(
+   list(drake_no_processx = TRUE),
+   test_scenarios("local_future_lapply_2_future")
+ )
local_future_lapply_2_future::multisession: run
local_future_lapply_2_future::multisession: cache: ............................................................................................................................
local_future_lapply_2_future::multisession: command changes: ........
local_future_lapply_2_future::multisession: console: ................................................
local_future_lapply_2_future::multisession: custom caches: .............................................
local_future_lapply_2_future::multisession: dbi cache: .......
local_future_lapply_2_future::multisession: dependencies: .............................................
local_future_lapply_2_future::multisession: deprecation: .................................................................
local_future_lapply_2_future::multisession: edge cases: ..................................
local_future_lapply_2_future::multisession: envir: .......
local_future_lapply_2_future::multisession: examples: ...................................
local_future_lapply_2_future::multisession: expose imports: ...
local_future_lapply_2_future::multisession: full build: .......................................
local_future_lapply_2_future::multisession: future: .............
local_future_lapply_2_future::multisession: generate: ......................................
local_future_lapply_2_future::multisession: graph: .................................
local_future_lapply_2_future::multisession: hash: ...................
local_future_lapply_2_future::multisession: import file: .........
local_future_lapply_2_future::multisession: import object: ..........
local_future_lapply_2_future::multisession: intermediate file: ......
local_future_lapply_2_future::multisession: knitr: ...................................................
local_future_lapply_2_future::multisession: lazy load: .................
local_future_lapply_2_future::multisession: Makefile: ...................................................
local_future_lapply_2_future::multisession: memory cache: ................................................
local_future_lapply_2_future::multisession: migrate: ................................
local_future_lapply_2_future::multisession: mtcars example: .......................
local_future_lapply_2_future::multisession: namespaced: .........
local_future_lapply_2_future::multisession: other features: .............................................................................................
local_future_lapply_2_future::multisession: parallel: ........................................................
local_future_lapply_2_future::multisession: queue: .......................................
local_future_lapply_2_future::multisession: reproducible random numbers: ..........................
local_future_lapply_2_future::multisession: retries and timeouts: ..............
local_future_lapply_2_future::multisession: strings: ............
local_future_lapply_2_future::multisession: testing: ..........................................................
local_future_lapply_2_future::multisession: tidy eval: ....
local_future_lapply_2_future::multisession: time: .................................
local_future_lapply_2_future::multisession: triggers: ............................................
local_future_lapply_2_future::multisession: with processx: 
local_future_lapply_2_future::multisession: workflow plan: .............................................................................................

══ DONE ════════════════════════════════════════════════════════════════════════
> 
> proc.time()
   user  system elapsed 
110.828  47.159 325.274 
sh: 0: getcwd() failed: No such file or directory
sh: 0: getcwd() failed: No such file or directory
sh: 0: getcwd() failed: No such file or directory
sh: 0: getcwd() failed: No such file or directory

Mystery 1's in testthat output

For the scenarios with up to 9 jobs/workers, several 1's appear in the testthat log, but no errors are actually reported.

> withr::with_options(
+   list(drake_no_processx = TRUE),
+   test_scenarios("local_mclapply_staged_9")
+ )
local_mclapply_staged_9: run
local_mclapply_staged_9: cache: ............................................................................................................................
local_mclapply_staged_9: command changes: ......1..
local_mclapply_staged_9: console: ................................................
local_mclapply_staged_9: custom caches: ..................................1....1.......
local_mclapply_staged_9: dbi cache: .......
local_mclapply_staged_9: dependencies: .............................................
local_mclapply_staged_9: deprecation: .................................................................
local_mclapply_staged_9: edge cases: .1.............................1....
local_mclapply_staged_9: envir: .......
local_mclapply_staged_9: examples: ...................................
local_mclapply_staged_9: expose imports: ...
local_mclapply_staged_9: full build: ...1......1..............................
local_mclapply_staged_9: future: .............
local_mclapply_staged_9: generate: ......................................
local_mclapply_staged_9: graph: .................................
local_mclapply_staged_9: hash: ...................
local_mclapply_staged_9: import file: ...1.1..1.1..
local_mclapply_staged_9: import object: 1.1.1.1.1..1..1..
local_mclapply_staged_9: intermediate file: 1..1..1..
local_mclapply_staged_9: knitr: ...................................................
local_mclapply_staged_9: lazy load: ...1..1........1....
local_mclapply_staged_9: Makefile: .........................................1.......11...
local_mclapply_staged_9: memory cache: ................................................
local_mclapply_staged_9: migrate: ..............................1..
local_mclapply_staged_9: mtcars example: .1......................
local_mclapply_staged_9: namespaced: ....1....1.
local_mclapply_staged_9: other features: .........................................................................1....................
local_mclapply_staged_9: parallel: ........................................................
local_mclapply_staged_9: queue: .......................................
local_mclapply_staged_9: reproducible random numbers: ..........................
local_mclapply_staged_9: retries and timeouts: ..............
local_mclapply_staged_9: strings: ............
local_mclapply_staged_9: testing: ..........................................................
local_mclapply_staged_9: tidy eval: ....
local_mclapply_staged_9: time: .................................
local_mclapply_staged_9: triggers: ............................................
local_mclapply_staged_9: with processx: 
local_mclapply_staged_9: workflow plan: .............................................................................................

══ DONE ════════════════════════════════════════════════════════════════════════
> 

Permissions warnings on Windows

From what I saw, some internal storr files are created with permissions 666.

> withr::with_options(
+   list(drake_no_processx = TRUE),
+   test_scenarios()
+ )

local_parLapply_1: run
local_parLapply_1: cache: ............................................................................................................................
local_parLapply_1: command changes: ........
local_parLapply_1: console: ................................................
local_parLapply_1: custom caches: .............................................
local_parLapply_1: dbi cache: .......
local_parLapply_1: dependencies: .............................................
local_parLapply_1: deprecation: .................................................................
local_parLapply_1: edge cases: ..................................
local_parLapply_1: envir: .......
local_parLapply_1: examples: ...................................
local_parLapply_1: expose imports: ...
local_parLapply_1: full build: .......................................
local_parLapply_1: future: WWWWW.............
local_parLapply_1: generate: ......................................
local_parLapply_1: graph: .................................
local_parLapply_1: hash: ...................
local_parLapply_1: import file: .........
local_parLapply_1: import object: ..........
local_parLapply_1: intermediate file: ......
local_parLapply_1: knitr: ...................................................
local_parLapply_1: lazy load: .................
local_parLapply_1: Makefile: ...................................................
local_parLapply_1: memory cache: ................................................
local_parLapply_1: migrate: ................................
local_parLapply_1: mtcars example: .......................
local_parLapply_1: namespaced: .........
local_parLapply_1: other features: .............................................................................................
local_parLapply_1: parallel: ......................................S......S...WWWWWWWWW.
local_parLapply_1: queue: .......................................
local_parLapply_1: reproducible random numbers: ..........................
local_parLapply_1: retries and timeouts: ..............
local_parLapply_1: strings: ............
local_parLapply_1: testing: ..........................................................
local_parLapply_1: tidy eval: ....
local_parLapply_1: time: .................................
local_parLapply_1: triggers: ............................................
local_parLapply_1: with processx:
local_parLapply_1: workflow plan: .............................................................................................

== Skipped =====================================================================

1. mclapply and lapply (@test-parallel.R#98) - On windows
2. staged mclapply and lapply (@test-parallel.R#126) - On windows

== Warnings ====================================================================

1. future package functionality (@test-future.R#12) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d034c26441\.drake/keys/mc_status/MQ', reason 'Permission denied'

2. future package functionality (@test-future.R#12) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d034c26441\.drake/keys/mc_status/MQ', reason 'Permission denied'

3. future package functionality (@test-future.R#12) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d034c26441\.drake/keys/mc_status/MQ', reason 'Permission denied'

4. future package functionality (@test-future.R#12) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d034c26441\.drake/keys/mc_status/MQ', reason 'Permission denied'

5. future package functionality (@test-future.R#12) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d034c26441\.drake/keys/mc_status/MQ', reason 'Permission denied'

6. worker & priority cols don't generate overt problems (@test-parallel.R#162) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d069062cd3\.drake/keys/mc_status/MQ', reason 'Permission denied'

7. worker & priority cols don't generate overt problems (@test-parallel.R#162) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d069062cd3\.drake/keys/mc_status/MQ', reason 'Permission denied'

8. worker & priority cols don't generate overt problems (@test-parallel.R#162) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d069062cd3\.drake/keys/mc_status/MQ', reason 'Permission denied'

9. worker & priority cols don't generate overt problems (@test-parallel.R#162) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d069062cd3\.drake/keys/mc_status/MQ', reason 'Permission denied'

10. worker & priority cols don't generate overt problems (@test-parallel.R#162) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d069062cd3\.drake/keys/mc_status/MQ', reason 'Permission denied'

11. worker & priority cols don't generate overt problems (@test-parallel.R#162) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d069062cd3\.drake/keys/mc_status/MQ', reason 'Permission denied'

12. worker & priority cols don't generate overt problems (@test-parallel.R#162) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d069062cd3\.drake/keys/mc_status/MQ', reason 'Permission denied'

13. worker & priority cols don't generate overt problems (@test-parallel.R#162) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d069062cd3\.drake/keys/mc_status/MQ', reason 'Permission denied'

14. worker & priority cols don't generate overt problems (@test-parallel.R#162) - cannot remove file 'C:\Users\me\AppData\Local\Temp\Rtmpqczmys\file33d069062cd3\.drake/keys/mc_status/MQ', reason 'Permission denied'

Warnings about renaming storr files

These warnings are extremely rare and are not associated with any errors. All I know is that they are generated inside an internal storr operation.

1: In file.rename(tmp, filename) :

  cannot rename file '/tmp/RtmpD6jv91/file104c4763d28d/.drake/scratch/502159fd1e199751.rds' to '/tmp/RtmpD6jv91/file104c4763d28d/.drake/data/502159fd1e199751.rds', reason 'No such file or directory'

2: In file.rename(tmp, filename) :

  cannot rename file '/tmp/RtmpD6jv91/file104c4763d28d/.drake/scratch/e7fd4aafb0066660.rds' to '/tmp/RtmpD6jv91/file104c4763d28d/.drake/data/e7fd4aafb0066660.rds', reason 'No such file or directory'
wlandau commented 6 years ago

One thing I think I can do is re-architect master/worker communication such that the master and workers do not try to write to the same files. See the separate_messages branch (mclapply.R) which is only a sketch so far. Essentially, the master assigns targets to one storr namespace, and the worker delivers them to another storr namespace. If the assigned target and delivered target disagree, the worker is assumed to be running. If they agree, then the worker is idle and the master can assign it another target.

wlandau commented 6 years ago

Update: the separate_messages branch seems to only create more of the same problems, and I am not sure I will return to it.

On the other hand, many of these quirks have to do with calls to set_progress(). If I suppress those warnings, as well as any potential warnings from the interprocess communication in mclapply.R, all the red flags go away except the "sh: 0: getcwd()" messages in the future-related scenarios.

wlandau commented 6 years ago

I am almost certain that most of this is due to https://github.com/richfitz/storr/issues/80. Running the long tests on Windows to confirm...

wlandau commented 6 years ago

If I am right, we don't need to worry so much: the data objects are small, they get written anyway, and the actual output from drake is just fine.

wlandau commented 6 years ago

But regardless, I think we should still fix https://github.com/richfitz/storr/issues/80.

wlandau commented 6 years ago

After 4590c186bec2e73049e1fefced654f46648b4095, I have seen consistently clean tests, enough to close this issue. We can probably start ramping up to the CRAN release of version 5.2.0.