mitsuba-renderer / drjit

Dr.Jit — A Just-In-Time-Compiler for Differentiable Rendering

Avoid "Disk cache database error" when running multiple instances on HPC #115

Closed · dekuenstle closed this issue 1 year ago

dekuenstle commented 1 year ago

While attempting to render many images simultaneously with Mitsuba on an HPC system, we found that Dr.Jit crashes when multiple renderings are scheduled to run on the same compute node:

Critical Dr.Jit compiler failure: jit_optix_check(): API error 7012 (OPTIX_ERROR_DISK_CACHE_DATABASE_ERROR): "Disk cache database error" in /project/ext/drjit-core/src/optix_api.cpp:382.

I assume that Dr.Jit tries to write a cache to a location where a previously started process has already written (and locked) its cache. Could you please help us debug this, e.g. by answering: (a) where are the caches stored, and (b) is there any way to configure the cache location?

Thanks in advance!

wjakob commented 1 year ago

Looks like your ~/.drjit/optix7cache.db file was corrupted (which should not happen since OptiX locks the file when it is concurrently accessed). Is it possible that you are using NFS or a similar network file system? Possibly that defeats the mechanism.
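As a quick check, the following sketch prints which file system backs the cache directory. It assumes a Linux node with GNU coreutils (`df`) available and that the cache lives in `~/.drjit` as mentioned above; a type such as `nfs` or `nfs4` in the output would point to the locking issue suspected here.

```python
# Sketch: report the file system type backing the Dr.Jit cache directory.
# Assumes Linux + GNU coreutils; 'nfs'/'nfs4' in the Type column suggests
# that file locking on the OptiX disk cache may be unreliable.
import os
import subprocess

cache_dir = os.path.expanduser("~/.drjit")   # cache location mentioned in this thread
if not os.path.isdir(cache_dir):             # fall back to $HOME if it was not created yet
    cache_dir = os.path.expanduser("~")

print(subprocess.run(["df", "-T", "--", cache_dir],
                     capture_output=True, text=True).stdout)
```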

dekuenstle commented 1 year ago

Thanks for the prompt response! I assume the problem is that ~/.drjit is shared across all nodes and our file system does not handle the locking properly. Is there a configuration option (e.g. an environment variable for the cache location) so that every instance can write to its own directory?

wjakob commented 1 year ago

The path is computed here: https://github.com/mitsuba-renderer/drjit-core/blob/master/src/init.cpp#L88, and we don't provide a good way of customizing it at the moment. You could try overriding the HOME environment variable as a workaround.
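A minimal sketch of that workaround, assuming the `~/.drjit` cache path is derived from `$HOME` when Dr.Jit initializes (as this thread suggests). The node-local scratch path is an assumption about the cluster layout; replace it with whatever per-job local storage your scheduler provides.

```python
# Sketch of the HOME workaround, assuming the ~/.drjit cache path is derived
# from $HOME when Dr.Jit initializes. The scratch path below is an assumption
# about your cluster; use whatever node-local storage your scheduler provides.
import os

scratch_home = f"/tmp/drjit-home-{os.getpid()}"   # assumed node-local directory
os.makedirs(scratch_home, exist_ok=True)
os.environ["HOME"] = scratch_home                 # must be set before Dr.Jit is imported

import drjit as dr    # noqa: E402
import mitsuba as mi  # noqa: E402

mi.set_variant("cuda_ad_rgb")  # e.g. an OptiX-backed variant that uses the disk cache
```

The same effect can be achieved from a batch script by exporting HOME to a node-local directory before launching Python, with the caveat that every other tool in the job will then also read its dot-files from that directory.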

dekuenstle commented 1 year ago

Thanks, overriding HOME works as a workaround for us, but it could have side effects for others.

You might consider introducing a custom environment variable for the cache location, because this shared-HOME caching is the only issue we observed when massively parallelizing Mitsuba/Dr.Jit on HPC clusters (and many HPC clusters have such a shared HOME). Otherwise, it works like a charm! Thanks for your work and the quick support :-)
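Purely as an illustration of this suggestion, here is a user-side sketch of how such a variable might be used. `DRJIT_CACHE_DIR` is hypothetical and is not read by any released Dr.Jit version; the sketch only shows what the proposed interface could look like for per-job cache isolation.

```python
# HYPOTHETICAL: DRJIT_CACHE_DIR is not an existing Dr.Jit environment
# variable; this only illustrates the interface proposed above, which
# would isolate the cache per job without touching HOME.
import os

cache_dir = f"/tmp/drjit-cache-{os.getpid()}"   # assumed node-local path
os.makedirs(cache_dir, exist_ok=True)
os.environ["DRJIT_CACHE_DIR"] = cache_dir       # hypothetical variable

import drjit as dr  # would (hypothetically) pick up the cache directory at init
```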

wjakob commented 1 year ago

Out of interest, this is an HPC system with OptiX-capable GPUs? It sounds really fancy!

dekuenstle commented 1 year ago

It's a cluster that is typically used with deep learning frameworks (TensorFlow, PyTorch), so the nodes are equipped with RTX 2080 Ti or V100 GPUs; they work well for rendering many variations of a scene too ;-)

n-kubiak commented 1 year ago

Hello, I'm facing the same problem in more or less the same HPC setting. I was wondering if you're planning to address this in a future Mitsuba/Dr.Jit release, or should I try a workaround instead?

Thanks, NK

njroussel commented 1 year ago

@n-kubiak This is not something that's planned in the next release.

If you end up writing a patch for this, we'd welcome a PR :smiley: