Open · jgranstrom opened this issue 2 weeks ago
@tdyas do you have any instinct from above? I spent some time trying to isolate it further, but unfortunately I only got as far as suspecting cache concurrency.
Not really. My intuition tells me that Pants is probably writing to the same directory tree at the same time. Maybe take an strace or dtrace of the pantsd process to see what system calls are being issued and whether any writes to the same file overlap?
I think so too. I can't get ktrace or similar to play well with IntelliJ executing it. But given that it runs into attempts to write to read-only jars in .pants.d/jvm/resolves/*/libs, and that those jars end up with inconsistent flags (some read-only, some not), I think it is fair to assume that is the cause. However, I'm not sure where that happens, which I'd need to know in order to figure out a fix. Do you have any pointers on where the work gets split up such that it might cause these conflicts?
I figured these just produce identical digests, and I would have assumed the engine handles that when it comes to writing them to disk, but that doesn't appear to be safe here.
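To illustrate what I think is happening, here's a contrived, self-contained sketch (not Pants code; the path and the `materialize` helper are invented for the example): one materialization writes a file and marks it read-only, and a second materialization of the identical content then fails to open the same path for writing. The steps are sequenced only to make the demo deterministic; with two concurrent writers the same interleaving can happen by chance.

```rust
// Contrived demo (not Pants code): write a file, mark it read-only the way the
// resolved jars end up, then attempt a second identical materialization.
use std::fs;
use std::io::Write;
use std::os::unix::fs::PermissionsExt;

// Hypothetical stand-in for materializing a digest's bytes to a path.
fn materialize(path: &str, bytes: &[u8]) -> std::io::Result<()> {
    let mut file = fs::OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    file.write_all(bytes)?;
    // Mark the materialized file read-only, like the jars under .pants.d end up.
    fs::set_permissions(path, fs::Permissions::from_mode(0o444))
}

fn main() {
    let path = "/tmp/izumi-reflect-demo.jar";
    let _ = fs::remove_file(path);
    materialize(path, b"jar bytes").expect("first materialization succeeds");
    // Run as a non-root user: the second writer now hits the read-only file and
    // fails with os error 13, matching the IntrinsicError above.
    match materialize(path, b"jar bytes") {
        Err(e) => println!("second materialization failed: {e}"),
        Ok(()) => println!("unexpectedly succeeded (running as root?)"),
    }
}
```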
@tdyas I did manage to get some fs events for a jar hitting it
04:23:02 open pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar 0.003678 python3.9
04:23:02 open pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar 0.003448 python3.9
04:23:02 chmod pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar 0.000016 python3.9
04:23:02 chmod pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar 0.000010 python3.9
04:23:11 chmod pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar 0.000238 python3.9
04:23:11 open pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar 0.000316 python3.9
04:23:11 WrData[A] pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar 0.000093 W python3.9
04:23:11 open pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar 0.000008 python3.9
04:23:11 open pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar 0.000008 python3.9
04:23:11 open pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar 0.000009 python3.9
04:23:11 open pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar 0.000009 python3.9
native_engine.IntrinsicError: Error copying bytes from
/Users/johngranstrom/.cache/pants/lmdb_store/immutable/files/e8/e805c5ccc0241ceab8d55795633f909eafe1826bdf32214fc2869ffaee61f086
to /Users/johngranstrom/code/repros/bsprepro3/.pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar:
Permission denied (os error 13) (-32602)
The two open+chmod pairs at the start look to me like two attempts to operate on the same file at the same time, as suspected.
Looking at the `write_digest` function in Rust and the Python call site, neither location debounces calls to write the same digest out to the same location. Some solution ideas:
- Take a lock based on the particular digest / location pair being written out and do not write the same digest to the same location concurrently.
- Use the engine itself to debounce writes (and cache successful writes) by implementing a new type of graph node to represent writing a digest (plus the corresponding intrinsic rule).
The second option feels like it would be the most correct, but the first is probably the easiest to get in (a rough sketch of the lock-based approach is below).
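To make the first option a bit more concrete, here is a rough sketch with invented names (`WriteDebouncer`, a `String` digest) rather than the engine's real types, and a signature that differs from the actual `write_digest`: an async lock keyed on the digest/destination pair so that identical writes are serialized instead of interleaving their open/write/chmod steps.

```rust
// Rough sketch only: a per-(digest, destination) async lock so that identical
// materializations run one at a time. Names and types are made up; the real
// write_digest lives in the Rust engine and has a different signature.
use std::collections::HashMap;
use std::future::Future;
use std::path::PathBuf;
use std::sync::Arc;
use tokio::sync::Mutex;

// Hypothetical stand-in for the engine's Digest type.
type Digest = String;

#[derive(Clone, Default)]
struct WriteDebouncer {
    // One async mutex per (digest, destination) pair.
    locks: Arc<Mutex<HashMap<(Digest, PathBuf), Arc<Mutex<()>>>>>,
}

impl WriteDebouncer {
    async fn write_digest<F, Fut>(&self, digest: Digest, dest: PathBuf, do_write: F) -> std::io::Result<()>
    where
        F: FnOnce() -> Fut,
        Fut: Future<Output = std::io::Result<()>>,
    {
        // Look up (or create) the lock for this particular digest/destination pair.
        let key_lock = {
            let mut locks = self.locks.lock().await;
            locks.entry((digest, dest)).or_default().clone()
        };
        // Hold the per-key lock across the whole write, so a concurrent request for
        // the same pair cannot observe a half-written or already-chmod'ed file.
        let _guard = key_lock.lock().await;
        do_write().await
    }
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let debouncer = WriteDebouncer::default();
    let dest = PathBuf::from("lib/some.jar");
    // Two concurrent requests for the same digest/destination now run one at a time.
    let a = debouncer.write_digest("digest-a".to_string(), dest.clone(), || async {
        // Real code would copy bytes out of the store and set permissions here.
        println!("first writer");
        Ok::<(), std::io::Error>(())
    });
    let b = debouncer.write_digest("digest-a".to_string(), dest.clone(), || async {
        println!("second writer");
        Ok::<(), std::io::Error>(())
    });
    let (ra, rb) = tokio::join!(a, b);
    ra.and(rb)
}
```

A natural refinement would be to also record successful writes under the same key so that repeated identical requests become no-ops, which edges toward what the second option would get for free from the graph.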
Eeep, this seems like a fairly major issue, and we're lucky not to have hit it sooner.
I will look into implementing something here. But first I will try to create a repro.
To clarify, this has nothing to do with any cache, as far as I can tell? This is straightforward write contention.
Correct. This appears to just be write contention.
@benjyw 🙏 I provided a minimal repro at https://github.com/jgranstrom/pants-repro-bsp-race for reproing the scenario
Yeah, I suspected the cache at first simply because otherwise I'd have expected it to happen more often already. It can for sure be just write contention.
Describe the bug
Projects with multiple BSP groups seem to hit a cache race when materializing jars across the groups. I have a repro here, and opening it in IntelliJ will start failing, if not immediately then after a few re-requests from BSP, due to read-only files being hit in the target directory for BSP jars.
Seen in BSP logs from IntelliJ:
Flags on the target BSP jars are all over the place:
Pants version: 2.22
OS: macOS