pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0
3.32k stars 636 forks source link

Possible cache race on jvm dependency artifacts via bsp #21532

Open jgranstrom opened 2 weeks ago

jgranstrom commented 2 weeks ago

Describe the bug

Projects with multiple bsp groups seem to hit a cache race when materializing jars across the groups. I have a repro here and opening it in intellij will if not immediately after a few re-requests from bsp start failing due to read-only files hit in the target directory for bsp jars.

Seen in BSP logs from intellij:

Engine traceback:
  in root
    ..
  in pants.backend.scala.bsp.rules.bsp_scalac_options_request
    ..

...

native_engine.IntrinsicError: Error copying bytes from /Users/johngranstrom/.cache/pants/lmdb_store/immutable/files/e8/e805c5ccc0241ceab8d55795633f909eafe1826bdf32214fc2869ffaee61f086 
to /Users/johngranstrom/code/repros/bsprepro3/.pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar: 
Permission denied (os error 13) (-32602)

Flags on the target bsp jars are all over:

❯ ls -la .pants.d/bsp/jvm/resolves/jvm-default/lib/                                                                                                                                      content-staging-apps_us
total 21816
drwxr-xr-x  8 johngranstrom  staff      256 Oct 16 17:35 .
drwxr-xr-x  4 johngranstrom  staff      128 Oct 16 17:35 ..
-rw-r--r--  1 johngranstrom  staff    94887 Oct 16 17:35 dev.zio_izumi-reflect-thirdparty-boopickle-shaded_2.13_2.3.9.jar
-r-xr-xr-x  1 johngranstrom  staff   592307 Oct 16 17:32 dev.zio_izumi-reflect_2.13_2.3.9.jar
-rw-r--r--  1 johngranstrom  staff    55959 Oct 16 17:35 dev.zio_zio-internal-macros_2.13_2.1.11.jar
-rw-r--r--  1 johngranstrom  staff    26874 Oct 16 17:35 dev.zio_zio-stacktracer_2.13_2.1.11.jar
-r-xr-xr-x  1 johngranstrom  staff  4462042 Oct 16 17:32 dev.zio_zio_2.13_2.1.11.jar
-r-xr-xr-x  1 johngranstrom  staff  5924141 Oct 16 17:33 org.scala-lang_scala-library_2.13.14.jar

Pants version 2.22

OS MacOS

jgranstrom commented 2 weeks ago

@tdyas do you have any instinct from above? I spent some time trying to isolate it further but I just got to possibly cache concurrency without much more unfortunately.

tdyas commented 5 days ago

@tdyas do you have any instinct from above? I spent some time trying to isolate it further but I just got to possibly cache concurrency without much more unfortunately.

Not really. My intuition tells me that Pants is probably writing to the same directory tree at the same time. Maybe take a strace or dtrace of the pantsd process and see what system calls are being issued and see if any writes overlap to the same file?

jgranstrom commented 5 days ago

@tdyas do you have any instinct from above? I spent some time trying to isolate it further but I just got to possibly cache concurrency without much more unfortunately.

Not really. My intuition tells me that Pants is probably writing to the same directory tree at the same time. Maybe take a strace or dtrace of the pantsd process and see what system calls are being issued and see if any writes overlap to the same file?

I think so too. I can't get ktrace or simliar to play well with intellij executing it. But from it running into trying to write to read-only jars in .pants.d/jvm/resolves/*/libs and them having inconsistent flags, some being read-only some not, I think that it is fair to assume that is the cause. However, I'm not sure where that happens in order to figure out a fix. Do you have any pointers on where it is getting split up and might cause the conflicts?

jgranstrom commented 5 days ago

I figured these just create digests that are identical, and would have assumed that would be handled by the engine when it comes to writing to disk. But here that doesn't seem safe

jgranstrom commented 4 days ago

@tdyas I did manage to get some fs events for a jar hitting it

04:23:02  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.003678   python3.9
04:23:02  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.003448   python3.9
04:23:02  chmod             pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000016   python3.9
04:23:02  chmod             pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000010   python3.9
04:23:11  chmod             pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000238   python3.9
04:23:11  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000316   python3.9
04:23:11    WrData[A]       pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000093 W python3.9
04:23:11  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000008   python3.9
04:23:11  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000008   python3.9
04:23:11  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000009   python3.9
04:23:11  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000009   python3.9

native_engine.IntrinsicError: Error copying bytes from 
/Users/johngranstrom/.cache/pants/lmdb_store/immutable/files/e8/e805c5ccc0241ceab8d55795633f909eafe1826bdf32214fc2869ffaee61f086 
to /Users/johngranstrom/code/repros/bsprepro3/.pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar: 
Permission denied (os error 13) (-32602)
jgranstrom commented 4 days ago

@tdyas I did manage to get some fs events for a jar hitting it

04:23:02  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.003678   python3.9
04:23:02  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.003448   python3.9
04:23:02  chmod             pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000016   python3.9
04:23:02  chmod             pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000010   python3.9
04:23:11  chmod             pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000238   python3.9
04:23:11  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000316   python3.9
04:23:11    WrData[A]       pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000093 W python3.9
04:23:11  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000008   python3.9
04:23:11  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000008   python3.9
04:23:11  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000009   python3.9
04:23:11  open              pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar    0.000009   python3.9

native_engine.IntrinsicError: Error copying bytes from 
/Users/johngranstrom/.cache/pants/lmdb_store/immutable/files/e8/e805c5ccc0241ceab8d55795633f909eafe1826bdf32214fc2869ffaee61f086 
to /Users/johngranstrom/code/repros/bsprepro3/.pants.d/bsp/jvm/resolves/jvm-default/lib/dev.zio_izumi-reflect_2.13_2.3.9.jar: 
Permission denied (os error 13) (-32602)

The two open+chmods at the start seem to me like trying to operate on the same file twice at the same time as suspected

tdyas commented 4 days ago

Looking at the write_digest function in Rust and the Python call site, neither location debounces calls to write the same digest out to the same location.

Some solutions ideas:

  1. Take a lock based on the particular digest / location pair being written out and do not write the same digest to the same location concurrently.
  2. Use the engine itself to debounce writes (and cache successful writes) by implementing a new type of graph node to represent writing a digest (plus the corresponding intrinsic rule).
jgranstrom commented 4 days ago

Looking at the write_digest function in Rust and the Python call site, neither location debounces calls to write the same digest out to the same location.

Some solutions ideas:

  1. Take a lock based on the particular digest / location pair being written out and do not write the same digest to the same location concurrently.
  2. Use the engine itself to debounce writes (and cache successful writes) by implementing a new type of graph node to represent writing a digest (plus the corresponding intrinsic rule).

Feels like the second option would be the most correct, but the first probably the easiest to get in

benjyw commented 3 days ago

Eeep, this seems like a fairly major issue, and we're lucky not to have hit it sooner.

I will look into implementing something here. But first I will try and create a repro.

benjyw commented 3 days ago

To clarify, this is nothing to do with any cache, as far as I can tell? This is straightforward write contention.

tdyas commented 3 days ago

To clarify, this is nothing to do with any cache, as far as I can tell? This is straightforward write contention.

Correct. This appears to just be write contention.

jgranstrom commented 14 hours ago

Eeep, this seems like a fairly major issue, and we're lucky not to have hit it sooner.

I will look into implementing something here. But first I will try and create a repro.

@benjyw 🙏 I provided a minimal repro at https://github.com/jgranstrom/pants-repro-bsp-race for reproing the scenario

To clarify, this is nothing to do with any cache, as far as I can tell? This is straightforward write contention.

Yeah, I suspected cache at first simply because it didn't happen more often already. It can for sure be just write contention.