vercel / turborepo

Build system optimized for JavaScript and TypeScript, written in Rust
https://turbo.build/repo/docs
MIT License

[turborepo] Build step output files not restored with message "Skipping cache check for %v, outputs have not changed since previous run" #4137

Open kheyse-oqton opened 1 year ago

kheyse-oqton commented 1 year ago

What version of Turborepo are you using?

1.8.3

What package manager are you using / does the bug impact?

Yarn v1

What operating system are you using?

Mac

Describe the Bug

When rerunning build after cleaning all previous build output files, the output files of some packages' build scripts are randomly not created or restored. Turbo reports "Skipping cache check for %v, outputs have not changed since previous run". This appears to be wrong, as the output files were deleted and have therefore definitely changed.

Expected Behavior

Running any yarn script through turbo should always result in that script's output files being present after completion.

To Reproduce

Steps:

  1. yarn clean
  2. rm -r node_modules/.cache/turbo
  3. turbo daemon restart
  4. turbo build

Outcome: All build steps report "cache miss, executing" and are executed. The output files of all build steps are created.

Continued steps:

  1. yarn clean
  2. rm -r node_modules/.cache/turbo
  3. turbo build

Outcome: Some packages' build steps report "Skipping cache check for *#build, outputs have not changed since previous run.". The output files of these build steps are neither created (the build command is not executed) nor restored from cache. The build of the project fails because some packages are corrupt.

Our only workaround is to run turbo with --no-daemon.
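
For example:

turbo run build --no-daemon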

Our turbo.json roughly looks like this:

{
  "$schema": "https://turbo.build/schema.json",
  "pipeline": {
    "clean": {
      "cache": false
    },
    "clean:build": {
      "cache": false
    },
    "build": {
      "dependsOn": ["^build"],
      "outputs": ["lib/**", "dist/**", "build/**"]
    }
  }
}

We have about 70 packages in our monorepo. The build step of each package is a basic tsc -b. Unfortunately I cannot share access to our repo, nor have I been able to recreate the issue in a smaller reproduction repo.

I'm running on an M1 mac with mac os 13.2.1.

Reproduction Repo

No response

kheyse-oqton commented 1 year ago

From the code (https://github.com/vercel/turbo/blob/4dcf1337781c7a5ba93e1fce347448f5f59310bd/cli/internal/runcache/runcache.go#L126-L146) I understand that "skipping cache check" means turbo will not check whether it needs to restore from the cache (https://github.com/vercel/turbo/blob/4dcf1337781c7a5ba93e1fce347448f5f59310bd/cli/internal/runcache/runcache.go#L130) or rerun the step (https://github.com/vercel/turbo/blob/4dcf1337781c7a5ba93e1fce347448f5f59310bd/cli/internal/runcache/runcache.go#L137) at all. It simply decides that, because the output files have not been touched since the last time the step was run, nothing needs to be done. And because the daemon makes an incorrect decision on that point (the output files were deleted, so they have definitely changed), the outcome (the output files) of the build step is incorrect or missing. Disabling the daemon disables that specific check and thus works around the bug.

What I don't understand from the code, though, is how checking whether the output files have changed can be sufficient on its own; surely the input files need to be watched too to determine whether a build step needs to be rerun.
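
To make that concrete, here is a paraphrased sketch of that decision in Go (identifiers are illustrative, not the literal runcache.go names):

package runcache

import "fmt"

// OutputWatcher models the daemon-side tracking: it reports which of a
// task's output globs have changed since the task last produced them.
type OutputWatcher interface {
    GetChangedOutputs(hash string, outputGlobs []string) []string
}

func restoreOutputs(w OutputWatcher, taskID, hash string, globs []string) {
    changed := w.GetChangedOutputs(hash, globs)
    if len(changed) == 0 {
        // The branch we are hitting: the daemon wrongly reports "no
        // changes" even though the outputs were deleted, so nothing is
        // restored from cache and the task is not rerun.
        fmt.Printf("Skipping cache check for %s, outputs have not changed since previous run.\n", taskID)
        return
    }
    // Otherwise turbo goes on to consult the cache for the changed globs
    // and either replays the cached outputs or reruns the task.
    fmt.Printf("checking cache for %s (%d changed output globs)\n", taskID, len(changed))
}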

AuthorityFX commented 1 year ago

I have the same issue on v1.8.2

  1. git clean -fdx
  2. pnpm i
  3. npx turbo run build

The hash checks should not be skipped without an explicit opt-out.

I was on v1.2.16 before upgrading to v1.8.2.

lstkz commented 1 year ago

I have the same issue, but it occurs only when two tasks share the same output folder.

    "build:mjs2": {
      "outputs": ["dist/**/*.js"]
    },
    "build:types2": {
      "outputs": ["dist/**/*.d.ts", "dist/**/*.d.ts.map"]
    },
    "build": {
      "dependsOn": ["build:types2", "build:mjs2"]
    }

v1.8.3

arlyon commented 1 year ago

Hi, this is a pretty important thing to get right, so thanks for taking the time to report it. I am having issues reproducing your problem; are you able to provide a repository that exhibits this behaviour?

LukeTillman commented 1 year ago

I was able to reproduce this with a pretty simple setup:

https://github.com/LukeTillman/turborepo-daemon-bug

I put some details in the README, but to repeat them here, I think this might be timing-related somehow and so there doesn't seem to be a deterministic way to trigger it. In that repository, I ran the build script multiple times in a row and it would often take 10+ times before the bug would happen. As I indicated in the README, I think you can tell when it happens by looking at the log output for the "skipping cache check" message.

EDIT: I just bumped Turbo to the latest release (1.8.5) from 1.8.3 and I'm still able to reproduce it.

kalvenschraut commented 1 year ago

Similar issue here. It only seems to occur when a clean was done right before rebuilding, which my developers do a lot when switching branches. If the specific package/app has the same hash between branches, it can seemingly randomly fail. Unfortunately, we're running without the daemon for everything going forward until this is resolved.

How is turbo determining whether the outputs match? The variables in that code snippet seem to imply some kind of file system watcher? If so, is the watcher too slow to notify turbo that the output has changed, because both clean and build are run within the same make target?

arlyon commented 1 year ago

I am in the process of porting the daemon server to Rust (in line with our general porting effort), which is what is responsible for this issue (the daemon, not the Rust). For some reason, certain globs are not being reported correctly in some cases, and the client is skipping restores. I assume it has to do with some files changing after the delta has been reported. macOS (more specifically, APFS) likes to report files being saved before they actually are... 😭 meaning the daemon can sometimes end up registering the wrong files as the outputs of a task. I am going to make sure that this issue (thanks for that excellent repro, @LukeTillman!) is covered when doing so.

The good/bad news is that I can still recreate it on the Rust version, so we may have to have more stringent requirements for filesystem flushes on Mac...

kheyse-oqton commented 1 year ago

I appreciate that this is on the roadmap to be fixed. From your description I gather that a fix is not very short-term, though? In the short term, would it be an option to make the daemon configurable in turbo.json? https://github.com/vercel/turbo/issues/3455 At this point it's quite a bad experience for the developers in our monorepo. They constantly need to remember to add --no-daemon to their commands, and we still get lots of questions from engineers who are stuck because they forgot to disable the daemon and their build output is broken.

kalvenschraut commented 1 year ago

I have package.json scripts for everything, so most of my developers just run pnpm build, which is now just an alias for pnpm turbo run --no-daemon build.
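
Roughly, the alias is just (a sketch, not my exact scripts block):

{
  "scripts": {
    "build": "turbo run --no-daemon build"
  }
}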

sonyseng commented 1 year ago

I saw that this is already being looked at, but I am also seeing this issue on v1.8.6 on Windows 11, using Git Bash in Windows Terminal.

arlyon commented 1 year ago

So, the good news is that it's not daemon related at all, but more build graph related. I can see the repro repo has multiple outputs in the same folder which I think may be causing race conditions when caching. Do you have something similar, @sonyseng?

LukeTillman commented 1 year ago

So, the good news is that it's not daemon related at all, but more build graph related. I can see the repro repo has multiple outputs in the same folder which I think may be causing race conditions when caching. Do you have something similar, @sonyseng?

Not sure how close you are to figuring out where the race condition is, but I pushed up a branch in my repro repository where I split the build output into two subfolders of the overall dist folder and was still able to reproduce it:

https://github.com/LukeTillman/turborepo-daemon-bug/tree/seperate-build-output

The clean task there still just removes the overall dist folder though. Let me know what else might be helpful in tracking it down.

sonyseng commented 1 year ago

So, the good news is that it's not daemon related at all, but more build graph related. I can see the repro repo has multiple outputs in the same folder which I think may be causing race conditions when caching. Do you have something similar, @sonyseng?

That's correct. I have a root dist folder with output from multiple packages as subfolders. Similar to @LukeTillman just mentioned

StudioFlow commented 1 year ago

Hi, same issue on turbo v1.9.3, yarn v1.22, Ubuntu 20.04. Were you able to reproduce? Do you have a workaround to suggest? Regards,

s1mak commented 1 year ago

Hello, I had the same issue: after cleaning my whole repo (a task removing all ./node_modules + ./package-lock.json + .turbo + ./dist) I was still getting a cache hit (from hell!). But version 1.9.4 seems to fix it!

So my boss is allowing us to continue the POC for the company <3

Have a nice day

LukeTillman commented 1 year ago

Quick update: I bumped my repository with the reproduction above to the latest (1.9.8) and can still reproduce this.

$ turbo run build --filter=lib-a && ls lib-a/dist
• Packages in scope: lib-a
• Running build in 1 packages
• Remote caching disabled
lib-a:clean: cache bypass, force executing fa24bece907839e7
lib-a:clean:
lib-a:clean: > lib-a@0.1.0 clean
lib-a:clean: > rimraf dist/
lib-a:clean:
lib-a:build:esm: Skipping cache check for lib-a#build:esm, outputs have not changed since previous run.
lib-a:build:esm: cache hit, replaying output fa125029ffac1b06
lib-a:build:esm:
lib-a:build:esm: > lib-a@0.1.0 build:esm
lib-a:build:esm: > swc ./src -d ./dist --config-file ../.swcrc
lib-a:build:esm:
lib-a:build:esm: Successfully compiled: 1 file with swc (29.94ms)
lib-a:build:types: Skipping cache check for lib-a#build:types, outputs have not changed since previous run.
lib-a:build:types: cache hit, replaying output de724cb17bc6e142
lib-a:build:types:
lib-a:build:types: > lib-a@0.1.0 build:types
lib-a:build:types: > tsc
lib-a:build:types:

 Tasks:    3 successful, 3 total
Cached:    2 cached, 3 total
  Time:    307ms

ls: lib-a/dist: No such file or directory

AuthorityFX commented 1 year ago

I've tried a number of versions from v1.2.16 to v1.10.1, and this issue persists.

The issue is that the cache check is being skipped, just as shown here: https://github.com/vercel/turbo/issues/4137#issuecomment-1559551654

The only workaround is to run the turbo pipeline with the --force flag.
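
For example (task name illustrative):

turbo run build --force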

AuthorityFX commented 1 year ago

I believe this is the PR that introduced the regression: https://github.com/vercel/turbo/pull/1269

So versions after v1.3.1 will have this issue.

LukeTillman commented 1 year ago

Another quick update: I tried my repository (see above) with version 1.10.7, and the message in the output seems to have changed, but the bug remains the same:

$ turbo run build --filter=lib-a && ls lib-a/dist
• Packages in scope: lib-a
• Running build in 1 packages
• Remote caching disabled
lib-a:clean: cache bypass, force executing 577d0d931492f677
lib-a:clean:
lib-a:clean: > lib-a@0.1.0 clean
lib-a:clean: > rimraf dist/
lib-a:clean:
lib-a:build:esm: cache hit (outputs already on disk), replaying logs de912c80b8b1e466
lib-a:build:esm:
lib-a:build:esm: > lib-a@0.1.0 build:esm
lib-a:build:esm: > swc ./src -d ./dist --config-file ../.swcrc
lib-a:build:esm:
lib-a:build:esm: Successfully compiled: 1 file with swc (29.53ms)
lib-a:build:types: cache hit, replaying logs d306536c4194e8d3
lib-a:build:types:
lib-a:build:types: > lib-a@0.1.0 build:types
lib-a:build:types: > tsc
lib-a:build:types:

 Tasks:    3 successful, 3 total
Cached:    2 cached, 3 total
  Time:    372ms

index.d.ts     index.d.ts.map

Looks like the message when you trigger the bug is now:

cache hit (outputs already on disk), replaying logs

Anything we can do to help track this down? Happy to assist with reproducing, ruling things out, etc. however I can!

martinnabhan commented 1 year ago

This is happening to us too, which unfortunately means we can't cache some commands. For some reason we get the "outputs already on disk" message even if the inputs are edited or the outputs have been deleted.

chinanderm commented 8 months ago

Running into this issue as well for multiple apps on Render.com.

arlyon commented 8 months ago

I am going on the bug offensive at the moment and would like to tackle this. I will probably get round to it Friday, so if anyone is able to produce a reproduction repository for me before then, I would really appreciate it.

LukeTillman commented 8 months ago

@arlyon I updated my reproduction repository from above to the latest turbo (1.11.3) and am still able to reproduce it. It still doesn't happen every time you run the command, but if you run it repeatedly, eventually you'll see output like this:

$ turbo run build --filter=lib-a && ls lib-a/dist
• Packages in scope: lib-a
• Running build in 1 packages
• Remote caching disabled
lib-a:clean: cache bypass, force executing a81b2932357f1f45
lib-a:clean:
lib-a:clean: > lib-a@0.1.0 clean
lib-a:clean: > rimraf dist/
lib-a:clean:
lib-a:build:esm: cache hit (outputs already on disk), replaying logs 4e8fb4177844dd16
lib-a:build:esm:
lib-a:build:esm: > lib-a@0.1.0 build:esm
lib-a:build:esm: > swc ./src -d ./dist --config-file ../.swcrc
lib-a:build:esm:
lib-a:build:esm: Successfully compiled: 1 file with swc (10.1ms)
lib-a:build:types: cache hit, replaying logs 1b5a331dd0feb8a1
lib-a:build:types:
lib-a:build:types: > lib-a@0.1.0 build:types
lib-a:build:types: > tsc
lib-a:build:types:

 Tasks:    3 successful, 3 total
Cached:    2 cached, 3 total
  Time:    346ms

index.d.ts     index.d.ts.map

You can see from the ls output that only 2 out of the 4 files that should be there are present (in this example, the .js and .js.map files are missing from the build:esm task).

maschwenk commented 8 months ago

@LukeTillman is there any chance there is some kind of race condition between your tasks? Maybe worth posting the turbo config. If two tasks are writing to dist, that can be problematic. Is this reproducible if you force Turbo to run serially? We run a 240+ package monorepo; if this were an easily reproducible bug, I think we'd have seen it as well. Sorry, just spitballin'.

Also might be helpful to run with --summary flag and add that here

LukeTillman commented 8 months ago

@maschwenk I did some investigation way back in my original reproduction to see if splitting the output into subdirectories helped at all (see https://github.com/vercel/turbo/issues/4137#issuecomment-1491001975) but it didn't at that time. I could certainly update that branch and do it again if that's helpful for reproducing.

is there any chance there is some kind of race condition between your tasks?

Definitely possible, but I don't think so? Here's the task graph (running with the --graph flag):

[image: task graph]

Is this reproducible if you force Turbo to run serially?

I actually hadn't tried that yet, but now that you suggested it I gave it a shot (running the same command with --concurrency=1). I was able to reproduce it, but just like in the past, it doesn't happen every run, maybe 1 in 10.

Maybe worth posting the turbo config.

The setup in the repository was the simplest thing I could come up with to reproduce it (including having only a single package), but here it is for posterity:

{
  "$schema": "https://turborepo.org/schema.json",
  "pipeline": {
    "build": {
      "dependsOn": ["^build", "clean", "build:esm", "build:types"],
      "outputs": ["dist/**"]
    },
    "build:esm": {
      "dependsOn": ["clean"],
      "outputs": ["dist/**/*.js", "dist/**/*.js.map"]
    },
    "build:types": {
      "dependsOn": ["^build", "clean"],
      "outputs": ["dist/**/*.d.ts", "dist/**/*.d.ts.map"]
    },

    "clean": {
      "dependsOn": [],
      "cache": false
    }
  }
}

We run a 240+ package monorepo; if this were an easily reproducible bug, I think we'd have seen it as well. Sorry, just spitballin'.

No need to apologize; I agree it definitely doesn't seem to be easily reproducible, or there's something strange about this setup vs whatever you all are doing that makes it behave this way. Maybe it's the shared output folder, even though the files being output are discrete? The fact that it doesn't happen every time you run the command certainly doesn't help either. We have a (private) repository with tens of packages, and it popped up frequently enough that I spent some time trying to reproduce it, which is what ended up in my repro repository.

One thing that is kinda strange here is that the build task doesn't really exist as a script in the package.json; it's just a turbo.json definition with dependencies. I say kinda strange because, while it's weird that a script doesn't exist, the docs do talk about creating tasks like this. The way we ended up "working around" this issue was by defining a real build script in the package.json that uses concurrently to run the two build tasks (build:esm and build:types) in parallel, rather than letting Turborepo handle the parallelism. Not ideal, but it works (rough sketch below). 🤷‍♂️
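
For illustration, a sketch of that workaround in the package's package.json (script bodies assumed to match the repro repo, not our exact file):

{
  "scripts": {
    "clean": "rimraf dist/",
    "build:esm": "swc ./src -d ./dist --config-file ../.swcrc",
    "build:types": "tsc",
    "build": "concurrently \"npm:build:esm\" \"npm:build:types\""
  }
}

With this, turbo sees a single build task writing to dist, and the parallelism happens inside that one task instead of across two cached tasks.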

Also might be helpful to run with --summary flag and add that here

Here it is in a gist (from a run that exhibited the bug): https://gist.github.com/LukeTillman/f4d8f4fc687cd22ff882387d018c04f4

One other random thought: the cache hit (outputs already on disk) message in the logs seems to be the thing that's always present whenever the bug is triggered and absent when it's not. Whatever code path outputs that message is probably involved in whatever is going on. I know we mentioned this in previous comments, but it's probably worth repeating since it's been a bit.
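
If it helps with reproducing, a quick bash loop like this (paths from my repro repo) reruns the build until that message appears:

until turbo run build --filter=lib-a 2>&1 | grep -q "outputs already on disk"; do :; done
ls lib-a/dist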

Like I mentioned, we managed to work around it, but I'd definitely be happy to help with reproducing it, narrowing things down, updating docs if we're doing something wrong, etc. Bonus points if there's a fix and we can get rid of our workaround. Let me know how I can help!

Anulesh commented 6 months ago

I am also getting the same issue (cache hit, replaying logs f43ed37b94f47466), where it is not creating any build directory at all and assumes it is already there. I'm using the --force option for now to avoid this bug and get the directories created/restored. I just cannot figure out how it keeps showing cache hit even when I remove the cache folders. The only explanation could be that somehow, once it is enabled, it never goes back to being disabled.

Sheraff commented 6 months ago

If it's helpful, I think I have a pretty consistent repro here: https://github.com/Sheraff/root (commit e12c6d908375cccb8344be5cf5e066604cbeba37 at the time of writing). It's not a minimal reproduction, but the repo is pretty bare.

The reproduction

I've run these steps a bunch of times; they seem to reproduce the issue 100% of the time.

Steps to reproduce:

pnpm build --force # make sure it's executed in full at least once
# delete ./dist/client *from VSCode*
pnpm build # restores ./dist/client correctly
# delete ./dist *from VSCode*
pnpm build # doesn't restore anything

Interestingly, I do not reproduce the issue when doing the same thing using rm -rf instead of deleting through VSCode:

pnpm build --force # make sure it's executed in full at least once
rm -rf ./dist/client
pnpm build # restores ./dist/client correctly
rm -rf ./dist
pnpm build # restores ./dist correctly

Using pnpm rimraf instead of rm -rf, to match @LukeTillman's setup, doesn't seem to reproduce the issue in my case (I tried with rimraf 4.4.0, as in his repro, and 5.0.5, the latest).


node -v # v20.11.0
pnpm -v # 8.15.2
pnpm turbo --version # 1.12.4

macOS: 14.3.1 (23D60)
VS Code: 1.86.2