oxc-project / oxc

⚓ A collection of JavaScript tools written in Rust.
https://oxc.rs
MIT License
11.33k stars 408 forks

Coworkers report that lint --fix will sometimes delete / empty / truncate a file. #6061

Open Raynos opened 1 week ago

Raynos commented 1 week ago

One of the TypeScript files gets replaced with an empty file and all of its content is deleted. This is confusing, and it's unclear which oxlint rule is doing this.

DonIsaac commented 1 week ago

Hi Raynos, thanks for the report. Do you think you could provide me some more details to reproduce?

  1. What rules do you have enabled?
  2. What exact command are you running?
  3. Can you provide a minimal reproduction of the file?

Thanks!

Raynos commented 1 week ago
oxlint -c=./.oxlintrc.json --tsconfig=./tsconfig.json . -D correctness -D perf -D suspicious --promise-plugin --import-plugin --fix
{
  "settings": {},
  "rules": {
    "no-unused-vars": "allow",
    "no-new-array": "allow",
    "no-empty-file": "allow",
    "no-document-cookie": "allow",
    "no-this-alias": [
      "deny",
      {
        "allow_names": ["self"]
      }
    ],
    "no-await-in-loop": "allow",
    "react-in-jsx-scope": "allow",
    "consistent-function-scoping": "allow",
    "no-async-endpoint-handlers": "allow",
    "no-new": "allow",
    "no-extraneous-class": "allow"
  }
}

I have not been able to reproduce this locally, but I've asked my coworkers for a reproduction use case.

DonIsaac commented 1 week ago

@Raynos thanks for the details. I will start debugging this once I have a file I can reproduce on.

Raynos commented 1 week ago

One file that disappeared is workspaces/metrics-exporter/src/metadataQueue.ts, even though I hadn't touched anything in that workspace.

A coworker mentioned that a file was truncated in a directory he had not touched on his branch.

Raynos commented 1 week ago

and it's flaky and hard to reproduce.

Raynos commented 1 week ago

Happened to me too; it randomly deleted

workspaces/pipeline/src/memo/ram-memo.ts

Raynos commented 1 week ago

I'm running it in a loop and it deleted 5/6 random files :(

Raynos commented 1 week ago

https://github.com/PabloSzx/graphql-ez/tree/jake/reproduction

And

$ while true ; do npm run lint:fix ; done

Will reproduce it; you should run pnpm install in this repo, as it uses pnpm, not npm.

DonIsaac commented 1 week ago

@Raynos what OS are you and your colleagues using?

DonIsaac commented 1 week ago

...and is this how you guys run oxlint??

DonIsaac commented 1 week ago

TL;DR

This was weird. I think I found a way to make this stop happening, but I could not figure out or solve the underlying issue. Either way, I'll have a PR out after I finish writing this out.

Longer Version

So this bug was weird. My initial thought was "oh, maybe a deletion fix is getting applied to the whole file, or there's a bug in the fix merging or something". So I added an assertion after the new source file has been built.

// oxc_linter/src/service.rs line 268
let fix_result = Fixer::new(source_text, messages).fix();
assert!(!fix_result.fixed_code.is_empty());

The assertion never panicked, and files kept getting deleted. What gives?

My next thought was that maybe something wasn't getting flushed somewhere. So I replaced the line that writes the fixed source

// before
fs::write(path, fix_result.fixed_code.as_bytes()).unwrap();

with something that forces a blocking write-through to disk, using fsync(2) (see man 2 fsync):

// requires: use std::fs::File; use std::io::Write;
let mut file = File::create(path).unwrap();
file.write_all(fix_result.fixed_code.as_bytes()).unwrap();
file.flush().unwrap(); // for good measure
file.sync_all().unwrap(); // fsync(2): block until the data hits disk

When I tried that, oxlint went from 20ms to 200ms, but files still kept getting deleted. And here is where I got confused.

This led me down a rabbit hole of the weirdness and brokenness of close(2), especially in Rust, where a failed close is silently ignored (for example, when the call is interrupted by a signal while you're trying to close).

So what if we closed it ourselves?

// requires: use std::os::unix::io::IntoRawFd;
unsafe {
  let exit = libc::close(file.into_raw_fd());
  assert_eq!(exit, 0);
}

And you know what? THAT DIDN'T WORK EITHER! Can you tell this is driving me a little mad?

Finally, thankfully, I found a workable solution. It does not address any of the above-mentioned issues or weirdness, but it should work. Basically, the lint Runtime was always writing fixed content, even if no fixes had been applied: with --fix set, every file's content was rewritten whether or not any changes were actually made. Skipping the write when no fixes were applied seems to solve the problem.
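The workaround can be sketched roughly like this. Note that FixResult, its fixed flag, and write_fix_result are illustrative names for this sketch, not the actual oxc types or APIs:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Illustrative stand-in for the linter's fix result; `fixed` records
// whether any fix was actually applied. Not the real oxc type.
struct FixResult {
    fixed: bool,
    fixed_code: String,
}

// Sketch of the workaround: only touch the file on disk when the fixer
// actually changed something, so unchanged files are never rewritten.
// Returns whether a write happened.
fn write_fix_result(path: &Path, fix_result: &FixResult) -> io::Result<bool> {
    if !fix_result.fixed {
        // No fixes applied: leave the file alone entirely.
        return Ok(false);
    }
    fs::write(path, fix_result.fixed_code.as_bytes())?;
    Ok(true)
}
```

Sidestepping the write altogether means an unchanged file is never reopened with O_TRUNC, which is presumably why the truncation window disappears.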

Sorry for the long reply, that was 2 hours of my life and I had to get it off my chest.

Raynos commented 1 week ago

@Raynos what OS are you and your colleagues using?

Mostly macOS / MacBooks, maybe one Ubuntu Linux user.

We don't run oxlint in a loop, but yes, the npm run lint:fix command is what we are running.

Raynos commented 1 week ago

I used the loop example to eventually reproduce the flaky behavior of file truncation.

Raynos commented 1 week ago

So I'm curious what happens if you do apply the fix to all files, but at the end of the program, instead of exiting the CLI, you sleep for 10 milliseconds.

Would that stop the behavior of an empty buffer being written to the files?

Like, is the problem here that writing the same content to the file, plus flaky close behavior, plus an efficient binary that writes to all the files at the end of the program, writes to stdout, and exits the process cleanly, leaves some file descriptors in an unflushed / bad state?

Do you think your fix of not writing to the file if no fixes were applied will take this bug's frequency from "1 in 10 runs" to "1 in 10,000 runs"?

I've never seen file writing operations get corrupted like this before; it seems like such a trustworthy, battle-hardened piece of the OS...

Raynos commented 1 week ago

Have you looked at whether the issue is the non-atomic nature of fs::write? https://github.com/rust-lang/rust/issues/82590

Could it be that multiple rules are trying to concurrently fix a file and write the original content back to the file, and that causes the truncation?

Would replacing the fs::write with a write to a temp file followed by an atomic file rename make the problem go away?

Edit: oh ffs, file renames are only atomic on Linux, not on macOS.
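The temp-file-plus-rename idea could be sketched like this. The sibling ".tmp" naming is a placeholder (a real implementation would need a collision-safe temp name), and whether this avoids the bug on every platform is exactly the open question in this thread:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Write the new contents to a temp file first, then rename it over the
// destination. On POSIX filesystems, rename(2) replaces the target in a
// single step, so a reader never observes an empty or half-written file.
fn write_atomic(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    // Hypothetical temp naming scheme: a sibling file with a ".tmp" extension.
    let tmp = path.with_extension("tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(contents)?;
    // Push the data to disk before the rename makes it visible.
    file.sync_all()?;
    fs::rename(&tmp, path)
}
```

The key property is that the destination path always points at either the old complete file or the new complete file, never at a truncated one.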

DonIsaac commented 1 week ago

So I'm curious what happens if you do apply the fix to all files, but at the end of the program, instead of exiting the CLI, you sleep for 10 milliseconds.

Would that stop the behavior of an empty buffer being written to the files?

Like, is the problem here that writing the same content to the file, plus flaky close behavior, plus an efficient binary that writes to all the files at the end of the program, writes to stdout, and exits the process cleanly, leaves some file descriptors in an unflushed / bad state?

Do you think your fix of not writing to the file if no fixes were applied will take this bug's frequency from "1 in 10 runs" to "1 in 10,000 runs"?

I've never seen file writing operations get corrupted like this before; it seems like such a trustworthy, battle-hardened piece of the OS...

I added a sleep 1 command inside the loop and saw this:

  1. With sync_all, no corruption occurred.
  2. Without sync_all, some corruption still occurred, but less.

Unfortunately the cost of syncing, at roughly a 10x runtime penalty, is simply too high. Note that syncing without a sleep still causes corruption.

Boshen commented 1 week ago

Even though this is fixed, we need an integration test to catch future regressions.

DonIsaac commented 1 week ago

I have no idea how to reproduce it consistently in an integration test.

Raynos commented 5 days ago

If you want a slow integration test, you can spawn oxlint as a child process in a loop with a timeout of 5s and assert that no files were truncated.

You can try reverting the fix locally and see whether 5s of looping is enough to reproduce it ±50% of the time.
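The suggested regression test could be sketched like this. The command name, fixture paths, and time budget are all parameters chosen by the caller; nothing here is the actual oxc test harness:

```rust
use std::path::PathBuf;
use std::process::Command;
use std::time::{Duration, Instant};

// Run the lint command in a loop until the time budget expires, checking
// after every run that none of the fixture files has been truncated to
// zero bytes. Returns false as soon as a truncated file is observed.
fn lint_loop_survives(cmd: &str, args: &[&str], fixtures: &[PathBuf], secs: u64) -> bool {
    let deadline = Instant::now() + Duration::from_secs(secs);
    while Instant::now() < deadline {
        let status = Command::new(cmd)
            .args(args)
            .status()
            .expect("failed to spawn lint command");
        assert!(status.success(), "lint run exited with an error");
        for path in fixtures {
            let len = std::fs::metadata(path).expect("failed to stat fixture").len();
            if len == 0 {
                return false; // the bug reproduced: a file was emptied
            }
        }
    }
    true // no truncation observed within the time budget
}
```

As noted below, a test like this is inherently probabilistic: passing only means the bug did not reproduce within the budget, not that it is gone.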

Raynos commented 5 days ago

For flaky behavior it's basically impossible to write a consistent integration test; you're just writing a test that will hopefully catch regressions on every second commit, on average.

DonIsaac commented 2 days ago

@Boshen, do you think writing a 5s loop regression test is worth it?

Boshen commented 1 day ago

@Boshen, do you think writing a 5s loop regression test is worth it?

Probably. Leave this to me; I'll think about this issue.