Segfault in nimlangserver when `-d:nimOptIter` is defined

jmgomez commented 2 months ago

Description

Download and build latest https://github.com/nim-lang/langserver Remove the lock file and the requires from the nimlangserver.nimble so it uses the latest Nim to build. Run nimble build Open a nim project with the vscode extension (make sure the built nimlangserver is the one used)

It will crash with the following (you can see the output in the Nim Language Server tab or Nim Lsp if you are running a newer version)

Works fine with 2.0.8 and with 2.0.9

Traceback (most recent call last)
/Volumes/Store/Nim/lib/system/deepcopy.nim(190) genericDeepCopy
/Volumes/Store/Nim/lib/system/deepcopy.nim(132) genericDeepCopyAux
/Volumes/Store/Nim/lib/system/deepcopy.nim(72) genericDeepCopyAux
/Volumes/Store/Nim/lib/system/deepcopy.nim(68) genericDeepCopyAux
/Volumes/Store/Nim/lib/system/deepcopy.nim(153) genericDeepCopyAux
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

Nim Version

Nim Compiler Version 2.1.99 [MacOSX: arm64]
Compiled at 2024-09-10
Copyright (c) 2006-2024 by Andreas Rumpf

git hash: baec1955b5453ec71fc12355142dd9a813fa02fb

Current Output

No response

Expected Output

No response

Known Workarounds

No response

Additional Information

No response

metagn commented 2 months ago

When trying to reproduce I didn't get a stacktrace, only "out of memory".

jmgomez commented 2 months ago

hmm In what OS are you testing? Do you get that same issue in version-2.0 too?

There is proc raiseOutOfMem() {.noinline.} = maybe you could add a stacktrace there

metagn commented 2 months ago

hmm In what OS are you testing? Do you get that same issue in version-2.0 too?

Windows; no; with refc/arc it gives the segfault exit code (as opposed to orc). The above stacktrace should only be reachable in refc from my understanding since usrToCell is not declared otherwise.

I'm out of my depth here, I thought maybe a type graph change or cycle was the cause, I also don't know where a deep copy could be happening

Araq commented 2 months ago

@ringabout @metagn Please bisect it.

jmgomez commented 2 months ago

Windows; no; with refc/arc it gives the segfault exit code (as opposed to orc). The above stacktrace should only be reachable in refc from my understanding since usrToCell is not declared otherwise.

I got the stacktrace in MacOs. Yes, refc is used

metagn commented 2 months ago

Currently got as far as:

started with good 185e06c92362083c06c76f87e325889b1c9dc659, bad a93c5d79b95b569711478b81d2ae9d19675fc4d8 (1 commit good after disabling effect errors) good 320311182 after disabling effect errors good c23d6a3cb good f12683873

metagn commented 2 months ago

Bisected to 05df263b84de9008266b3d53e2c28b009890ca61 (commit history was messed up the first time), @yglukhov any idea what the problem could be?

metagn commented 2 months ago

The type that deepcopy is iterating over during the segfault is tyRef. #23787 changes the creation of the environment ref type here. This might not have anything to do with it, but I tried to stress the added condition by mixing proc/iterator closures in the same owner context. I noticed this behavior difference:

proc foo() =
  let x = 123
  iterator bar2(): int {.closure.} =
    yield x
  proc bar() =
    let z = bar2
    for y in z(): # just doing bar2() gives param not in env: x
      echo y
  bar()

foo()

The output of this is now 1 when it used to be 0. Both are invalid, it should be 123. This is the only difference I've found that gives a runtime difference versus failing to compile. There is no problem when a closure iterator calls a closure proc, only when a closure proc calls a closure iterator.

Changing foo to be a closure iterator crashes the compiler with all permutations now, but is different before #23787. This used to give a codegen error, now segfaults when getEnvParam calls getEnvTypeForOwner:

iterator foo(): int {.closure.} =
  let x = 123
  iterator bar2(): int {.closure.} =
    yield x
  proc bar() =
    let z = bar2
    for y in z():
      echo y
  bar()
  yield x

for x in foo(): echo x

And this used to work and give the correct output but now also segfaults:

iterator foo(): int {.closure.} =
  let x = 123
  proc bar() =
    echo x
  iterator bar2(): int {.closure.} =
    bar()
    yield x
  for y in bar2():
    yield y

for x in foo(): echo x

I wouldn't even know where to start to try and minimize the segfault in nimlangserver, but the first closure proc/iterator mixing example seems realistic to encounter in async code. It was broken before too but maybe it zeroing made it more managable.

yglukhov commented 1 month ago

The issue is in copy codegen that violates memory, which was present before #23787, but was revealed likely because the memory layout changed. If i take nim version 051a53627 (just before my changes), and compile the following

proc foo() =
  let x = 123
  iterator bar2(): int {.closure.} =
    yield x
  proc bar() =
    let z = bar2
    for y in z(): # just doing bar2() gives param not in env: x
      echo y
  bar()

foo()

with gcc, I'm getting codegen error:

~/.cache/nim/test2_d/@mtest2.nim.c:330:70: error: passing argument 2 of ‘eqcopy___test50_u113’ from incompatible pointer type [-Wincompatible-pointer-types]
  330 |         nimln_(79);     eqcopy___test50_u113(&(*colonenv_).colonup_, colonenv_);

Take a careful look at this line:

 eqcopy___test50_u113(&(*colonenv_).colonup_, colonenv_);

It tries to copy colonenv_ into a field of itself, which is obviously wrong.

Unfortunately I'm not quite familiar with this logic, can anyone provide any input?

Araq commented 1 month ago

It's not "obviously wrong", closure environments can be self-referencing which is why you need a cycle collector for them. This is even more true for the so called "up" pointers that are so complex that I keep forgetting how they work.

yglukhov commented 1 month ago

Not pretending to understand, but my thinking was that "ups" always point to the parent closure env and "ups" can never be cyclic. Sure, cycles can form, just not through "ups". Am I wrong here? Also the pointer type mismatch kinda hints that this particular line is fishy at least?

Araq commented 1 month ago

Also the pointer type mismatch kinda hints that this particular line is fishy at least?

Sure, yes.

yglukhov commented 1 month ago

I've added a fix (#24316) for all the cases suggested by @metagn. @jmgomez Could you please verify? @metagn just to double check, were you able to reproduce the originally reported stack trace?

metagn commented 1 month ago

Langserver seems to have circumvented the original issue in the master branch. Unfortunately, for the commit https://github.com/nim-lang/langserver/commit/883935707f762ce135c27d40b4ef2baf9e85fd86 which I was testing on, #24316 does not fix it in my reproduction, the minimizations above were probably not correct.

Running nimble --nim:... test in langserver also reproduces the issue in tsuggestapi, VS Code isn't needed.

It crashes right before this call which was removed in this commit. The full log from langserver in VS Code (before restarts) is:

DBG Starting nimlangserver                     params="(clientProcessId: none(int), transport: some(stdio), port: 0)"
DBG Starting stdio server                     
DBG [Processsing Message]                      request="\"initialize\""
DBG Initialize received...                    
DBG Registering monitor for process            pid=25960
DBG Initialize completed. Trying to start nimsuggest instances
DBG [Processsing Message]                      request="\"initialized\""
DBG Client initialized.                       
DBG Requesting configuration from the client  
DBG [Processsing Message]                      request="\"textDocument/didOpen\""
DBG New document opened for URI:               uri=file:///test.nim
DBG Auto-guessing project file for             file="test.nim"
DBG getProjectFile                             project="test.nim" fileUri="test.nim"
DBG Document associated with the following projectFile uri=file:///test.nim projectFile="test.nim"
DBG Will create nimsuggest for this file       uri=file:///test.nim
DBG ShowMessage                                message="Using Nim Compiler Version 2.2.0 [Windows: amd64] from nim-2.2.0/bin"
INF Starting nimsuggest                        root="test.nim" timeout=120000 path="nim-2.2.0/bin/nimsuggest" workingDir="..."
DBG Parsing nimsuggest capability              capability=con
DBG Parsing nimsuggest capability              capability=exceptionInlayHints
DBG Parsing nimsuggest capability              capability=unknownFile
DBG Nimsuggest Capabilities                    capabilities="{con, exceptionInlayHints, unknownFile}"
out of memory
[Error - 3:52:39 PM] Server process exited with code 1.
[Info  - 3:52:39 PM] Connection to server got closed. Server will restart.
true

It seems like neither thread created above it actually starts, and not creating the threads does not cause the issue. Just creating a thread that takes a tuple[] argument and does nothing also causes the crash.

yglukhov commented 1 month ago

Ok, finally managed to reproduce, thanks @metagn.

Now, TLDR. This issue should be fixed not by reverting my pr, but in any other suitable way!))

Long description: There are a few problems I'm seeing here.

Nim's threading api requires the thread object to exist while the thread is running. This was indeed the case before my optimization, but now - not so, because the thread object disappears right before the await after createThread. To put it bluntly, the threading API has been misused. Whether this api is good is a question for another issue.
Nim performs the deepCopy of thread arguments inside the thread, which is wrong, because the arguments can mutate by this time. Instead the deepCopy should be performed before the thread is started. In this particular case this issue is not as relevant, but again, a nice question for another issue.

The statements above can be easily backed by one of the following experiments:

Mark thread objects with {.liftLocals.} pragma, which puts them into the async function env. Note that this pragma should not be used in production, as it is not documented, and it will most likely be removed in the future, but it's perfect for the demo)
Insert a sleep(100) after all the createThreads. It gives the threads enough time to perform the aforementioned deepCopy.

Finally I suggest closing this issue for the reason of misused threading api.

metagn commented 1 month ago

Sounds reasonable, in the worst case maybe a changelog entry for the new behavior and disclaimer.

Nim's threading api requires the thread object to exist while the thread is running.

It uses a potentially escaping address for this. If we don't change this it should at least still be documented in createThread, something like "the address of the thread object should be available throughout the thread's lifetime".

We can track this in this issue too since this is the root problem and the iterator changes were only tangential, but closing it is fine too.

nim-lang / Nim