workers getting killed in 1.8.0+

lewis6991 commented 4 years ago

I've recently updated from 1.7.2 and I have found that https://github.com/mafredri/zsh-async/commit/361dc171e65c82f57ad814ebecea91c98a6d4b68 has caused a regression in my setup.

I use zsh-async to update my prompt with git info. Here's my implementation with only the relevant parts:

update_prompt() {
    cd $1
    rc=$2
    ~/.prompt $rc 0
}

refresh_prompt() {
    local output=$3
    local next_is_ready=$6

    # If there are multiple refreshes in flight then only use the latest one,
    # therefore we can ignore this output if the next is ready
    if [[ $next_is_ready == 1 ]]; then
        return
    fi

    PROMPT="$(echo $output)"
    zle reset-prompt
}

async_start_worker      gitprompt -n
async_register_callback gitprompt refresh_prompt

prompt_precmd() {
    rc=$?

    # Set initial prompt without scm info
    PROMPT=$(echo "$(~/.prompt zsh $rc 1)")

    async_flush_jobs gitprompt
    async_job gitprompt update_prompt "$(pwd)" "$rc"
}

add-zsh-hook precmd prompt_precmd

When I quickly refresh my prompt (pressing enter quickly in succession), it causes my gitprompt worker to be killed. I also get a zle error.

cem-dev:lewrus01:fancy-prompt[master+2+1]
❯

cem-dev:lewrus01:fancy-prompt[master+2+1]
❯
refresh_prompt:zle:12: widgets can only be called when ZLE is active

async_job: no such async worker: gitprompt

cem-dev:lewrus01:fancy-prompt[...]
❯
async_job: no such async worker: gitprompt

I got neither of these problems on 1.7.2. I guess the problem is something to do with using the zle watcher?

Any help on this would be greatly appreciated.

lewis6991 commented 4 years ago

I've been able to reduce this down a lot:

async_start_worker      gitprompt
async_register_callback gitprompt refresh_prompt_callback

refresh_prompt_callback() {
    local job=$1 err=$2

    case $job in
        \[async])
            # Async worker has crashed
            if (( err == 2 )) || (( err == 3 )) || (( err == 130 )); then
                echo "ERROR($err)"
            fi
            ;;
    esac
}

Running this in an interactive shell and then quickly refreshing the prompt causes the worker to be killed. Any tips on how this could be debugged further?

mafredri commented 4 years ago

Thanks for reporting.

I find it strange that https://github.com/mafredri/zsh-async/commit/361dc171e65c82f57ad814ebecea91c98a6d4b68 is the root cause. It's a commit that doesn't really touch on any worker logic, only modifies the startup procedure a bit. I'm guessing you tried commits before and after to determine that this is where it started?

On what system and version of Zsh are you running into these issues?

If I was to venture a guess, I'd think async_flush_jobs gitprompt was the culprit (that crashes the worker). Can you reproduce the issue with flush jobs commented out?

Also, are you receiving any async error messages (i.e. the ones named [async])? If yes, what do they say?

refresh_prompt:zle:12: widgets can only be called when ZLE is active

This error should be avoidable by first checking that ZLE is active, zle && zle reset-prompt.
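As a minimal sketch of that guard, here is a trimmed version of the refresh_prompt callback from the first comment. It relies on the documented behaviour that `zle` called with no arguments only sets its exit status (zero while the line editor is active). The next_is_ready check is left out for brevity.

```shell
# Sketch of a zsh-async callback; $3 is the job's stdout, as in the
# original snippet.
refresh_prompt() {
    local output="$3"

    PROMPT="$output"

    # With no arguments, `zle` only reports through its exit status whether
    # the line editor is active, so the widget is invoked only when it is
    # safe to do so.
    zle && zle reset-prompt
}
```

When ZLE is inactive the callback still updates PROMPT but skips the redraw, so the new prompt shows up the next time the prompt is drawn.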

lewis6991 commented 4 years ago

I find it strange that 361dc17 is the root cause. It's a commit that doesn't really touch on any worker logic, only modifies the startup procedure a bit. I'm guessing you tried commits before and after to determine that this is where it started?

I tried quite a few commits but I could have made some mistakes in my testing. All I can say for sure is that 1.7.2 didn't exhibit any issues and 1.8.0 does.

On what system and version of Zsh are you running into these issues?

I'm running on RHEL 7 using a Linuxbrew build of zsh 5.8.

If I was to venture a guess, I'd think async_flush_jobs gitprompt was the culprit (that crashes the worker). Can you reproduce the issue with flush jobs commented out?

This was my first guess too, but it didn't seem to make any difference.

Also, are you receiving any async error messages (i.e. the ones named [async])? If yes, what do they say?

Using my second code snippet the async error code is 2 so something is going on with ZLE.

This error should be avoidable by first checking that ZLE is active, zle && zle reset-prompt.

I saw this in the pure.zsh code. Whilst it will work around the error, I'm really curious as to why I'm seeing this now and not before.

For now I have been able to work around the worker being killed by restarting it every time in precmd; this seems to give reliable behavior no matter what I do in the prompt. Ideally I would like to understand why I'm getting error code 2 when I refresh the prompt too quickly. Is it likely to be something inherent to ZLE?
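A rough sketch of what such a restart-in-precmd workaround could look like, reusing the worker and function names from the first snippet. The exact code used is not shown in this thread, so the ordering of calls and the decision to discard errors from async_stop_worker are assumptions.

```shell
prompt_precmd() {
    local rc=$?

    # Assumption: stopping a worker that has already died is harmless, so
    # any error output is discarded. A fresh worker has no queued jobs,
    # which is why async_flush_jobs is not called here.
    async_stop_worker gitprompt 2>/dev/null
    async_start_worker gitprompt -n
    async_register_callback gitprompt refresh_prompt

    async_job gitprompt update_prompt "$PWD" "$rc"
}
```

This trades a worker start on every prompt for not having to detect the failure at all.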

Thanks

mafredri commented 4 years ago

Using my second code snippet the async error code is 2 so something is going on with ZLE.

This error originates from: https://github.com/mafredri/zsh-async/blob/490167c4aa5a870b3da1458859bcf3a9d1e24f97/async.zsh#L360. I have never actually run into this error myself, so I wonder what the actual ZLE error code is. Could you also try to log and share the error message that is sent (stderr for the callback)?

I'm also curious to know if you are reproducing the errors with the minimal configs you posted, exactly as-is? I.e. no other zsh plugins, settings, etc. And if so, could it be something ~/.prompt does?

lewis6991 commented 4 years ago

_async_zle_watcher:17: error: fd for gitprompt failed: zle -F 15 returned error hup

I also reduced down my .zshrc and found the plugin marlonrichert/zsh-autocomplete appears to cause the errors (in the sense the errors don't appear when that plugin is unloaded).

lewis6991 commented 4 years ago

This consistently produces the error for me:

async_start_worker      gitprompt
async_register_callback gitprompt refresh_prompt_callback

foo()  {
    A="a b c d"
    vared A
}

zpty testpty foo
zpty -d testpty

refresh_prompt_callback() {
    local job=$1 err=$2

    case $job in
        \[async])
            # Async worker has crashed
            if (( err == 2 )) || (( err == 3 )) || (( err == 130 )); then
                echo "ERROR($err): $5"
            fi
            ;;
    esac
}

mafredri commented 4 years ago

OK, this definitely looks like an issue I've been combating since day one. Zpty destroy signals (i.e. HUP) are propagated to all zptys that were created before the one being destroyed.

Most likely this change is the root cause of your current issues: https://github.com/mafredri/zsh-async/commit/32548d3c3f1361de57f09ab9293c902b78f49b55#diff-c7f89cff42efffc19f69071441a12a1cR86-R90

It should've been fixed by this commit: https://github.com/zsh-users/zsh/commit/caddeca1ac638137b26735fc8c63d08c83be6a90. But alas, we may have to revert the TRAPHUP change from above.

howardjohn commented 4 years ago

Are there any workarounds for this? I run into this many times a day, which is a bit of a pain (but still better than not using zsh-async!).

reobin commented 4 years ago

@howardjohn If it's any help, the only thing I was able to do was reinitialize the workers when they die.

The second argument that is given to the callback function is the return code.

Docs on all the return codes:

1 Corrupt worker output.
2 ZLE watcher detected an error on the worker fd.
3 Response from async_job when worker is missing.
130 Async worker crashed. This should not happen, but it can mean the file descriptor has become corrupt. This must be followed by an async_stop_worker [name], and then the worker and tasks should be restarted. It is unknown why this happens.
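The codes above can be folded into a small predicate. The sketch below uses a hypothetical helper name, is_worker_failure, which is not part of zsh-async. It also assumes, based on the `[async]` case in the reduced snippet earlier in this thread, that worker-level errors arrive with `[async]` as the job name, so that an ordinary job exiting with status 2 or 3 is not mistaken for a dead worker.

```shell
# Hypothetical helper (not part of zsh-async): succeeds only when the
# callback arguments describe a worker-level failure that calls for a restart.
is_worker_failure() {
    local job="$1" code="$2"

    # Worker-level errors are reported under the job name "[async]".
    [ "$job" = "[async]" ] || return 1

    case $code in
        2|3|130) return 0 ;;  # ZLE watcher error, missing worker, worker crash
        *)       return 1 ;;  # anything else, including 1 (corrupt output)
    esac
}
```

A callback could then start with `if is_worker_failure "$1" "$2"; then ...; fi` before handling normal job output.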

By just checking for this return code in the callback, you can reinitialize your workers when needed. I haven't had a problem in months with typewritten since I implemented that check.

Example of callback function checking for the return code:

tw_prompt_callback() {
  local tw_name=$1 tw_code=$2 tw_output=$3

  # Check for return codes indicating an error
  if (( tw_code == 2 )) || (( tw_code == 3 )) || (( tw_code == 130 )); then
    # reinit async workers
    async_stop_worker tw_worker # stop the current worker
    tw_async_init_worker # Init the worker again, and register the callback, see below
    tw_async_init_tasks # Init all the tasks
  elif (( tw_code )); then
    # return code is non-zero, reinit all tasks
    tw_async_init_tasks
  fi
  ...
}

# For reference purpose
tw_async_init_worker() {
  async_start_worker tw_worker -n
  async_register_callback tw_worker tw_prompt_callback
}

mafredri commented 4 years ago

@howardjohn I'm working on some improvements, but I can't say for sure if they will help.

First off, how are you using zsh-async? I've never been able to reproduce constant worker death but I know some scenarios that can cause it. For instance, sending hundreds of jobs to the worker in quick succession.

Edit: And what version of zsh are you using?

It'd be great if you could try out #45, then maybe #49. And finally there's a pretty huge rewrite going on in the (very WIP) test-rebased branch (based off the mentioned PRs). It's possibly the best bet at fixing worker death but will require a lot more testing and fine-tuning.

@reobin it's not ideal, but indeed the best solution for current master branch, thanks for suggesting it!

howardjohn commented 4 years ago

@reobin thanks! That is essentially what I have been doing manually. I ran it for a few hours and it seems great.

I use it for my prompt, so it gets a decent number of jobs (every time I hit enter), but it shouldn't be more than a couple per second.

I also haven't reproduced it consistently, so it's hard to quickly test out changes, but I can throw them in my shell for a while and see what happens.

$ zsh --version
zsh 5.8 (x86_64-debian-linux-gnu)

howardjohn commented 4 years ago

After a couple of days of testing, the restart jobs workaround did not work (extremely likely I just set it up wrong), and https://github.com/mafredri/zsh-async/pull/45 also did not.

Will try #49 now

mafredri commented 4 years ago

@howardjohn Thanks for testing, and too bad about the workaround. If it's any help, here's how we set the worker restart up in Pure: https://github.com/sindresorhus/pure/blob/dfc8062c64df8821eaec7d741c75f3cee20d37e3/pure.zsh#L478-L495

howardjohn commented 4 years ago

As expected, I had the workaround messed up: a simple typo :woman_facepalming:. I verified the workaround does work, and added some logging for when it occurs, so I can see it has transparently happened a couple of times.

Unfortunately, #49 did not seem to help much during my testing.

howardjohn commented 3 years ago

Update: 3 months later, the workaround to reset works great.

Not sure if this was known or not, but this can easily reproduce at least one occurrence of this (I have it print that it is restarting when it is triggered):

$ `exit`
restarting async. code=3
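For readers who want the same kind of log line, a restart helper along these lines would print it. This is only a sketch: the configuration behind the output above is not shown in the thread, and the names restart_async_worker, prompt_worker and prompt_callback are placeholders.

```shell
# Hypothetical restart helper; $1 is the return code passed to the callback.
restart_async_worker() {
    local code="$1"

    # Same message format as the log line shown above.
    echo "restarting async. code=$code"

    # Assumption: errors from stopping an already-dead worker can be ignored.
    async_stop_worker prompt_worker 2>/dev/null
    async_start_worker prompt_worker -n
    async_register_callback prompt_worker prompt_callback
}
```

It would be called from the callback's error branch, for example `restart_async_worker "$2"`.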

mafredri / zsh-async

workers getting killed in 1.8.0+ #42