mafredri / zsh-async

Because your terminal should be able to perform tasks asynchronously without external tools!
MIT License
771 stars 35 forks source link

Callback function not executed on first jobs #40

Open mjrafferty opened 4 years ago

mjrafferty commented 4 years ago

First I want to thank you for sharing this project, it's a great time saver for creating tools that run asynchronous jobs. I'm using this library in a prompt framework I created:

https://github.com/mjrafferty/apollo-zsh-theme

I'm seeing an issue (though only on my raspberry pi) after the changes in #37 where the callback function isn't being executed on the first prompt load after starting the async worker. Rolling back to ASYNC_VERSION 1.7.2 resolves the issue. I added some debug logging to the _async_job function, and the jobs are still being run and returning good output, but the callback function isn't getting run. The next time the async jobs are run the callback function returns output just fine.

Since I'm only able to reproduce this on a raspberry pi, my guess is something is just lagging a little behind in registering the callback, but adding various sleeps to the code didn't seem to resolve the issue. Are you able to shed some light on how the changes made in #37 might impact that?

mafredri commented 4 years ago

Hi @mjrafferty, thanks for reporting.

Could you explain how async is being used in the apollo theme? I took a quick look at the code but things are split up and I have a hard time building a mental model of what order things are happening in.

Generally I'd recommend registering the callback as soon as the async worker is started, that way you shouldn't miss out on any messages. Also, with notify mode (-n) and a registered callback, there's no need to manually invoke async_process_results, I saw it being used here:

https://github.com/mjrafferty/apollo-zsh-theme/blob/bf8e12251df149bb10f3962b09d40bdcc471f2b9/functions/__apollo_precmd#L29-L30

mafredri commented 4 years ago

Btw, what version of Zsh are you using on the rpi?

mjrafferty commented 4 years ago

I'm using zsh 5.8 on the pi, though It works just fine on my mac which is also using 5.8. The async_process_results call you see there is likely mostly pointless. I start the async jobs prior to that during the __apollo_run_prompt function a few lines above that. At that point in the code I don't have a callback function set, as that gets unregistered in preexec and then registered again right there in precmd. The intended purpose of the async_process_results call there is to process anything that might be finished before drawing the prompt and without interrupting precmd execution before it's ready for the output.

The async library is loaded and the worker started here:

https://github.com/mjrafferty/apollo-zsh-theme/blob/bf8e12251df149bb10f3962b09d40bdcc471f2b9/functions/__apollo_load_utilities#L112

And the jobs are initiated during precmd before registereing the callback function here:

https://github.com/mjrafferty/apollo-zsh-theme/blob/bf8e12251df149bb10f3962b09d40bdcc471f2b9/functions/__apollo_get_module_text#L42

mafredri commented 4 years ago

Hmm, that's interesting.

I have two suggestions:

  1. Could you test if it's reproducible with only https://github.com/mafredri/zsh-async/commit/32548d3c3f1361de57f09ab9293c902b78f49b55 which excludes the two latter commits that are part of 1.8.0?
  2. On 1.8.0 would the following help:

Adding:

async_process_results $worker $ASYNC_CALLBACKS[$worker] watcher

After this if-statement: https://github.com/mafredri/zsh-async/blob/a9ca05110910bdc738f6c88d6fbe2d1043c31490/async.zsh#L463-L473

I'll try to look into this some more but in the mean time these could provide some clues.

mjrafferty commented 4 years ago

It is still present on 32548d3, and unfortunately that async_process_results call didn't have an impact.

I'm testing specifically with a call that asynchronously makes a curl request to icanhazip.com, and I did however find out that the async_process_results in my code is usually what catches that specific output even though I mentioned before that I thought it was probably useless. Though when testing this with 1.7.2 I can just as easily remove that line and the callback function will instead update my prompt as expected. With 1.8.0 it seems neither of those work on the first run.

I added debug code to dump this to a log file:

https://github.com/mafredri/zsh-async/blob/a9ca05110910bdc738f6c88d6fbe2d1043c31490/async.zsh#L60

I then added a second line to send output to that same log file when the async_process_results function is run. Doing this I was able to confirm that the async job was finished prior to the call to get the results, yet no data was available. I added logging here to confirm that:

https://github.com/mafredri/zsh-async/blob/a9ca05110910bdc738f6c88d6fbe2d1043c31490/async.zsh#L294-L299

The while loop executes once, but the $data variable is empty despite the async_job being finished. So it looks like this isn't an issue with executing the callback function as I initially thought, but rather the first command isn't sending back any output from the worker for some reason even though it ran and produced output.

mjrafferty commented 4 years ago

I did a bit more testing on this by disabling the async_process_results and async_register_callback calls in my code. I then checked the output of the job manually using zpty -r apollo_async_helper and it looks like it is getting partial data. Here is the output:

\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Dload\ \ Upload\ \ \ Total\ \ \ Spent\ \ \ \ Left\ \ Speed$'\n'$'\r'\ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ \ \ 0\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ \ \ 0$'\r'\ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ \ \ 0\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ \ \ 0$'\r'100\ \ \ \ 15\ \ 100\ \ \ \ 15\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 192\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ 220

That looks to be a part of the stderr output from the curl request, but all of the data that should have been before it is missing. Here is the comparison from the second time the job runs that has all of the data:

__apollo_public_ip_async_runner 0 public_ip\\069.244.159.151\\0default 0.2871527672 \ \ %\ Total\ \ \ \ %\ Received\ %\ Xferd\ \ Average\ Speed\ \ \ Time\ \ \ \ Time\ \ \ \ \ Time\ \ Current$'\n'\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Dload\ \ Upload\ \ \ Total\ \ \ Spent\ \ \ \ Left\ \ Speed$'\n'$'\r'\ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ \ \ 0\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ \ \ 0$'\r'100\ \ \ \ 15\ \ 100\ \ \ \ 15\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 153\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ 272%

So it seems like something is either truncating the data during just after startup or it just isn't ready for writing yet. I was able to add a 1 second sleep in the async job to delay the output and it works correctly.

Would any of the recent code changes have slowed the startup time for the worker, or is there anything run that flushes the output after startup?

mafredri commented 4 years ago

Thanks for digging @mjrafferty, I can't seem to reproduce this so it helps.

There's only one change I can think of that would affect this behavior, I've reverted it in this branch if you'd like to test it: https://github.com/mafredri/zsh-async/tree/issue-40

Normally I would also have suspected async_job $worker _unset_trap which was removed in https://github.com/mafredri/zsh-async/commit/361dc171e65c82f57ad814ebecea91c98a6d4b68 but you said it happens even without that change.

We could also try to add a simple print statement here (prints a newline), or print -n $'\0' if that causes problems: https://github.com/mafredri/zsh-async/blob/a9ca05110910bdc738f6c88d6fbe2d1043c31490/async.zsh#L178

My thinking is that it might make a difference if we prime the zpty fd by outputting something from the worker. We shouldn't have to, but zpty's are a bit fidgety 😅.

mjrafferty commented 4 years ago

Unfortunately neither of those fixes the issue. I think it's kind of a perfect storm of timing. I think the zpty processes the job before it's even finished setting itself up. The fact that it's a slow raspberry pi combined with the worker inheriting enough extra junk from my prompt slows it down just enough to cause an issue, since even simplifying my prompt to just this one module fixes it. I found that instead of adding a print statement in the location you mentioned, adding a sleep 0.1 is enough to resolve the issue. I guess it slows down worker execution just enough for it to finish setting up.

In terms of the changes made here since 1.7.2, I reverted this line:

https://github.com/mafredri/zsh-async/blob/a9ca05110910bdc738f6c88d6fbe2d1043c31490/async.zsh#L409

to the following:

zpty -w $worker "$cmd"$'\0'

This also resolves the issue without the sleep being used. So this narrowed it down to the _async_send_job function itself. Removing the zpty -t block from the function resolves this as well. So it seems that being added is what caused the regression.

I don't particularly understand why that's causing the truncated output, but at least we know what change prompted the issue now.

mjrafferty commented 4 years ago

Unfortunately neither of those fixes the issue. I think it's kind of a perfect storm of timing. I think the zpty processes the job before it's even finished setting itself up. The fact that it's a slow raspberry pi combined with the worker inheriting enough extra junk from my prompt slows it down just enough to cause an issue, since even simplifying my prompt to just this one module fixes it. I found that instead of adding a print statement in the location you mentioned, adding a sleep 0.1 is enough to resolve the issue. I guess it slows down worker execution just enough for it to finish setting up.

In terms of the changes made here since 1.7.2, I reverted this line:

https://github.com/mafredri/zsh-async/blob/a9ca05110910bdc738f6c88d6fbe2d1043c31490/async.zsh#L409

to the following:

zpty -w $worker "$cmd"$'\0'

This also resolves the issue without the sleep being used. So this narrowed it down to the _async_send_job function itself. Removing the zpty -t block from the function resolves this as well. So it seems that being added is what caused the regression.

I don't particularly understand why that's causing the truncated output, but at least we know what change prompted the issue now.

mafredri commented 4 years ago

Interesting find! This is definitely not the first time we've had to deal with racy behavior of zptys and it's hard to figure out why they happen. The zpty implementation is kind of a hack, unfortunately.

I reverted the corruption check and pushed two more commits to the issue-40 branch. I wonder if they make any difference? I'm guessing they wont but one can hope.

There's also a test in the async test suite that can trigger racy behavior: https://github.com/mafredri/zsh-async/blob/a9ca05110910bdc738f6c88d6fbe2d1043c31490/async_test.zsh#L129-L152

When iter is bumped to 100 or more, commands start failing and sometimes the worker sees them entirely as missing or overlapping. Think print 80, print 81, prprint 83, etc. It's possible these two issues are related but I've spend a lot of time debugging that as well, to no avail :/.

mafredri commented 4 years ago

@mjrafferty I may have a fix for this in #45, if you still have a way to test this, it'd be very much appreciated.

What I'm thinking, specifically, could affect this is https://github.com/mafredri/zsh-async/pull/45/commits/dfca4ba065fc1a7c82699ba3552863eb56e98066. It was a fix to an observation I made where the zpty was being written to from both ends at the same time.

In essence, the first job completes (and prints) while the second job is being sent to the async worker. This resulted in the first job result being erased.

mjrafferty commented 4 years ago

Unfortunately the only device I had this issue on is no longer in use. Based on prior observations though that does sound promising. I can at least confirm that using #45 didn't seem to cause any regressions.