rapidsai / ucx-py

Python bindings for UCX
https://ucx-py.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Switch pytest traceback to `native` #1075

Closed: galipremsagar closed this 2 weeks ago

galipremsagar commented 1 month ago

Description

In cudf and cuml we have observed a ~10% and ~20% speedup, respectively, of pytest suite execution by switching the pytest traceback to --tb=native:

Currently (--tb=auto, the default):

102474 passed, 2117 skipped, 902 xfailed in 892.16s (0:14:52)

--tb=short:

102474 passed, 2117 skipped, 902 xfailed in 898.99s (0:14:58)

--tb=no:

102474 passed, 2117 skipped, 902 xfailed in 815.98s (0:13:35)

--tb=native:

102474 passed, 2117 skipped, 902 xfailed in 820.92s (0:13:40)

This PR makes a similar change in the ucx-py repo (sketched below).

xref: https://github.com/rapidsai/cudf/pull/16851
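For reference, the change typically amounts to adding the flag to pytest's addopts so that every CI and local run picks it up. A minimal sketch, assuming the configuration lives in pyproject.toml (the actual file and section touched by this PR may differ):

# Hypothetical location; ucx-py may keep its pytest settings elsewhere.
[tool.pytest.ini_options]
addopts = "--tb=native"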

vyasr commented 1 month ago

AFAICT there are no xfailed tests in this repo, so I don't think the results are meaningful. I would suggest rerunning a few times. These results seem well within the noise threshold for this repository's test suite. You looked at #1070 and #1069, but consider the one before that, #1067. In that case the times are 7:28, 8:26, and 5:44.

pentschev commented 1 month ago

AFAICT there are no xfailed tests in this repo, so I don't think the results are meaningful. I would suggest rerunning a few times. These results seem well within the noise threshold for this repository's test suite.

So if the results are not meaningful because there are no xfailed tests, yet they seem worse, what do you suggest we do? So far, I'm not seeing any potential benefits here.

You looked at #1070 and #1069, but consider the one before that, #1067. In that case the times are 7:28, 8:26, and 5:44.

So are you saying those results are better or worse? If I sum this PR's results [1] and #1067's [2], this PR is still doing worse:

In [1]: 7 * 60 + 8 + 7 * 60 + 23 + 8 * 60 + 4  # this PR: 7:08 + 7:23 + 8:04
Out[1]: 1355

In [2]: 7 * 60 + 28 + 8 * 60 + 26 + 5 * 60 + 44  # 1067: 7:28 + 8:26 + 5:44
Out[2]: 1298

I forgot to link my Dask-CUDA comment previously, and for completeness I'm also looking for an answer to the following:

Do you have more information on exactly how this compares to --tb=auto? From the docs:

pytest --tb=auto    # (default) 'long' tracebacks for the first and last
                     # entry, but 'short' style for the other entries
pytest --tb=long    # exhaustive, informative traceback formatting
pytest --tb=short   # shorter traceback format
pytest --tb=native  # Python standard library formatting

It seems that with auto we get an "exhaustive" traceback (not really sure what that means); is that the case with native too?

I agree the number of samples is low and rerunning is the obvious next step; I'll have to wait and see the results before I'm convinced. But if the results are not convincing and there's still potential for us to lose useful debugging information, then I don't really see a reason to make this change.

vyasr commented 1 month ago

So are you saying those results are better or worse?

I was trying to add an extra column to the table that you had. Based on the numbers in your table, it looked like you were looking at the time for the "Python tests" step in the conda-python-tests job, so for instance this 11.8 job that shows 7:08 is the top-left entry in your table, right? Assuming that is correct, the corresponding job in #1067 is this one, which has a run time of 7:28. The sum is probably worse because the 12.0.1 job seems to have run much faster. I'm just not convinced that any of these numbers are signal; they all seem within the noise threshold to me.

if the results are not convincing and there's still potential for us to lose useful debugging information, then I don't really see a reason to make this change.

IMHO we aren't losing useful debugging information. I don't find the long tracebacks particularly helpful or readable relative to the default. That was also the opinion expressed by a few others when discussing this change. Of course, not everyone will agree on something like that so I think it makes more sense to make the decision based on the concrete objective improvements of timing.

So if the results are not meaningful because there are no xfailed tests, yet they seem worse, what do you suggest we do?

My suggestion is to merge because we know that there are measurable benefits when there are xfailed tests. We can't guarantee that ucx-py (or the other repos) will never use xfail markers, and as soon as they do we'll be paying a heavier price than we should by not having this setting. The data from the repositories that do currently use xfails is convincing IMHO. If you disagree and think that the results above are not meaningful, then feel free to either rerun to gather more data or close these PRs until there are xfailed tests that could be used to gather data (note that there is also some sort of GH outage that's been happening this morning that could impact numbers; if we want good numbers we probably should run locally in a controlled environment and not in CI). Realistically, though, I think we will forget to do that and we're better off merging now.

galipremsagar commented 1 month ago

if the results are not convincing and there's still potential for us to lose useful debugging information, then I don't really see a reason to make this change.

This is how the presentation of the information is going to change with the native traceback:

With pytest traceback:

[Screenshots: default pytest traceback output]

With native traceback:

[Screenshot: native traceback output]

With the native traceback we basically stop letting pytest do all the pretty printing, coloring, and printing of the entire surrounding source code.
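To make the difference concrete: --tb=native delegates formatting to Python's standard traceback module. A minimal, self-contained sketch (the failing function here is made up for illustration):

import traceback

def failing():
    # A made-up failure, just to have something to format.
    assert 1 + 1 == 3, "example assertion"

try:
    failing()
except AssertionError:
    # This is the formatting --tb=native produces: a standard
    # "Traceback (most recent call last):" header with one "File ..." line
    # and one source line per frame, and no surrounding context, colors,
    # or introspected values.
    print(traceback.format_exc())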

pentschev commented 1 month ago

With the native traceback we basically stop letting pytest do all the pretty printing, coloring, and printing of the entire surrounding source code.

Thanks @galipremsagar, but I actually saw those screenshots previously in https://github.com/rapidsai/cudf/pull/16851#issue-2537240497. However, that doesn't confirm whether or not we lose information. As per the docs, the default is auto, which is a combination of long ("exhaustive") and short, and while I don't mind the pretty printing that much, I do mind if we lose frames, or information from frames, by switching to native. Do you know if it's documented anywhere that the contents are exactly the same, or maybe even more complete, with native than with auto?

pentschev commented 1 month ago

I've updated the runtimes for the most recent job in https://github.com/rapidsai/ucx-py/pull/1075#pullrequestreview-2329225358, and it still seems to take longer:

In [1]: (7 + 7 + 6) * 60 + 45 + 24 + 54  # 7:45 + 7:24 + 6:54
Out[1]: 1323

pentschev commented 1 month ago

I was trying to add an extra column to the table that you had. Based on the numbers in your table, it looked like you were looking at the time for the "Python tests" step in the conda-python-tests job, so for instance this 11.8 job that shows 7:08 is the top-left entry in your table, right? Assuming that is correct, the corresponding job in #1067 is this one, which has a run time of 7:28. The sum is probably worse because the 12.0.1 job seems to have run much faster. I'm just not convinced that any of these numbers are signal; they all seem within the noise threshold to me.

Sorry, I missed this comment yesterday. Yes, your interpretation is correct. When you say you're not convinced any of these numbers are a signal, that's fair, but what would the signal be, then? I currently see no signals that show a benefit from this change; if there are any I'm missing, can you please point them out?

IMHO we aren't losing useful debugging information. I don't find the long tracebacks particularly helpful or readable relative to the default. That was also the opinion expressed by a few others when discussing this change.

I would normally tend to agree, at least for regular synchronous Python libraries. When it comes to async Python you normally get almost no useful information, and on top of that failures are often very hard to reproduce anywhere but CI. Because of that I don't think we have the luxury of discarding any possible hint about what may be happening, so I'm against discarding anything, especially since there's no clear answer as to what exactly we might be missing; so far this looks to me more like a gut feeling.

Of course, not everyone will agree on something like that so I think it makes more sense to make the decision based on the concrete objective improvements of timing.

I would have agreed with that, but this change has demonstrated no concrete timing improvements, so at this point it seems we're just discussing the expected result rather than what we are actually observing. If there are any clear signs here or in Dask-CUDA that this has indeed improved performance, please point them out and I'll be happy that we improved runtime, but so far there have been no signs of that, and I would argue it looks like it has made things worse.

My suggestion is to merge because we know that there are measurable benefits when there are xfailed tests. We can't guarantee that ucx-py (or the other repos) will never use xfail markers, and as soon as they do we'll be paying a heavier price than we should by not having this setting. The data from the repositories that do currently use xfails is convincing IMHO. If you disagree and think that the results above are not meaningful, then feel free to either rerun to gather more data or close these PRs until there are xfailed tests that could be used to gather data (note that there is also some sort of GH outage that's been happening this morning that could impact numbers; if we want good numbers we probably should run locally in a controlled environment and not in CI). Realistically, though, I think we will forget to do that and we're better off merging now.

Well, if there were conclusive signs showing that the results were at least "as good", and that we're indeed not losing any potentially useful information (or simply hints, which are usually the best we can get for async), I would agree with merging. Right now what I'm hearing is "there's a promise it will be better, and a feeling that the discarded information is not useful", and what I observe is "this is looking worse, and I don't know whether anything that will be discarded will be missed". To me personally, as the person who primarily has to deal with debugging in UCX-Py/Dask-CUDA, the results here look less like a benefit and more like a loss.

If anybody has the time to run those tests in a controlled environment and show that the results are indeed beneficial, I'd be happy to change my opinion.

pentschev commented 1 month ago

My suggestion is to merge because we know that there are measurable benefits when there are xfailed tests.

Also, when you say "there are measurable benefits when there are xfailed tests", I'm assuming this requires a large number of xfailed tests to show a benefit; is that right? If not, it should be pretty easy to add a couple of xfailed tests and clearly demonstrate the runtime benefit on a test suite the size of UCX-Py's/Dask-CUDA's.
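For anyone who wants to try that experiment, a hypothetical sketch of such a test (the name, reason, and error are all made up):

import pytest

@pytest.mark.xfail(reason="made-up failure, only to exercise traceback formatting")
def test_timing_probe():
    # An xfailed test still makes pytest capture and format a traceback,
    # which is where the claimed --tb=native savings come from.
    raise RuntimeError("expected failure for the timing experiment")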

galipremsagar commented 1 month ago

@pentschev I agree that this change will not bring any improvement in runtimes, due to the lack of xfail/xpass tests in ucx-py. However, this PR doesn't make the runtime worse either. The times in the table above are for the entire CI runs, which include conda package installs, git checkouts, etc., none of which this PR influences. The following are the exact pytest execution times we need to look at:

| Run | CUDA 11.8.0 | CUDA 12.0.1 | CUDA 12.5.1 |
| --- | --- | --- | --- |
| This change, run #1 | 02:23mins & 31.30s | 02:08mins & 28.87s | 02:40mins & 41.07s |
| This change, run #2 | 02:29mins & 38.65s | 02:11mins & 28.32s | 02:24mins & 34.88s |
| PR #1070 | 02:22mins & 32.99s | 02:09mins & 27.10s | 02:24mins & 32.53s |
| PR #1069 | 02:27mins & 32.41s | 02:12mins & 27.71s | 02:38mins & 43.01s |
| PR #1067 | 02:31mins & 30.62s | 02:10mins & 25.69s | 02:26mins & 34.47s |
| PR #1078, run #1 | 02:23mins & 39.44s | 02:11mins & 27.80s | 02:30mins & 35.74s |
| PR #1078, run #2 | 02:25mins & 35.45s | 02:12mins & 28.10s | 02:32mins & 36.72s |

As you can see, there is no change in the pytest times.

Coming to the printing of the stack trace: --tb=native will just print the errors with standard Python library formatting and the full stack trace. There is no further official documentation showing the difference apart from local testing, and from local testing I see the following traceback: https://github.com/rapidsai/ucx-py/pull/1075#issuecomment-2375258715. If you think that, due to the nature of async execution, this will be a loss of information, I'm happy to close this and the dask-cuda PRs.
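One mitigating detail worth noting: because addopts values are prepended to the command line, the setting should be overridable per run when richer formatting is wanted for a local debugging session (a sketch, assuming pytest's usual last-option-wins handling of a repeated flag):

pytest --tb=long    # passed on the CLI, overrides --tb=native from addopts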

vyasr commented 1 month ago

I do think you need a nontrivial number of xfails to see a difference, yes. I am pretty much 100% sure that there is no negative performance impact from this change, but I do agree that debugging async code (and hard-to-reproduce bugs) can make more informative tracebacks much more important. The native traceback does not discard any frames; it simply removes the pretty-printing and the extra context (native traces show just the relevant code line, while the pytest trace includes many extra lines of context with pretty-printing). If having that extra information displayed in CI is helpful for debugging, then I think it's fine to close this PR, since it sounds like you also don't anticipate xfailing much.

pentschev commented 3 weeks ago

Sorry I let this fall behind. Based on https://github.com/rapidsai/ucx-py/pull/1075#issuecomment-2378097541 I'm less worried about the time; however, I'm still not 100% convinced this will not remove useful information from the errors that eventually pop up in CI. With that said, I'm OK merging this, but I'll keep an eye on whether it removes too much information, and if I notice it does I will most likely want to revert it.

pentschev commented 2 weeks ago

/merge