jupyter / nbdime

Tools for diffing and merging of Jupyter notebooks.
http://nbdime.readthedocs.io

Improve the performance of nbdime #540

Open dclong opened 3 years ago

dclong commented 3 years ago

I have a notebook that is about 4.3 MB. There is only a small change to the notebook, but nbdiff took 8 minutes to run on a very powerful machine.

muxator commented 3 years ago

I had a similar problem today.

I tried to diff a 16 MB notebook without success: RAM usage went through the roof, and I had to hit Ctrl+C before it hung my system.

This produced a stack trace (perhaps the signal handling could be improved?) which, paradoxically, pointed to a likely culprit for the RAM usage (and which I suspect may be related to your problem):

This is the function nbdiff was hanging in:

https://github.com/jupyter/nbdime/blob/409898921d5dfb4e248fd791dfe4ca47776a0f6d/nbdime/diffing/seq_bruteforce.py#L15-L17

The function returns a list of lists, i.e., it stores len(A) * len(B) elements in memory. That list of lists is then iterated over by other functions in the call stack.

Returning a generator expression might help, but the logic of the calling functions would then have to change as well: they currently assume random access to every element of the array, and that the sizes are known in advance. They would need to be rewritten in a streaming fashion (itertools may be helpful here).
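For illustration, here is a minimal sketch of the two approaches (hypothetical function names; not nbdime's actual code):

```python
# Hypothetical sketch of the pattern described above; not nbdime's actual code.

def compare_grid_eager(A, B, compare=lambda a, b: a == b):
    # Materializes all len(A) * len(B) comparison results at once:
    # memory grows quadratically with the input sizes.
    return [[compare(a, b) for b in B] for a in A]

def compare_grid_lazy(A, B, compare=lambda a, b: a == b):
    # Generator of rows: each row is built only when the caller asks
    # for it, so at most one row is alive at a time -- but callers
    # lose random access and len(), hence the streaming rewrite.
    return ([compare(a, b) for b in B] for a in A)
```

The lazy version only helps if consumers iterate once, front to back; any caller that indexes into the grid or queries its size would have to be restructured first.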

I am not an avid Jupyter user. This problem was just tangential for me today, but it was glaring enough to be worth a look in nbdime's issue tracker, and here I am :smile:

muxator commented 3 years ago

Here's a quick profile of the long-running diff to get the analysis started. Maybe we'll discover that the problem is somewhere else :smile: :

$ python -m cProfile -s cumtime venv/bin/nbdiff notebook1.ipynb notebook2.ipynb.old

[CTRL+C]

         166281829 function calls (166075588 primitive calls) in 11.491 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    360/1    0.002    0.000   24.878   24.878 {built-in method builtins.exec}
        1    0.000    0.000   24.878   24.878 nbdiff:3(<module>)
        1    0.000    0.000   24.712   24.712 nbdiffapp.py:129(main)
        1    0.000    0.000   24.708   24.708 nbdiffapp.py:30(main_diff)
        1    0.000    0.000   24.708   24.708 nbdiffapp.py:50(_handle_diff)
        1    0.000    0.000   24.176   24.176 notebooks.py:605(diff_notebooks)
     48/1    0.000    0.000   24.176   24.176 generic.py:81(diff)
     24/1    0.000    0.000   24.176   24.176 generic.py:193(diff_dicts)
     33/1    0.000    0.000   24.176   24.176 generic.py:107(diff_sequence_multilevel)
       33    0.000    0.000   24.163    0.732 snakes.py:35(compute_snakes_multilevel)
       33    0.001    0.000   24.163    0.732 snakes.py:18(compute_snakes)
       33    0.000    0.000   23.118    0.701 seq_bruteforce.py:63(bruteforce_compute_snakes)
       33    0.000    0.000   23.116    0.700 seq_bruteforce.py:15(bruteforce_compare_grid)
       33    0.330    0.010   23.116    0.700 seq_bruteforce.py:17(<listcomp>)
     32/1    0.000    0.000   23.107   23.107 snakes.py:76(compute_diff_from_snakes)
        3    0.000    0.000   22.076    7.359 notebooks.py:382(diff_single_outputs)
       24    0.000    0.000   22.075    0.920 sequences.py:53(diff_strings_linewise)
        2    0.000    0.000   22.074   11.037 notebooks.py:463(diff_mime_bundle)
        3    0.001    0.000   22.074    7.358 notebooks.py:414(add_mime_diff)
       24    0.000    0.000   22.065    0.919 generic.py:123(diff_lists)
165023450    8.345    0.000    8.345    0.000 {built-in method _operator.eq}
       16    0.000    0.000    2.088    0.130 notebooks.py:269(compare_output_strict)
       16    0.000    0.000    2.088    0.130 notebooks.py:181(compare_mimebundle_strict)
       20    0.000    0.000    2.088    0.104 notebooks.py:154(compare_mimedata_strict)
       20    0.019    0.001    2.088    0.104 notebooks.py:124(_compare_mimedata)
        8    0.000    0.000    2.068    0.259 notebooks.py:115(_compare_mimedata_strings)
        8    0.000    0.000    2.062    0.258 notebooks.py:54(_is_base64)
       98    2.015    0.021    2.015    0.021 {method 'match' of 're.Pattern' objects}
       47    0.013    0.000    1.089    0.023 {built-in method builtins.all}
      182    0.000    0.000    1.069    0.006 notebooks.py:352(compare_cell_strict)
      175    0.000    0.000    1.044    0.006 snakes.py:30(<genexpr>)
        2    0.000    0.000    0.532    0.266 utils.py:25(read_notebook)
        2    0.001    0.000    0.532    0.266 __init__.py:113(read)

vidartf commented 3 years ago

The key to getting this fixed (and any other issues) would be to share a reproducible example. Where do those megabytes of the notebook come from? Are there any specific characteristics? E.g., are they all large HTML outputs with embedded base64 strings?
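Lacking a shareable notebook, a synthetic baseline with large base64 outputs could be generated along these lines (a standard-library-only sketch; the file name, cell count, payload size, and the `plot_something` source line are all arbitrary placeholders):

```python
import base64
import json
import os

def make_notebook(path, n_cells=5, payload_bytes=200_000):
    """Write a minimal v4 notebook whose size is dominated by
    base64-encoded binary outputs, mimicking embedded images."""
    cells = []
    for i in range(n_cells):
        # Random binary payload, base64-encoded as an image/png output.
        payload = base64.b64encode(os.urandom(payload_bytes)).decode("ascii")
        cells.append({
            "cell_type": "code",
            "execution_count": i + 1,
            "metadata": {},
            "source": ["plot_something(%d)\n" % i],  # placeholder source
            "outputs": [{
                "output_type": "display_data",
                "metadata": {},
                "data": {"image/png": payload},
            }],
        })
    nb = {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
    with open(path, "w") as f:
        json.dump(nb, f)

make_notebook("big1.ipynb")
```

Diffing two such notebooks that differ in only one payload should exercise the mime-bundle and base64 code paths visible in the profile above.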

vidartf commented 3 years ago

@muxator The non-optimal brute force is a known issue (#3), but it is typically not a major problem for cells/source inputs, as these tend to be small. But if someone has a symptomatic case which we can use as a baseline, then optimizing performance becomes much easier 👍

muxator commented 3 years ago

if someone has a symptomatic case which we can use as a baseline, then optimizing performance becomes much easier

@vidartf, I agree.

I cannot share the specific notebook, because it was not written by me. I'll see if I can produce a redacted version with the same pathological characteristics, performance-wise.

This specific notebook was created by a scientist with deep knowledge of his domain, who happens to use a notebook to quickly prototype something, but who has no specific computer-science background (and is not required to have one, after all).

I am not invested at all in the Jupyter ecosystem, so I may be wrong here, but this kind of figure probably fits very well into the target demographic for Jupyter. With such a user base, I think worst-case performance problems are guaranteed to be hit :smile: Also, factor in that a user with a beefy development machine can sometimes be shielded from problematic code/data.

Back to the point: I'll try to summarize the notebook contents as soon as possible. I hope I can find a way to redact it (or synthesize something similar).

Maybe there is an alternative: @dclong, would it be easy for you to share an anonymized version of your problematic notebook? You are probably more proficient with Jupyter than I am.

muxator commented 3 years ago

In the meantime, a quick summary of my problematic notebook:

This probably ticks all the boxes for a pathological notebook :smile:, yet it looks completely legitimate to the eyes of the user.