PDF writing fails with joblib.Parallel using (default) Loky backend

leoschwarz commented 1 month ago

What happened?

Dear developers, I'm not sure if this is well known, but the following code

import altair as alt
import joblib
import pandas as pd
import os

os.environ["RUST_BACKTRACE"] = "full"

def write_chart(filename):
    df = pd.DataFrame({"x": [2, 3, 4], "y": [5, 5, 3]})
    chart = alt.Chart(df).mark_point().encode(x="x", y="y")
    chart.save(filename)

filenames = [f"chart{i}.pdf" for i in range(2)]
joblib.Parallel(n_jobs=2)(joblib.delayed(write_chart)(filename) for filename in filenames)

results in an error (full message below).

I'm reporting it here since I triggered with altair and would be nice to address with a fix or documentation, but maybe the issue originates in another project and it is beyond the scope of this issue tracker. If you think this would better fit into the loky or deno tracker I'm happy to move it there.

/Users/leo/code/msi/.venv/bin/python /Users/leo/code/msi/code/prototyping/altair_minimal_pdf.py 
thread '<unnamed>' panicked at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/deno_runtime-0.166.0/worker.rs:701:7:
Bootstrap exception: TypeError: testEnabled is not a function
    at init (ext:deno_node/internal/util/debuglog.ts:51:15)
    at debug (ext:deno_node/internal/util/debuglog.ts:54:5)
    at logger (ext:deno_node/internal/util/debuglog.ts:69:29)
    at readableAddChunk (ext:deno_node/_stream.mjs:2797:7)
    at Readable.push (ext:deno_node/_stream.mjs:2791:14)
    at initStdin (ext:deno_node/_process/streams.mjs:185:13)
    at Object.internals.__bootstrapNodeProcess (node:process:653:22)
    at initialize (ext:deno_node/02_init.js:34:15)
    at bootstrapMainRuntime (ext:runtime_main/js/99_main.js:873:7)
stack backtrace:
   0:        0x31089fa1c - _BrotliDecoderVersion
   1:        0x310098630 - _BrotliDecoderVersion
   2:        0x310875ef8 - _BrotliDecoderVersion
   3:        0x3108a20b8 - _BrotliDecoderVersion
   4:        0x3108a19d0 - _BrotliDecoderVersion
   5:        0x3108a15d8 - _BrotliDecoderVersion
   6:        0x3108a28e8 - _BrotliDecoderVersion
   7:        0x3108a23d0 - _BrotliDecoderVersion
   8:        0x3108a2338 - _BrotliDecoderVersion
   9:        0x3108a232c - _BrotliDecoderVersion
  10:        0x312068e9c - _wgpu_render_bundle_insert_debug_marker
  11:        0x310a71160 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  12:        0x310a6e184 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  13:        0x310a806c0 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  14:        0x3108a4e74 - _BrotliDecoderVersion
  15:        0x1937baf94 - __pthread_joiner_wake
thread '<unnamed>' panicked at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/deno_runtime-0.166.0/worker.rs:701:7:
Bootstrap exception: TypeError: testEnabled is not a function
    at init (ext:deno_node/internal/util/debuglog.ts:51:15)
    at debug (ext:deno_node/internal/util/debuglog.ts:54:5)
    at logger (ext:deno_node/internal/util/debuglog.ts:69:29)
    at readableAddChunk (ext:deno_node/_stream.mjs:2797:7)
    at Readable.push (ext:deno_node/_stream.mjs:2791:14)
    at initStdin (ext:deno_node/_process/streams.mjs:185:13)
    at Object.internals.__bootstrapNodeProcess (node:process:653:22)
    at initialize (ext:deno_node/02_init.js:34:15)
    at bootstrapMainRuntime (ext:runtime_main/js/99_main.js:873:7)
stack backtrace:
   0:        0x14889fa1c - _BrotliDecoderVersion
   1:        0x148098630 - _BrotliDecoderVersion
   2:        0x148875ef8 - _BrotliDecoderVersion
   3:        0x1488a20b8 - _BrotliDecoderVersion
   4:        0x1488a19d0 - _BrotliDecoderVersion
   5:        0x1488a15d8 - _BrotliDecoderVersion
   6:        0x1488a28e8 - _BrotliDecoderVersion
   7:        0x1488a23d0 - _BrotliDecoderVersion
   8:        0x1488a2338 - _BrotliDecoderVersion
   9:        0x1488a232c - _BrotliDecoderVersion
  10:        0x14a068e9c - _wgpu_render_bundle_insert_debug_marker
  11:        0x148a71160 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  12:        0x148a6e184 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  13:        0x148a806c0 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  14:        0x1488a4e74 - _BrotliDecoderVersion
  15:        0x1937baf94 - __pthread_joiner_wake
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
        ^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 598, in <listcomp>
    return [func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/code/prototyping/altair_minimal_pdf.py", line 12, in write_chart
    chart.save(filename)
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/altair/vegalite/v5/api.py", line 2086, in save
    save(**kwds)
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/altair/utils/save.py", line 224, in save
    perform_save()
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/altair/utils/save.py", line 189, in perform_save
    mb_any = spec_to_mimebundle(
             ^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/altair/utils/mimebundle.py", line 134, in spec_to_mimebundle
    return _spec_to_mimebundle_with_engine(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/altair/utils/mimebundle.py", line 267, in _spec_to_mimebundle_with_engine
    pdf = vlc.vegalite_to_pdf(
          ^^^^^^^^^^^^^^^^^^^^
ValueError: Vega-Lite to PDF conversion failed:
Failed to retrieve conversion result: oneshot canceled
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/leo/code/msi/code/prototyping/altair_minimal_pdf.py", line 16, in <module>
    joblib.Parallel(n_jobs=2)(joblib.delayed(write_chart)(filename) for filename in filenames)
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 2007, in __call__
    return output if self.return_generator else list(output)
                                                ^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 1650, in _get_outputs
    yield from self._retrieve()
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 1754, in _retrieve
    self._raise_error_fast()
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 1789, in _raise_error_fast
    error_job.get_result(self.timeout)
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 745, in get_result
    return self._return_or_raise()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 763, in _return_or_raise
    raise self._result
ValueError: Vega-Lite to PDF conversion failed:
Failed to retrieve conversion result: oneshot canceled

Process finished with exit code 1

What would you like to happen instead?

The loop should work without an error, which is the case if you set either of:

n_jobs=1
backend="multiprocessing" and add freeze_support()

Especially the latter is interesting and is what I am using as a workaround now.

Which version of Altair are you using?

5.4.0

jonmmease commented 1 month ago

Hi @leoschwarz, thanks for the report. I've transferred this over to the vl-convert repo which implements the image export logic.

The vl-convert bundles the Deno JavaScript runtime which only supports running on a single thread, but my understanding is that the loky backend uses separate processes, so I'm not certain that's the issue. Do you have the same issue using the multiprocessing API?

leoschwarz commented 1 month ago

Thank you for transferring the issue. With multiprocessing I cannot reproduce this issue, i.e. the following works:

from multiprocessing import freeze_support

import altair as alt
import joblib
import os
import pandas as pd

os.environ["RUST_BACKTRACE"] = "full"

def write_chart(filename):
    df = pd.DataFrame({"x": [2, 3, 4], "y": [5, 5, 3]})
    chart = alt.Chart(df).mark_point().encode(x="x", y="y")
    chart.save(filename)

if __name__ == "__main__":
    freeze_support()
    filenames = [f"chart{i}.pdf" for i in range(2)]
    joblib.Parallel(n_jobs=10, backend="multiprocessing")(
        joblib.delayed(write_chart)(filename) for filename in filenames)

Taken from the loky README:

"All processes are started using fork + exec on POSIX systems. This ensures safer interactions with third party libraries. On the contrary, multiprocessing.Pool uses fork without exec by default, causing third party runtimes to crash (e.g. OpenMP, macOS Accelerate...)."

So my understanding is they use different fork models, but in this case the default multiprocessing seems to work whereas loky does not. I'm not really an expert on the details of multiprocessing to understand how this relates to Deno's runtime.

jonmmease commented 1 month ago

Thanks for the investigation @leoschwarz. Documentation is probably the best first step, just to let people know that the multiprocessing backend works but loky backend does not.

One thing that we might be able to do is expose an alternative API that doesn't rely on a global instance of the Rust object that wraps Deno. We might be able to expose a VlConverter() class that wraps a dedicated instance of the Deno, that has methods for each of the global vl_convert.* functions. The hope would be that if you create and use this from within the forked process, then everything will work fine since the global Deno instance wouldn't be forked.

leoschwarz commented 1 month ago

I'm not sure if that would fully resolve the problem yet, because my workflow is basically joblib distributing tasks which execute the plotting within a new subprocess each starting its own Python interpreter (largely to avoid this type of problem), so I suspect the problem is located in a native extension doing something unusual with memory somewhere. I'm looking into creating a better example for this.

vega / vl-convert