mike-vogel closed this 7 years ago
Yes, that is expected. Matplotlib is not thread-safe, and there is currently no plan to change that. Also, Spark itself is designed to be 100% sequential on the driver side; the multicore/distributed engine kicks in on a per-dataset basis. If your datasets are small enough to fit on a single machine, you may want to take a look at pandas-profiling...
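If threads are kept anyway, one general workaround for a non-thread-safe step is to serialize only that step behind a lock. Below is a minimal sketch of the idea, not something this library provides: `compute_stats` and `render_plots` are hypothetical stand-ins for the per-file statistics work and the Matplotlib drawing, and it assumes the plotting can be isolated in its own function.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One process-wide lock guarding every call into the non-thread-safe plotting code.
PLOT_LOCK = threading.Lock()


def profile_file(path, compute_stats, render_plots):
    """Profile one file; only the plotting step is serialized.

    compute_stats and render_plots are hypothetical callables supplied by the caller.
    """
    stats = compute_stats(path)      # may overlap with other threads
    with PLOT_LOCK:                  # at most one thread plots at a time
        plots = render_plots(stats)
    return path, stats, plots


def profile_all(paths, compute_stats, render_plots, max_workers=4):
    """Run profile_file for each path on a thread pool; return results keyed by path."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(profile_file, p, compute_stats, render_plots)
            for p in paths
        ]
        results = {}
        for future in futures:
            path, stats, plots = future.result()  # re-raises any worker exception
            results[path] = (stats, plots)
        return results
```

This only helps when the drawing code can be wrapped like this. If the plotting happens deep inside a single library call, the lock would have to cover that whole call, and running each file in a separate process may be the simpler option.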
Is there a way to profile multiple files in parallel? When I start a thread for each file using the same Spark context, I get random failures from matplotlib, apparently because it is not thread-safe.