julioasotodv / spark-df-profiling

Create HTML profiling reports from Apache Spark DataFrames
MIT License

profile multiple files in parallel #6

mike-vogel closed 7 years ago

mike-vogel commented 7 years ago

Is there a way to profile multiple files in parallel? When I start a thread for each file using the same Spark context, I get random failures from matplotlib, apparently because it is not thread-safe.

julioasotodv commented 7 years ago

Yes, that is expected. Matplotlib is not thread-safe, and there is currently no plan to work around that. Also, Spark itself is designed to be 100% sequential on the driver side; the multicore/distributed engine kicks in on a per-dataset basis. If your datasets are small enough to fit on a single machine, you may want to take a look at pandas-profiling...
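In practice this means generating the reports one after another from a single driver thread and letting Spark parallelize the work inside each dataset. A minimal sketch of such a loop follows; `profile_sequentially` and `profile_one` are hypothetical names, not part of spark-df-profiling, and `profile_one` is assumed to be a function you supply that takes a file path and returns the rendered HTML report as a string.

```python
from typing import Callable, Dict, Iterable, Tuple


def profile_sequentially(
    paths: Iterable[str],
    profile_one: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Profile each file in turn from the calling thread.

    `profile_one` is a hypothetical, caller-supplied function: it takes a
    file path and returns the HTML report for that file as a string.
    """
    reports: Dict[str, str] = {}
    failures: Dict[str, Exception] = {}
    for path in paths:
        try:
            # Only one report is built at a time, so matplotlib is never
            # called from two threads at once.
            reports[path] = profile_one(path)
        except Exception as exc:
            # Record the error and continue with the remaining files.
            failures[path] = exc
    return reports, failures
```

Here `profile_one` would wrap whatever code loads one file into a DataFrame and builds its report. Collecting failures per file keeps one bad input from aborting the whole batch.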