Pyspark pipeline's `describe` should not compute DAG stage count by default

mansenfranzen / pywrangler

Advanced data wrangling for python

https://github.com/mansenfranzen/pywrangler

MIT License

Pyspark pipeline's `describe` should not compute DAG stage count by default #19

Open mansenfranzen opened 4 years ago

mansenfranzen commented 4 years ago

Currently, calling a pyspark pipeline's `describe` method extracts the number of stages of the DAG computation graph. This can take an unreasonable amount of time for large computation graphs and pipelines with many stages. `describe` should accept a parameter such as `dag_stage_count` that activates this computation, and it should be deactivated by default.
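
A minimal sketch of the proposed opt-in behaviour. `PipelineSketch` and its `count_stages` callable are hypothetical stand-ins for illustration, not pywrangler's actual classes; only the parameter name `dag_stage_count` comes from the proposal above.

```python
from typing import Callable, Dict, List


class PipelineSketch:
    """Hypothetical stand-in for a pyspark pipeline (not pywrangler's API)."""

    def __init__(self, steps: List[str], count_stages: Callable[[str], int]):
        self.steps = steps
        # Assumed to be the expensive part: extracting the stage count
        # of the DAG for a given step.
        self._count_stages = count_stages

    def describe(self, dag_stage_count: bool = False) -> List[Dict[str, object]]:
        """Return one row per step; compute stage counts only on request."""
        rows = []
        for name in self.steps:
            row: Dict[str, object] = {"step": name}
            if dag_stage_count:
                # Only pay for the DAG inspection when explicitly asked.
                row["stage_count"] = self._count_stages(name)
            rows.append(row)
        return rows
```

With this shape, `pipeline.describe()` never touches the stage counter, while `pipeline.describe(dag_stage_count=True)` adds a `stage_count` entry per step.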

mansenfranzen commented 4 years ago

Additionally, `describe` should report the number of exchanges and sorts for better interpretability.
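
One possible way to obtain these counts, sketched under the assumption that the physical plan is available as text (for example, what `DataFrame.explain()` prints). The helper `count_plan_operators` and the sample plan are hypothetical and hand-written for illustration. The word-boundary match counts only standalone `Exchange` and `Sort` operators, so `SortMergeJoin` or `BroadcastExchange` are deliberately not counted here.

```python
import re
from typing import Dict


def count_plan_operators(plan_text: str) -> Dict[str, int]:
    """Count standalone Exchange and Sort operators in a plan string."""
    return {
        "exchanges": len(re.findall(r"\bExchange\b", plan_text)),
        "sorts": len(re.findall(r"\bSort\b", plan_text)),
    }


# Hand-written sample resembling a sort-merge join plan (illustrative only).
SAMPLE_PLAN = """\
SortMergeJoin [id#0L], [id#4L], Inner
:- Sort [id#0L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#0L, 200)
:     +- Range (0, 10, step=1, splits=8)
+- Sort [id#4L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#4L, 200)
      +- Range (0, 10, step=1, splits=8)
"""

print(count_plan_operators(SAMPLE_PLAN))
```

Whether variants such as `BroadcastExchange` should also count as exchanges is a design decision left open here.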