sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License
2.01k stars, 204 forks

Does the community consider running dataprep EDA on Yarn? #771

Closed: Bowen0729 closed this issue 2 years ago

Bowen0729 commented 2 years ago

My company uses the Hadoop ecosystem for big data, which means we have a Yarn cluster but no Dask cluster. As far as I know, Dask can run on Yarn. I recently tried running dataprep on Yarn, and it worked well. So, would the community consider supporting dataprep on Yarn? We could work on it together.

dovahcrow commented 2 years ago

Hi @Bowen0729, that is good news to hear. We always want DataPrep to have more ecosystem integrations. Could you write down how you ran DataPrep on the Yarn cluster, so that we can turn it into a page in our documentation?

Bowen0729 commented 2 years ago

Sure, @dovahcrow!

  1. Install dask-yarn with pip

    pip install dask-yarn

  2. Ensure that the libraries used on the Yarn cluster are the same as what you are using locally.

    Use conda-pack to package the conda environment:

    conda-pack

    Then upload the resulting archive to HDFS, for example to hdfs:///mypath/archive.tar.gz

  3. Run dataprep eda with the following:

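    from dask_yarn import YarnCluster
    from dask.distributed import Client
    import dask.dataframe as dd
    from dataprep.eda import create_report

    # Start a Dask cluster on Yarn using the packed conda environment
    cluster = YarnCluster(environment='archive.tar.gz', worker_memory='10GiB', worker_vcores=4, scheduler_memory='1GiB')

    # Request four workers and connect a client to the cluster
    cluster.scale(4)
    client = Client(cluster)

    # Load the data from HDFS as a Dask DataFrame
    ddf = dd.read_parquet('hdfs:///data-path/data.parquet')

    # Generate the EDA report; the computation runs on the Yarn cluster
    create_report(ddf)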
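Step 2 above could also be scripted. The following is only a minimal sketch, not something from the original steps: the environment name `dataprep-env` is a made-up placeholder, the helper functions are hypothetical, and it assumes the `conda-pack` and `hdfs` command-line tools are on the PATH. By default it only prints the commands instead of running them.

```python
import subprocess
from typing import List


def pack_command(env_name: str, archive: str = "archive.tar.gz") -> List[str]:
    """Build the conda-pack call that archives the named conda environment."""
    return ["conda-pack", "--name", env_name, "--output", archive]


def upload_command(archive: str, hdfs_dir: str) -> List[str]:
    """Build the HDFS upload call; -f overwrites an archive that already exists."""
    return ["hdfs", "dfs", "-put", "-f", archive, hdfs_dir]


def package_and_upload(env_name: str, hdfs_dir: str, dry_run: bool = True) -> List[List[str]]:
    """Pack the environment, then upload it. With dry_run=True, only print the commands."""
    archive = "archive.tar.gz"
    commands = [pack_command(env_name, archive), upload_command(archive, hdfs_dir)]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a command fails
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # "dataprep-env" is a hypothetical environment name; replace it with your own.
    package_and_upload("dataprep-env", "hdfs:///mypath/")
```

Calling `package_and_upload("dataprep-env", "hdfs:///mypath/", dry_run=False)` would actually run both commands, so it is worth checking the printed commands first.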

Spark supports multiple cluster managers, such as standalone, Mesos, Hadoop Yarn, and Kubernetes. Because dataprep is built on Dask, it can handle big data, which is an advantage over other frameworks. So should dataprep EDA support other cluster managers as well? That would make it friendlier to big data scenarios. What do you think?

If it is necessary, we can discuss how to design dataprep on Yarn; perhaps users could choose the running mode.

If it is not necessary, I will open a PR adding documentation for dataprep on Yarn once you have verified the steps. It would be my pleasure to be a contributor to dataprep.
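To make the "running mode" idea concrete, here is a rough sketch of what such a switch might look like. Everything in it is an assumption for discussion: `resolve_cluster_options` and `make_client` are hypothetical helpers, not part of dataprep, and the Yarn defaults are simply copied from the steps above. The Dask imports are deferred so that choosing a mode does not require dask-yarn to be installed for local runs.

```python
from typing import Any, Dict, Optional

# Defaults copied from the YarnCluster call in the steps above.
YARN_DEFAULTS: Dict[str, Any] = {
    "environment": "archive.tar.gz",
    "worker_memory": "10GiB",
    "worker_vcores": 4,
    "scheduler_memory": "1GiB",
}

SUPPORTED_MODES = ("local", "yarn")


def resolve_cluster_options(mode: str, overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user overrides onto the defaults for the chosen running mode."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unknown running mode {mode!r}; expected one of {SUPPORTED_MODES}")
    base = YARN_DEFAULTS if mode == "yarn" else {}
    return {**base, **(overrides or {})}


def make_client(mode: str = "local", n_workers: int = 4, overrides: Optional[Dict[str, Any]] = None):
    """Hypothetical helper: start a cluster for the chosen mode and return a Dask client."""
    options = resolve_cluster_options(mode, overrides)
    # Imported lazily so that validating a mode needs no Dask installation.
    from dask.distributed import Client, LocalCluster

    if mode == "yarn":
        from dask_yarn import YarnCluster

        cluster = YarnCluster(**options)
        cluster.scale(n_workers)
    else:
        options.setdefault("n_workers", n_workers)
        cluster = LocalCluster(**options)
    return Client(cluster)
```

With something like this, a user would call `make_client("yarn")` before `create_report(ddf)`, and the rest of the EDA code would stay the same in either mode.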

jinglinpeng commented 2 years ago

Hi @Bowen0729, thanks a lot for the detailed steps! Currently we do not have enough people to make dataprep work on Yarn, which would need a lot of optimization and testing.

It would be very nice if you could add the doc for Yarn! You could add a section about Yarn in this file: https://github.com/sfu-db/dataprep/blob/develop/docs/source/installation.rst and then open a PR. Thanks for being a contributor to dataprep!

Bowen0729 commented 2 years ago

Thank you for the reply, @jinglinpeng. I will finish it in the next few days!