modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

Run benchmarks using modin in omniscripts #1736

Closed Garra1980 closed 1 year ago

Garra1980 commented 4 years ago

We want to add the ability to run tests using Modin in omniscripts, to establish regular testing and get results in Grafana. Currently we want to be able to run the following benchmarks in two modes, modin-on-ray and modin-on-omnisci (sorted by importance):

Garra1980 commented 4 years ago

Initial implementation:
in Omnisci: https://github.com/intel-go/omniscidb/pull/53
in Omniscripts: so far in branch https://github.com/intel-go/omniscripts/tree/modin_changes

We also should think about enabling python3.8 in omniscripts

Garra1980 commented 4 years ago

This is high priority now. We are enabling h2o, taxi and mortgage. A bunch of bugs have been fixed already; mostly we are struggling with type conversions between Arrow and pandas.

Garra1980 commented 4 years ago

To start using omniscripts with Modin&Omnisci:

amyskov commented 4 years ago

Examples of commands to run benchmarks in Modin_on_omnisci mode:

NYC taxi:

    python run_ibis_tests.py -executable /localdisk/amyskov/omniscidb/build6/bin/omnisci_server -task build,benchmark --env_name env_name --env_check True --save_env True --modin_path /localdisk/amyskov/modin_omnisci/ -data_file '/localdisk/amyskov/benchmark_datasets/taxi/hundreed_k_trips_xa{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t}.csv' -dfiles_num 1 -bench_name ny_taxi -no_ibis true -pandas_mode Modin_on_omnisci

Mortgage:

    python run_ibis_tests.py -executable /localdisk/amyskov/omniscidb/build6/bin/omnisci_server -task build,benchmark --env_name env_name --env_check True --save_env True --modin_path /localdisk/amyskov/modin_omnisci/ -i /localdisk/amyskov/ibis/ -data_file /localdisk/benchmark_datasets/mortgage-transformed -bench_name mortgage -no_ibis true -no_ml true -pandas_mode Modin_on_omnisci

H2O:

    python run_ibis_tests.py -executable /localdisk/amyskov/omniscidb/build6/bin/omnisci_server -task build,benchmark --env_name env_name --env_check True --save_env True --modin_path /localdisk/amyskov/modin_omnisci/ -data_file "/localdisk/benchmark_datasets/h2o/G1_1e7_1e2_0_0.csv" -bench_name h2o -no_ibis true -pandas_mode Modin_on_omnisci
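
The commands above share most of their options, so a small helper can assemble them from the parts that actually differ. This is only a sketch: `build_command` and `COMMON` are hypothetical names, the paths are placeholders, and the flags are copied from the commands in this comment rather than from the omniscripts documentation.

```python
import shlex

# Options that appear unchanged in every command quoted in this comment.
COMMON = [
    "-task", "build,benchmark",
    "--env_name", "env_name",
    "--env_check", "True",
    "--save_env", "True",
    "-no_ibis", "true",
    "-pandas_mode", "Modin_on_omnisci",
]


def build_command(executable, modin_path, bench_name, data_file, extra=()):
    """Assemble the argument list for one run_ibis_tests.py invocation.

    `extra` carries benchmark-specific options, e.g. ["-dfiles_num", "1"].
    """
    return [
        "python", "run_ibis_tests.py",
        "-executable", executable,
        "--modin_path", modin_path,
        "-data_file", data_file,
        "-bench_name", bench_name,
        *COMMON,
        *extra,
    ]


# Placeholder paths: substitute the locations used on your machine.
cmd = build_command(
    executable="/path/to/omnisci_server",
    modin_path="/path/to/modin/",
    bench_name="h2o",
    data_file="/path/to/G1_1e7_1e2_0_0.csv",
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Benchmark-specific options such as `-no_ml true` for Mortgage go in `extra`.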

amyskov commented 4 years ago

Results reported to the MySQL database can be found on the Grafana server, in the Modin directory, in two dashboards: "ETL execution times vs date-time for NYC taxi and Mortgage benchmarks" and "[h2o 5 GB dataset] queries execution times vs date for H2O benchmark". The Grafana server URL can be found in Jira ticket 217.

Garra1980 commented 4 years ago

Need to re-check this with latest Modin

anmyachev commented 4 years ago

fix taxi launch - https://github.com/intel-ai/omniscripts/pull/148

anmyachev commented 4 years ago

census:

    python run_ibis_tests.py -executable /localdisk/amyachev/omniscidb/build/bin/omnisci_server -task build,benchmark --env_name test-env-new --env_check True --save_env True --modin_path /localdisk/amyachev/modin/ -data_file '/localdisk/amyskov/benchmark_datasets/census/ipums_education2income_1970-2010.csv' -dfiles_num 1 -bench_name census -no_ibis true -pandas_mode Modin_on_omnisci -no_ml true

Some notes:

amyskov commented 4 years ago

Some corrections were made in omniscripts, and results are now properly reported for most of the benchmarks. Dashboards for the Census benchmark were also added to the "ETL execution times vs date-time for NYC taxi, Mortgage and Census benchmarks" tables. Next steps: prepare the rest of the dashboards for the H2O benchmark and support modin-on-omnisci mode for python3.8 in omniscripts.

Garra1980 commented 4 years ago

Mortgage seems to be broken:

  File "run_ibis_benchmark.py", line 425, in main
    benchmark_results = run_benchmark(parameters)
  File "/localdisk/izamyati/omniscripts/mortgage/mortgage_runner.py", line 231, in run_benchmark
    df_pd, mb_pd, etl_times_pd = _etl_pandas(parameters, acq_schema, perf_schema, etl_keys)
  File "/localdisk/izamyati/omniscripts/mortgage/mortgage_runner.py", line 49, in _etl_pandas
    pandas_mode=parameters["pandas_mode"],
  File "/localdisk/izamyati/omniscripts/mortgage/mortgage_pandas.py", line 474, in etl_pandas
    pd_dfs.append(mb.run_cpu_workflow(quarter=quarter, year=year, perf_file=fname))
  File "/localdisk/izamyati/omniscripts/mortgage/mortgage_pandas.py", line 78, in run_cpu_workflow
    names = self.pd_load_names()
  File "/localdisk/izamyati/omniscripts/mortgage/mortgage_pandas.py", line 150, in pd_load_names
    df = pd.read_csv(self.col_names_path, names=cols, delimiter="|")
  File "/localdisk/izamyati/miniconda3/envs/test-env/lib/python3.7/site-packages/modin-0.8.1+15.g333e2a0.dirty-py3.7.egg/modin/pandas/io.py", line 112, in parser_func
    return _read(**kwargs)
  File "/localdisk/izamyati/miniconda3/envs/test-env/lib/python3.7/site-packages/modin-0.8.1+15.g333e2a0.dirty-py3.7.egg/modin/pandas/io.py", line 127, in _read
    pd_obj = EngineDispatcher.read_csv(**kwargs)
  File "/localdisk/izamyati/miniconda3/envs/test-env/lib/python3.7/site-packages/modin-0.8.1+15.g333e2a0.dirty-py3.7.egg/modin/data_management/dispatcher.py", line 115, in read_csv
    return cls.__engine._read_csv(**kwargs)
  File "/localdisk/izamyati/miniconda3/envs/test-env/lib/python3.7/site-packages/modin-0.8.1+15.g333e2a0.dirty-py3.7.egg/modin/data_management/factories.py", line 56, in _read_csv
    return cls.io_cls.read_csv(**kwargs)
  File "/localdisk/izamyati/miniconda3/envs/test-env/lib/python3.7/site-packages/modin-0.8.1+15.g333e2a0.dirty-py3.7.egg/modin/experimental/engines/omnisci_on_ray/io.py", line 198, in read_csv
    convert_options=co,
  File "pyarrow/_csv.pyx", line 583, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 2 columns, got 1

Also, we need to get rid of the ibis dependency for mortgage.

Garra1980 commented 4 years ago

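For more precise time measurements in the modin-on-omnisci case, we need to initialize the Calcite connection before the workload itself by running something like

    import modin.pandas as pd  # the thread is about Modin, so pd is assumed to be modin.pandas

    # Warm-up: a trivial operation so the Calcite connection is set up before timing starts
    data = {"a": [1, 2, 3]}
    df = pd.DataFrame(data)
    df = df + 1
    print(df)

prior to the workload.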

Also, all tests have to be run with MODIN_USE_CALCITE=True set in the environment.

ienkovich commented 4 years ago

We can avoid the print by doing something like _ = df.index
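
Putting the warm-up and the index access together, a benchmark harness could look roughly like the sketch below. The helper names `warm_up`, `timed` and `run_with_modin` are hypothetical and not part of omniscripts. The environment variable is set before importing Modin on the assumption that it is read during import or first use, and that accessing the index triggers execution is taken from the suggestion above rather than verified here.

```python
import os
import time


def warm_up(pd):
    """Run a trivial operation so one-time backend setup happens before timing.

    `pd` is any pandas-compatible module, for example `modin.pandas`.
    """
    df = pd.DataFrame({"a": [1, 2, 3]})
    df = df + 1
    # Access the index instead of printing the frame, as suggested in the thread.
    _ = df.index
    return df


def timed(workload, *args, **kwargs):
    """Call `workload` once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = workload(*args, **kwargs)
    return result, time.perf_counter() - start


def run_with_modin(workload):
    """Hypothetical entry point: warm up Modin, then time `workload(pd)`."""
    # Assumption: the variable must be set before Modin is imported.
    os.environ.setdefault("MODIN_USE_CALCITE", "True")
    import modin.pandas as pd  # requires a Modin build with the OmniSci backend

    warm_up(pd)
    return timed(workload, pd)
```

A call such as `run_with_modin(lambda pd: pd.read_csv(path))`, with `path` pointing at a benchmark file, would then exclude the connection setup from the measured time.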

Garra1980 commented 4 years ago

Right, thanks, I should have changed that

Garra1980 commented 4 years ago

https://github.com/intel-ai/modin/blob/omnisci-on-ray/modin/experimental/engines/omnisci_on_ray/io.py#L167 might be a problem, since the delimiter is currently not set up correctly there, as far as I can see.
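
For illustration only, here is one way a delimiter can get lost on the way to pyarrow. This is not a confirmed diagnosis of the linked line. pandas documents `delimiter` as an alias for `sep`, and the Mortgage code calls `read_csv` with `delimiter="|"`, so a reader that looks only at `sep` would fall back to a comma and see a single field per row. `resolve_delimiter` and `read_with_pyarrow` are hypothetical helpers; the column count is shown with the standard `csv` module so the sketch runs without pyarrow.

```python
import csv
import io


def resolve_delimiter(sep=None, delimiter=None, default=","):
    """Collapse pandas-style `sep` / `delimiter` keywords into one character."""
    if delimiter is not None:
        return delimiter
    if sep is not None:
        return sep
    return default


def count_columns(text, delimiter):
    """Return the number of fields in each row of `text` for a delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader]


def read_with_pyarrow(path, names, sep=None, delimiter=None):
    """Hypothetical helper: pass the resolved delimiter on to pyarrow."""
    from pyarrow import csv as pa_csv  # imported lazily; requires pyarrow

    return pa_csv.read_csv(
        path,
        read_options=pa_csv.ReadOptions(column_names=names),
        parse_options=pa_csv.ParseOptions(
            delimiter=resolve_delimiter(sep, delimiter)
        ),
    )


# Made-up sample rows in the same "|"-separated shape as the names file.
sample = "1|Bank A\n2|Bank B\n"
print(count_columns(sample, ","))  # [1, 1]: each row is a single field
print(count_columns(sample, resolve_delimiter(delimiter="|")))  # [2, 2]
```

With a comma, each row parses as one field, which matches the shape of the "Expected 2 columns, got 1" error in the traceback.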

Garra1980 commented 3 years ago

Only Mortgage is left, and it is not a top priority now.

anmyachev commented 1 year ago

@Garra1980 could we close that?