Closed Garra1980 closed 1 year ago
Initial implementation:
in Omnisci: https://github.com/intel-go/omniscidb/pull/53
in Omniscripts: so far in the branch https://github.com/intel-go/omniscripts/tree/modin_changes
We also should think about enabling python3.8 in omniscripts
This is high priority now: we are enabling h2o, taxi and mortgage. A bunch of bugs have been fixed already; mostly we are struggling with type conversions between Arrow and pandas.
To start using omniscripts with Modin&Omnisci:
Examples of commands to run benchmarks in Modin_on_omnisci mode:
NYC taxi:
python run_ibis_tests.py -executable /localdisk/amyskov/omniscidb/build6/bin/omnisci_server -task build,benchmark --env_name env_name --env_check True --save_env True --modin_path /localdisk/amyskov/modin_omnisci/ -data_file '/localdisk/amyskov/benchmark_datasets/taxi/hundreed_k_trips_xa{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t}.csv' -dfiles_num 1 -bench_name ny_taxi -no_ibis true -pandas_mode Modin_on_omnisci
Mortgage:
python run_ibis_tests.py -executable /localdisk/amyskov/omniscidb/build6/bin/omnisci_server -task build,benchmark --env_name env_name --env_check True --save_env True --modin_path /localdisk/amyskov/modin_omnisci/ -i /localdisk/amyskov/ibis/ -data_file /localdisk/benchmark_datasets/mortgage-transformed -bench_name mortgage -no_ibis true -no_ml true -pandas_mode Modin_on_omnisci
H2O:
python run_ibis_tests.py -executable /localdisk/amyskov/omniscidb/build6/bin/omnisci_server -task build,benchmark --env_name env_name --env_check True --save_env True --modin_path /localdisk/amyskov/modin_omnisci/ -data_file "/localdisk/benchmark_datasets/h2o/G1_1e7_1e2_0_0.csv" -bench_name h2o -no_ibis true -pandas_mode Modin_on_omnisci
Results reported to the MySQL database can be found on the Grafana server, in the Modin directory, in the dashboards "ETL execution times vs date-time for NYC taxi and Mortgage benchmarks" and "[h2o 5 GB dataset] queries execution times vs date for H2O benchmark". The Grafana server URL can be found in Jira ticket 217.
Need to re-check this with latest Modin
fix taxi launch - https://github.com/intel-ai/omniscripts/pull/148
census:
python run_ibis_tests.py -executable /localdisk/amyachev/omniscidb/build/bin/omnisci_server -task build,benchmark --env_name test-env-new --env_check True --save_env True --modin_path /localdisk/amyachev/modin/ -data_file '/localdisk/amyskov/benchmark_datasets/census/ipums_education2income_1970-2010.csv' -dfiles_num 1 -bench_name census -no_ibis true -pandas_mode Modin_on_omnisci -no_ml true
Some notes:
-no_ml False is not supported in omnisci mode due to NotImplementedError: to_numpy is not yet suported in DFAlgQueryCompiler
-pandas_mode can be set to Modin_on_ray to run in modin-on-ray mode instead
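The commands above differ only in a few arguments, so they can be assembled from one helper. The following is a minimal sketch, not part of omniscripts: `build_command` is a hypothetical name, the paths are placeholders, and the flags are copied as they appear in the commands in this thread.

```python
import shlex
from typing import List, Optional


def build_command(
    executable: str,
    env_name: str,
    modin_path: str,
    bench_name: str,
    data_file: str,
    pandas_mode: str = "Modin_on_omnisci",
    extra: Optional[List[str]] = None,
) -> List[str]:
    """Assemble the argument list for run_ibis_tests.py (hypothetical helper)."""
    cmd = [
        "python", "run_ibis_tests.py",
        "-executable", executable,
        "-task", "build,benchmark",
        "--env_name", env_name,
        "--env_check", "True",
        "--save_env", "True",
        "--modin_path", modin_path,
        "-data_file", data_file,
        "-bench_name", bench_name,
        "-no_ibis", "true",
        "-pandas_mode", pandas_mode,
    ]
    # Benchmark-specific flags (e.g. -dfiles_num, -no_ml) go at the end.
    return cmd + list(extra or [])


# Placeholder paths: replace them with real locations on your machine.
census_cmd = build_command(
    executable="/path/to/omniscidb/build/bin/omnisci_server",
    env_name="test-env",
    modin_path="/path/to/modin/",
    bench_name="census",
    data_file="/path/to/census/ipums_education2income_1970-2010.csv",
    extra=["-dfiles_num", "1", "-no_ml", "true"],
)

# Render a copy-pasteable shell line with each argument quoted as needed.
print(" ".join(shlex.quote(arg) for arg in census_cmd))
```

Passing `pandas_mode="Modin_on_ray"` would produce the same command for the other mode. The list form can also be handed directly to `subprocess.run` without going through a shell.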
Some corrections were made in omniscripts, and results are now properly reported for most of the benchmarks. Dashboards for the Census benchmark were also added to the "ETL execution times vs date-time for NYC taxi, Mortgage and Census benchmarks" tables.
Next steps: prepare the rest of the dashboards for the H2O benchmark, and support modin-on-omnisci mode for python3.8 in omniscripts.
Mortgage seems to be broken:
File "run_ibis_benchmark.py", line 425, in main
benchmark_results = run_benchmark(parameters)
File "/localdisk/izamyati/omniscripts/mortgage/mortgage_runner.py", line 231, in run_benchmark
df_pd, mb_pd, etl_times_pd = _etl_pandas(parameters, acq_schema, perf_schema, etl_keys)
File "/localdisk/izamyati/omniscripts/mortgage/mortgage_runner.py", line 49, in _etl_pandas
pandas_mode=parameters["pandas_mode"],
File "/localdisk/izamyati/omniscripts/mortgage/mortgage_pandas.py", line 474, in etl_pandas
pd_dfs.append(mb.run_cpu_workflow(quarter=quarter, year=year, perf_file=fname))
File "/localdisk/izamyati/omniscripts/mortgage/mortgage_pandas.py", line 78, in run_cpu_workflow
names = self.pd_load_names()
File "/localdisk/izamyati/omniscripts/mortgage/mortgage_pandas.py", line 150, in pd_load_names
df = pd.read_csv(self.col_names_path, names=cols, delimiter="|")
File "/localdisk/izamyati/miniconda3/envs/test-env/lib/python3.7/site-packages/modin-0.8.1+15.g333e2a0.dirty-py3.7.egg/modin/pandas/io.py", line 112, in parser_func
return _read(**kwargs)
File "/localdisk/izamyati/miniconda3/envs/test-env/lib/python3.7/site-packages/modin-0.8.1+15.g333e2a0.dirty-py3.7.egg/modin/pandas/io.py", line 127, in _read
pd_obj = EngineDispatcher.read_csv(**kwargs)
File "/localdisk/izamyati/miniconda3/envs/test-env/lib/python3.7/site-packages/modin-0.8.1+15.g333e2a0.dirty-py3.7.egg/modin/data_management/dispatcher.py", line 115, in read_csv
return cls.__engine._read_csv(**kwargs)
File "/localdisk/izamyati/miniconda3/envs/test-env/lib/python3.7/site-packages/modin-0.8.1+15.g333e2a0.dirty-py3.7.egg/modin/data_management/factories.py", line 56, in _read_csv
return cls.io_cls.read_csv(**kwargs)
File "/localdisk/izamyati/miniconda3/envs/test-env/lib/python3.7/site-packages/modin-0.8.1+15.g333e2a0.dirty-py3.7.egg/modin/experimental/engines/omnisci_on_ray/io.py", line 198, in read_csv
convert_options=co,
File "pyarrow/_csv.pyx", line 583, in pyarrow._csv.read_csv
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 2 columns, got 1
Also we need to get rid of ibis dependency for mortgage
For more precise time measurements in modin-on-omnisci mode, we need to initialize the Calcite connection by running something like
import modin.pandas as pd
data = {"a": [1, 2, 3]}
df = pd.DataFrame(data)
df = df + 1
print(df)
prior to the workload itself.
Also, all tests have to be run with MODIN_USE_CALCITE=True set in the environment
We can avoid the print by doing something like _ = df.index
Right, thanks, I should have changed that
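A minimal sketch of how the warm-up discussed above could be kept out of the measured time. `timed_run` and `modin_warm_up` are hypothetical names, not part of omniscripts, and the assumption that `_ = df.index` is enough to trigger execution comes from the comment above rather than from a verified Modin guarantee.

```python
import time
from typing import Any, Callable, Optional, Tuple


def timed_run(
    workload: Callable[[], Any],
    warm_up: Optional[Callable[[], Any]] = None,
) -> Tuple[Any, float]:
    """Run warm_up untimed, then return (workload result, elapsed seconds)."""
    if warm_up is not None:
        warm_up()  # one-time initialization cost is paid here, outside the timer
    start = time.perf_counter()
    result = workload()
    elapsed = time.perf_counter() - start
    return result, elapsed


def modin_warm_up() -> None:
    """Trivial DataFrame operation meant to initialize the backend connection."""
    # Imported lazily so timed_run itself has no Modin dependency.
    import modin.pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})
    df = df + 1
    _ = df.index  # assumed to force execution without printing
```

A benchmark would then call something like `timed_run(run_etl, warm_up=modin_warm_up)`, where `run_etl` stands for the workload function.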
https://github.com/intel-ai/modin/blob/omnisci-on-ray/modin/experimental/engines/omnisci_on_ray/io.py#L167 might be the problem, since the delimiter is no longer set up correctly as far as I can see
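To illustrate why a dropped delimiter would produce an error like "Expected 2 columns, got 1", here is a small sketch using the standard-library csv module rather than pyarrow; the sample data is made up. A "|"-separated line parsed with the default comma delimiter comes back as a single column. In pyarrow the delimiter is normally passed through `pyarrow.csv.ParseOptions(delimiter=...)`; whether that is the exact fix needed in Modin's io.py is an assumption.

```python
import csv
import io

# Made-up sample resembling a "|"-separated two-column file.
sample = "col_a|col_b\n1|2\n"


def column_count(text: str, delimiter: str) -> int:
    """Return the number of fields found in the first row of text."""
    first_row = next(csv.reader(io.StringIO(text), delimiter=delimiter))
    return len(first_row)


print(column_count(sample, ","))  # delimiter not forwarded: whole line is one field
print(column_count(sample, "|"))  # delimiter forwarded: two fields
```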
We have Mortgage left which is not top priority now
@Garra1980 could we close that?
We want to add the possibility to run tests using Modin in omniscripts, to establish regular testing and get results in Grafana. Currently we want to be able to run the following benchmarks in 2 modes, modin-on-ray and modin-on-omnisci (sorted by importance):
[x] nyc taxi
[x] Census
[x] h2o
[ ] Mortgage
[x] Plasticc