gshimansky opened 3 years ago
Thank you for contributing this script. I am now running the Modin join benchmark and will report back when it finishes.
Below I am presenting timings made on this PR (precisely speaking, on https://github.com/h2oai/db-benchmark/tree/modin-join-dev).
The obvious observation is that there is a problem with the performance of join question 5, the big-to-big join: 1e7 rows joined to 1e7 rows, 1e8 to 1e8, 1e9 to 1e9. That is a fairly common problem for software that works in a distributed manner; you may find this video interesting: https://www.youtube.com/watch?v=5X7h1rZGVs0
Another thing, actually more disturbing, is the timing values in the `chk_time_sec` column, see below. This field is briefly described in the https://github.com/h2oai/db-benchmark/blob/master/_docs/maintenance.md#timecsv document as:

> `chk_time_sec` - time taken to compute the `chk` field, used to track laziness of evaluation

We generally expect this value to be very low, much lower than the value of `time_sec`. That way we can ensure that computation "A", the one measured by `time_sec`, has actually been computed fully and not deferred to a later computation "B" where the results of computation "A" need to be used.
From the values we can observe here, we could reason that some of the join computation might actually be happening later on, when we try to use those not yet fully computed values.
What is necessary here, to ensure the benchmark is fair, is to investigate the laziness of those operations.
Eventually, what could help here is a comment from a Modin maintainer saying whether it is lazy evaluation or not, and explaining why "computation B" takes so much time.
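For reference, the pattern behind these two timings is roughly the following (a simplified sketch, not the exact join-modin.py code; the file paths, column names, and checksum expression are illustrative):

```python
import timeit
import modin.pandas as pd

x = pd.read_csv("J1_1e7_NA_0_0.csv")       # left table (illustrative path)
small = pd.read_csv("J1_1e7_1e1_0_0.csv")  # right table (illustrative path)

# time_sec: time the join itself
t0 = timeit.default_timer()
ans = x.merge(small, on="id1")
time_sec = timeit.default_timer() - t0

# chk_time_sec: time a cheap aggregation over the result; if evaluation were
# lazy, the deferred join work would show up here instead of in time_sec
t0 = timeit.default_timer()
chk = [ans["v1"].sum(), ans["v2"].sum()]
chk_time_sec = timeit.default_timer() - t0
```

If the engine evaluates eagerly, `chk_time_sec` should cost only a couple of column sums; values comparable to `time_sec` suggest the join work is being deferred.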
Timings for all 5 questions:
| question | run | time_sec | chk_time_sec |
|---|---|---|---|
| small inner on int | 1 | 4.470 | 2.188 |
| small inner on int | 2 | 1.290 | 2.278 |
| medium inner on int | 1 | 3.724 | 2.460 |
| medium inner on int | 2 | 2.282 | 2.561 |
| medium outer on int | 1 | 2.267 | 2.438 |
| medium outer on int | 2 | 2.276 | 2.567 |
| medium inner on factor | 1 | 2.414 | 2.561 |
| medium inner on factor | 2 | 2.775 | 2.559 |
| big inner on int | 1 | 754.830 | 95.274 |
| big inner on int | 2 | 714.218 | 95.663 |
All join queries successfully finished in 1859s.
When trying to do the first run of q5, Python is being `Killed`.
Timings of q1-q4:
| question | run | time_sec | chk_time_sec |
|---|---|---|---|
| small inner on int | 1 | 50.147 | 21.417 |
| small inner on int | 2 | 12.247 | 21.553 |
| medium inner on int | 1 | 22.070 | 22.748 |
| medium inner on int | 2 | 21.020 | 22.955 |
| medium outer on int | 1 | 20.398 | 22.280 |
| medium outer on int | 2 | 22.891 | 23.420 |
| medium inner on factor | 1 | 22.467 | 23.794 |
| medium inner on factor | 2 | 22.616 | 24.169 |
In the case of the 1e9 rows data, the script is already failing during data loading. Unless Modin can handle out-of-memory data, this is expected. If Modin is able to handle out-of-memory data (does it?), then we should enable that just for the 1e9 data size.
```
Traceback (most recent call last):
  File "./modin/join-modin.py", line 35, in <module>
    x = pd.read_csv(src_jn_x)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/pandas/io.py", line 112, in parser_func
    return _read(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/pandas/io.py", line 127, in _read
    pd_obj = EngineDispatcher.read_csv(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/data_management/factories/dispatcher.py", line 113, in read_csv
    return cls.__engine._read_csv(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/data_management/factories/factories.py", line 87, in _read_csv
    return cls.io_cls.read_csv(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/engines/base/io/file_reader.py", line 29, in read
    query_compiler = cls._read(*args, **kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/engines/base/io/text/csv_reader.py", line 159, in _read
    is_quoting=is_quoting,
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/engines/base/io/text/text_file_reader.py", line 104, in offset
    chunk = f.read(chunk_size_bytes)
MemoryError
```
I checked with Modin developer @YarShev, who knows the details of the merge operation, and we don't have any lazy computation for it. Performance there is a subject for investigation because I see these problems too, but we haven't figured out the reason for this behavior yet.
As for memory, it looks like no solution is currently able to pass the 1e9 configuration https://h2oai.github.io/db-benchmark/join/J1_1e9_NA_0_0_basic.png so it is no wonder that Modin behaves like this too. Modin doesn't do any out-of-memory operations; all data has to fit into memory.
I updated the Modin implementation of the join benchmark to the current state. The code is mostly copied from the pandas version, but there are some differences.
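For readers comparing the two scripts, the differences are typically limited to the import and engine initialization rather than the query code itself. A minimal sketch of such a setup, assuming the Ray engine (not necessarily the exact configuration used in this PR; file names and memory values are illustrative):

```python
import os

# Select Modin's execution engine before modin.pandas is first imported
# (assumption: Ray; Dask is the other common choice).
os.environ["MODIN_ENGINE"] = "ray"

import ray
import modin.pandas as pd  # drop-in replacement for `import pandas as pd`

# Initializing Ray explicitly allows controlling object-store memory;
# the value below is purely illustrative.
ray.init(object_store_memory=8 * 1024**3)

x = pd.read_csv("J1_1e7_NA_0_0.csv")       # illustrative file name
small = pd.read_csv("J1_1e7_1e1_0_0.csv")  # illustrative file name
ans = x.merge(small, on="id1")             # same pandas API as the pandas script
```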