h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Update Modin join benchmark to current state #162

Open gshimansky opened 3 years ago

gshimansky commented 3 years ago

I updated the Modin implementation of the join benchmark to the current state. Most of the code is copied from the pandas version, but there are some differences.
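
For readers unfamiliar with Modin: it aims to be a drop-in replacement for pandas, so porting a solution script is mostly an import change plus backend setup. An illustrative sketch, not the exact diff of this PR:

# pandas version of a solution script would start with: import pandas as pd
import modin.pandas as pd  # Modin version: same API, parallel backend (e.g. Ray)

x = pd.DataFrame({"id1": [1, 2, 3], "v1": [0.1, 0.2, 0.3]})
y = pd.DataFrame({"id1": [2, 3, 4], "v2": [1.0, 2.0, 3.0]})
print(x.merge(y, on="id1"))  # the same merge call as in the pandas script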

jangorecki commented 3 years ago

Thank you for contributing this script. I am now running the Modin join benchmark and will report back when it finishes.

jangorecki commented 3 years ago

Below are the timings I measured on this PR (precisely speaking, on https://github.com/h2oai/db-benchmark/tree/modin-join-dev).

The most obvious observation is that there is a performance problem with join question 5, the big-to-big join: 1e7 rows joined to 1e7 rows, 1e8 to 1e8, 1e9 to 1e9. That is a quite common problem for software that works in a distributed manner; you may find this video interesting: https://www.youtube.com/watch?v=5X7h1rZGVs0

Another thing, actually more disturbing, is the timing values in the chk_time_sec column, see below. This field is briefly described in https://github.com/h2oai/db-benchmark/blob/master/_docs/maintenance.md#timecsv as

chk_time_sec - time taken to compute the chk field, used to track laziness of evaluation

We generally expect this value to be very low, much lower than time_sec. That is how we ensure that computation "A", the one measured by time_sec, has actually been computed fully, and not deferred to a later computation "B" that uses its results. From the values observed here, one could suspect that some of the join computation actually happens later, when we try to use the not-yet-computed values. To ensure the benchmark is fair, the laziness of those operations needs to be investigated. What would help here is a comment from a Modin maintainer saying whether evaluation is lazy or not, and explaining why "computation B" takes so much time.
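
For context, the measurement pattern in the benchmark scripts looks roughly like this (a minimal sketch; column names and sizes are illustrative, not the exact script). If the merge were lazy, the work would surface in the second timer rather than the first:

import time
import modin.pandas as pd

# Tiny illustrative frames; the real scripts load the generated CSV data.
x = pd.DataFrame({"id1": list(range(1000)), "v1": list(range(1000))})
y = pd.DataFrame({"id1": list(range(500, 1500)), "v2": list(range(1000))})

t0 = time.time()
ans = x.merge(y, how="inner", on="id1")   # computation "A": the timed join
time_sec = time.time() - t0

t0 = time.time()
chk = [ans["v1"].sum(), ans["v2"].sum()]  # computation "B": checksum over the result
chk_time_sec = time.time() - t0

# With eager evaluation chk_time_sec stays small; with lazy evaluation the
# merge is only forced here, so chk_time_sec balloons while time_sec looks good.
print(time_sec, chk_time_sec)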

1e7

Timings for all 5 questions:

                  question run time_sec chk_time_sec
 1:     small inner on int   1    4.470        2.188
 2:     small inner on int   2    1.290        2.278
 3:    medium inner on int   1    3.724        2.460
 4:    medium inner on int   2    2.282        2.561
 5:    medium outer on int   1    2.267        2.438
 6:    medium outer on int   2    2.276        2.567
 7: medium inner on factor   1    2.414        2.561
 8: medium inner on factor   2    2.775        2.559
 9:       big inner on int   1  754.830       95.274
10:       big inner on int   2  714.218       95.663

All join queries successfully finished in 1859s.

1e8

When trying the first run of q5, the python process gets Killed (presumably for running out of memory).

Timings of q1-q4:

                 question run time_sec chk_time_sec
1:     small inner on int   1   50.147       21.417
2:     small inner on int   2   12.247       21.553
3:    medium inner on int   1   22.070       22.748
4:    medium inner on int   2   21.020       22.955
5:    medium outer on int   1   20.398       22.280
6:    medium outer on int   2   22.891       23.420
7: medium inner on factor   1   22.467       23.794
8: medium inner on factor   2   22.616       24.169

1e9

For the 1e9-row data, the script already fails while loading the data. Unless Modin can handle larger-than-memory data, this is expected. If Modin is able to handle larger-than-memory data (does it?), then we should enable that just for the 1e9 data size; see the chunked-reading sketch after the traceback below.

Traceback (most recent call last):
  File "./modin/join-modin.py", line 35, in <module>
    x = pd.read_csv(src_jn_x)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/pandas/io.py", line 112, in parser_func
    return _read(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/pandas/io.py", line 127, in _read
    pd_obj = EngineDispatcher.read_csv(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/data_management/factories/dispatcher.py", line 113, in read_csv
    return cls.__engine._read_csv(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/data_management/factories/factories.py", line 87, in _read_csv
    return cls.io_cls.read_csv(**kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/engines/base/io/file_reader.py", line 29, in read
    query_compiler = cls._read(*args, **kwargs)
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/engines/base/io/text/csv_reader.py", line 159, in _read
    is_quoting=is_quoting,
  File "/home/jan/git/db-benchmark/modin/py-modin/lib/python3.6/site-packages/modin/engines/base/io/text/text_file_reader.py", line 104, in offset
    chunk = f.read(chunk_size_bytes)
MemoryError
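
For comparison, purely in-memory engines usually work around this with chunked reading; a minimal pandas-style sketch (the file name is a placeholder, and this is not something join-modin.py does today):

import pandas as pd

total_rows = 0
# Read the CSV in 1e7-row chunks so only one chunk is resident at a time.
for chunk in pd.read_csv("J1_1e9_NA_0_0.csv", chunksize=10_000_000):
    total_rows += len(chunk)
print(total_rows)
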
gshimansky commented 3 years ago

I checked with Modin developer @YarShev, who knows the details of the merge operation: we don't have any lazy computation for it. Its performance is a subject for investigation, because I see these problems too, but we haven't figured out the reason for this behavior yet.

As for memory, it looks like no solution is currently able to pass the 1e9 configuration https://h2oai.github.io/db-benchmark/join/J1_1e9_NA_0_0_basic.png so it is no wonder that Modin behaves like this too. Modin doesn't do any out-of-core operations; all data has to fit into memory.
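
A rough back-of-envelope (with an assumed schema, not the benchmark's exact one) illustrates why the 1e9 case is hard for purely in-memory engines:

rows = 10**9
cols = 4                # assume a few int64/float64 columns per table
bytes_per_cell = 8
gib = rows * cols * bytes_per_cell / 2**30
print(f"~{gib:.0f} GiB per input table")  # ~30 GiB each, before any join output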