root-11 / tablite

multiprocessing enabled out-of-memory data analysis library for tabular data.
MIT License

Table.load very slow with dtype('O') #125

Closed root-11 closed 9 months ago

root-11 commented 10 months ago

The overview below shows the load (t)ime per page in seconds, its stdev, and the count (of pages) for each dtype:

loading 'kyle_cs_r.tpz' file: 100%|███████████████████████| 41/41 [00:00<00:00, 124.14it/s]
dtype('int64') t: 0.00438, stdev: 0.00000, count: 1
dtype('<U46')  t: 0.02037, stdev: 0.00000, count: 1
dtype('O')     t: 0.00877, stdev: 0.00095, count: 27
dtype('<U36')  t: 0.00767, stdev: 0.00000, count: 1
dtype('<U58')  t: 0.01238, stdev: 0.00164, count: 2
dtype('<U20')  t: 0.00334, stdev: 0.00064, count: 2
dtype('<U11')  t: 0.00199, stdev: 0.00000, count: 1
dtype('<U23')  t: 0.00536, stdev: 0.00057, count: 2
dtype('<U21')  t: 0.00385, stdev: 0.00067, count: 3
dtype('<U26')  t: 0.00438, stdev: 0.00000, count: 1

loading 'kyle_cs_l.tpz' file: 100%|█████████████████████████| 747/747 [10:14<00:00,  1.22it/s]
dtype('<U2')     t: 0.01451, stdev: 0.01045, count: 166
dtype('int64')   t: 0.02732, stdev: 0.00889, count: 332
dtype('O')       t: 3.61340, stdev: 0.35634, count: 166
dtype('float64') t: 0.03205, stdev: 0.01417, count: 83

Note the second-to-last record: dtype('O') takes 3.61 sec / page!

My conclusion is that we should use dtype 'O' only as a last resort.
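The gap between object and fixed-width pages can be reproduced outside tablite. The following is a minimal sketch, assuming pages behave like NumPy `.npy` arrays; the helper `roundtrip_seconds` and the dict "structs" are hypothetical stand-ins, not tablite code. Absolute timings depend on the machine, so none are quoted here.

```python
import io
import time

import numpy as np


def roundtrip_seconds(arr):
    """Save `arr` to an in-memory .npy buffer, then time np.load on it."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=True)
    buf.seek(0)
    start = time.perf_counter()
    loaded = np.load(buf, allow_pickle=True)  # object arrays are unpickled here
    return time.perf_counter() - start, loaded


n = 100_000
ints = np.arange(n, dtype=np.int64)
# Hypothetical stand-in for the "structs" mentioned in this issue.
objs = np.array([{"id": i} for i in range(n)], dtype=object)

t_int, ints_back = roundtrip_seconds(ints)
t_obj, objs_back = roundtrip_seconds(objs)
print(f"int64: {t_int:.5f}s   object: {t_obj:.5f}s")
```

The int64 page is read back as a raw block of bytes, while every element of the object page has to be rebuilt as a Python object by the unpickler.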

Question: @realratchet - can you imagine a way I could reindex the structs without loading them?

realratchet commented 10 months ago

I'm not exactly sure what you would use re-indexing for if the data isn't loaded, unless you mean that you already have an index based on some other criteria and just want to blindly select the values from the array.

If that's the case, then no: it cannot be improved as it stands, because the pickle format doesn't allow random access. However, the Nim implementation should definitely be faster. I implemented the unpickler fully in Nim, whereas Python's unpickler is written in native Python and is not a C binding, which makes it slow. I haven't benchmarked it against the Python implementation, but I'm pretty sure it would be faster, and there are places where it could be made faster still.

root-11 commented 10 months ago

...re-indexing for if it's not loaded...

When I have an index and need to re-arrange fields in the right table during a join, the actual value that is being re-ordered doesn't matter. It only matters that the values are put in the right order.

For example, in a join where the right-side index is [4,2,3,1], all fields on all pages have to be re-ordered to match the order dictated by the index. So it doesn't matter whether it's a struct, an int, or anything else, as long as the output page contains the bytes in the correct order.
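For fixed-width dtypes this "only the byte order matters" idea can be sketched directly in NumPy. The helper `reorder_fixed_width` below is hypothetical (not part of tablite), and it assumes the index is 0-based and the page is a 1-D array. It moves rows as raw bytes and never interprets the values.

```python
import numpy as np


def reorder_fixed_width(page, index):
    """Reorder a 1-D fixed-width page by moving raw bytes only."""
    if page.dtype.hasobject:
        # Object pages hold pointers to Python objects, not the values themselves.
        raise TypeError("object pages cannot be reordered as raw bytes")
    width = page.dtype.itemsize
    # View each row as `width` raw bytes, pick rows, then restore the dtype.
    raw = np.ascontiguousarray(page).view(np.uint8).reshape(len(page), width)
    return raw[index].reshape(-1).view(page.dtype)


index = [4, 2, 3, 1]  # assumed 0-based for this sketch
ints = np.array([10, 20, 30, 40, 50], dtype=np.int64)
strs = np.array(["a", "bb", "c", "dd", "e"])  # dtype '<U2'

print(reorder_fixed_width(ints, index))  # [50 30 40 20]
print(reorder_fixed_width(strs, index))  # ['e' 'c' 'dd' 'bb']
```

The same function handles both pages because it depends only on the item size, which is exactly what a pickled object page does not have.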

realratchet commented 10 months ago

Then no, it is not possible with the pickle format. Re-ordering without loading requires random access, and because pickled bytes are not aligned you are forced to read the entire page. Since the entire file has to be read, the only saving grace would be speeding up the reading process.
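The difference in random access can be illustrated with plain `.npy` files. This is a sketch under the assumption that pages are stored as NumPy `.npy` arrays; `read_rows` and the file names are hypothetical. NumPy can memory-map a fixed-width array and read only the requested rows, but it refuses to memory-map an array with object dtype, so the whole page has to be unpickled first.

```python
import os
import tempfile

import numpy as np


def read_rows(path, rows):
    """Fetch selected rows from a .npy file, memory-mapping when possible."""
    try:
        # Fixed-width dtype: rows sit at known byte offsets, so this is lazy.
        page = np.load(path, mmap_mode="r")
        mode = "mmap"
    except ValueError:
        # Object dtype cannot be memory-mapped: unpickle the entire page.
        page = np.load(path, allow_pickle=True)
        mode = "full load"
    return np.asarray(page[rows]), mode


with tempfile.TemporaryDirectory() as tmp:
    fixed_path = os.path.join(tmp, "fixed.npy")
    object_path = os.path.join(tmp, "objects.npy")
    np.save(fixed_path, np.arange(10, dtype=np.int64))
    np.save(object_path, np.array([{"id": i} for i in range(10)], dtype=object))

    fixed_rows, fixed_mode = read_rows(fixed_path, [4, 2, 3, 1])
    object_rows, object_mode = read_rows(object_path, [4, 2, 3, 1])

print(fixed_mode, fixed_rows)
print(object_mode, object_rows)
```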

root-11 commented 10 months ago

Ok. So the advice is "keep datatypes simple if you want speed."

root-11 commented 9 months ago

Solved.