Closed root-11 closed 9 months ago
I'm not exactly sure what you would use re-indexing for if the data isn't loaded, unless you mean that you already have it re-indexed based on some other criteria and just want to blindly select the values from the array.
If that's the case, then no, it cannot be improved as it stands, because the pickle format doesn't allow random access. However, the Nim implementation should definitely be faster. I implemented the unpickler fully in Nim, whereas Python's unpickler is written in native Python and is not a C binding, which makes it slow. I haven't benchmarked it against the Python implementation, but I'm pretty sure it would be faster, and there are places where it could be made faster still.
...re-indexing for if it's not loaded...
When I have an index and need to re-arrange fields in the right table during a join, the actual value that is being re-ordered doesn't matter. It only matters that the values are put in the right order.
For example, in a join where the right-side index is [4,2,3,1], all fields on all pages would have to be re-ordered to match the order dictated by the index. So it doesn't matter whether it's a struct, an int, or whatever, as long as the page that is output contains the bytes in the correct order.
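To make the idea concrete, here is a minimal sketch of what I mean, assuming the page is a NumPy array with a fixed-width dtype. The function name `reorder_fixed_width` is made up for illustration and is not part of any existing API, and the index in the usage note is a 0-based variant of the one above.

```python
import numpy as np


def reorder_fixed_width(page: np.ndarray, index) -> np.ndarray:
    """Reorder the items of a 1-D fixed-width page without interpreting them.

    Hypothetical helper: every item is treated as an opaque run of
    `itemsize` bytes, so ints, floats and structs all take the same path.
    """
    if page.dtype.hasobject:
        # Object pages store pointers to Python objects, not the values
        # themselves, so shuffling raw bytes is not meaningful for them.
        raise TypeError("byte-level reordering needs a fixed-width dtype")

    itemsize = page.dtype.itemsize
    contiguous = np.ascontiguousarray(page)

    # One row of `itemsize` raw bytes per item.
    raw = contiguous.view(np.uint8).reshape(len(contiguous), itemsize)

    # Fancy indexing copies the rows into the requested order.
    shuffled = raw[np.asarray(index)]

    # Reinterpret the reordered bytes with the original dtype.
    return shuffled.reshape(-1).view(page.dtype)
```

Calling it with `np.array([3, 1, 2, 0])` on an int64 page and on a struct page goes through exactly the same code; only `itemsize` differs.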
Then no, it is not possible with the pickle format. Re-ordering requires random access, and since the pickled bytes are not aligned, you're forced to read the entire page. It requires reading the entire file, so the only saving grace would be speeding up the reading process.
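For illustration, the difference can be sketched with plain NumPy, assuming pages are stored as `.npy` files. A fixed-width page can be memory-mapped and indexed at random, while an object page is pickled inside the file and NumPy refuses to memory-map it. The helper names `take_rows` and `can_random_access` are hypothetical.

```python
import os
import tempfile

import numpy as np


def take_rows(path, index):
    """Gather items from a fixed-width .npy page via a read-only memory map."""
    page = np.load(path, mmap_mode="r")
    # Fancy indexing copies just the requested items out of the mapping.
    return np.array(page[index])


def can_random_access(path):
    """Return True if the .npy page can be memory-mapped."""
    try:
        np.load(path, mmap_mode="r")
        return True
    except ValueError:
        # NumPy raises ValueError when the dtype contains Python objects.
        return False


with tempfile.TemporaryDirectory() as tmp:
    int_path = os.path.join(tmp, "ints.npy")
    obj_path = os.path.join(tmp, "objs.npy")
    np.save(int_path, np.arange(10, dtype=np.int64) * 10)
    np.save(obj_path, np.array(["a", 1, None, 2.5], dtype=object))

    reordered = take_rows(int_path, [3, 1, 2, 0])
    int_ok = can_random_access(int_path)
    obj_ok = can_random_access(obj_path)
```

In this sketch `int_ok` comes out true and `obj_ok` false, which is the same limitation described above: the object page has to be unpickled front to back before anything can be selected from it.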
Ok. So the advice is "keep datatypes simple if you want speed."
Solved.
In the overview below (t)ime, stdev and count (of pages):
Note the second-to-last record: dtype('O') takes 3.61 sec / page!
My conclusion is that we should use type 'O' only as a last resort.
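One way to act on that conclusion, as a rough sketch only: before writing a page, check whether all values share a simple Python type and fall back to dtype('O') only when they don't. The function `simplest_page` and its rules are assumptions for illustration, not how the library currently decides.

```python
import numpy as np


def simplest_page(values):
    """Return a page with a fixed-width dtype if possible, else dtype('O').

    Hypothetical helper; assumes `values` is a flat list of Python scalars.
    """
    as_objects = np.empty(len(values), dtype=object)
    as_objects[:] = values

    kinds = {type(v) for v in values}
    if not kinds:
        return as_objects  # nothing to infer from an empty page

    try:
        if kinds <= {bool}:
            return as_objects.astype(bool)
        if kinds <= {int}:
            return as_objects.astype(np.int64)
        if kinds <= {int, float}:
            return as_objects.astype(np.float64)
        if kinds <= {str}:
            return as_objects.astype(str)
    except OverflowError:
        # e.g. a Python int that does not fit in 64 bits
        pass

    # Mixed or unsupported types: object dtype is the last resort.
    return as_objects
```

With this, a page of `[1, 2, 3]` ends up as int64, `[1, 2.5]` as float64, and only something like `[1, "a", None]` stays as dtype('O').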
Question: @realratchet - can you imagine a way I could reindex the structs without loading them?