mcabbott / AxisKeys.jl

🎹
MIT License
149 stars 28 forks source link

Slicing with larger key vectors is slow #146

Open jtackm opened 12 months ago

jtackm commented 12 months ago

Thanks for this great package! I find myself regularly slicing large KeyedArray matrices with large vectors of string keys (about 10% of the matrix). This is unfortunately currently slow:

using Random, AxisKeys, BenchmarkTools
A = KeyedArray(zeros(100000, 100), sid=["S$i" for i in 1:100000], oid=["O$i" for i in 1:100])
sub_sids = rand(axiskeys(A, 1), 10000)
A_sub_slow = @btime A[Key(sub_sids), :] 
# run time: 6.156 s

On my real data it can take minutes, which is why I regularly find myself using an indexin workaround:

A_sub_fast = @btime begin
    sub_sids_idx = indexin(sub_sids, axiskeys(A, 1))
    A[sub_sids_idx, :]
end 
# run time: 26.203 ms

@assert A_sub_slow == A_sub_fast

The slow method becomes much faster when slicing the matrix at the beginning (sub_sids = axiskeys(A, 1)[1:10000], 255.530 ms) and much slower when slicing at the end (sub_sids = axiskeys(A, 1)[end-10000:end], 18.904 s). The fast method is faster in all scenarios: 12.335 ms (slice beginning), 15.037 ms (slice end).

Only when performing small slices (100 elements) at the beginning can I see advantages of the default method (36.438 μs vs 77.185 μs). I may be missing something, but could the current method perhaps make use of indexin internally?

aplavin commented 12 months ago

As noted in the README, AxisKeys uses whatever search method your axiskeys arrays provide (as findfirst method). Regular arrays do linear search in their findfirst, that's why it is slow for many indices. AxisKeys is composable though, so you can use any array type with faster search:

julia> using UniqueVectors

julia> A = KeyedArray(zeros(100000, 100), sid=UniqueVector(["S$i" for i in 1:100000]), oid=["O$i" for i in 1:100])

julia> @btime A[Key(sub_sids), :]
  4.374 ms (25 allocations: 7.78 MiB)

I guess it could do indexin automatically in these cases, but for now it always does findfirst.

jtackm commented 12 months ago

Oh wow, that's a great feature! This capability didn't become apparent to me when reading the README, but maybe that's just me? I had tried with Set at one point (as a blind guess) in hope of faster lookups. Anyways, thanks a lot for the pointer.