linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License

ds.scan is slow #89

Closed astaric closed 5 years ago

astaric commented 5 years ago

I had to process all rows in a loom file (108999 rows, 11930 columns) without loading the whole file into memory.

I timed the execution of the following code:

def loompyscan(ds):
    for i, _, view in ds.scan(axis=0, layers=[""], batch_size=10000):
        ngt = view[:, :]
        print(i)

and got

21.2 s ± 612 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

while doing the partitioning manually and accessing the layer directly using slices:

def myscan(ds):
    STEP = 10000
    for i in range(ds.shape[0] // STEP):
        ngt = ds[i * STEP:(i + 1) * STEP, :]
        print(i)

takes

7.2 s ± 77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
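One caveat with the manual loop: `range(ds.shape[0] // STEP)` drops the final partial batch whenever the row count is not a multiple of `STEP` (here 108999 % 10000 = 8999 rows are never visited). A tail-safe variant of the same chunking pattern, sketched on a plain NumPy array rather than a loom file so it is self-contained:

```python
import numpy as np

def chunked_rows(matrix, step=10000):
    """Yield (start, block) pairs covering every row, including the final partial batch."""
    n = matrix.shape[0]
    for start in range(0, n, step):  # stepping the range directly never skips the remainder
        yield start, matrix[start:start + step, :]

# Small demonstration: 25 rows with step 10 give blocks of 10, 10 and 5 rows
m = np.arange(50).reshape(25, 2)
sizes = [block.shape[0] for _, block in chunked_rows(m, step=10)]
print(sizes)  # -> [10, 10, 5]
```

The same `range(0, n, step)` idiom applies directly to `ds[start:start + step, :]` when iterating over a loom file.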

slinnarsson commented 5 years ago

ds.scan() does quite a bit more than just scan the main matrix. For example, it also slices through all the attributes and graphs, reorders the result according to the key attribute, and supports scanning through only a subset of the rows/columns. That said, a 3x slowdown seems a lot. If you can figure out why it's slower, please send a pull request!

One issue might be that each view is reordered even when no key was provided, which is unnecessary. Similarly, even when no selection (items) was requested, the code still performs the selection on every slice. Both could be skipped in the common case of scanning the whole file with no selection and no reordering, at the expense of making the code a bit messier.
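The fast path being described could look roughly like this. This is a sketch of the idea, not loompy's actual code; `key_order` and `items` are hypothetical names for the permutation and selection arrays:

```python
import numpy as np

def materialize_view(chunk, key_order=None, items=None):
    """Apply reordering/selection to one scanned batch only when actually requested.

    chunk: 2-D array holding one batch of the scan
    key_order: optional row permutation (None = keep file order)
    items: optional column selection (None = keep all columns)
    """
    if key_order is not None:   # skip the fancy-index copy when no key was given
        chunk = chunk[key_order, :]
    if items is not None:       # skip the selection copy when everything is wanted
        chunk = chunk[:, items]
    return chunk                # common case: returned untouched, zero extra copies

chunk = np.arange(12).reshape(3, 4)
print(materialize_view(chunk) is chunk)             # -> True (no copy on the fast path)
print(materialize_view(chunk, items=[0, 2]).shape)  # -> (3, 2)
```

The design point is simply that NumPy fancy indexing always allocates a copy, so guarding it behind `is not None` checks turns the no-key, no-selection scan into a pass-through.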

astaric commented 5 years ago

After the change in #90, scan takes

8.68 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

slinnarsson commented 5 years ago

Fixed by PR #90