natverse / fafbseg

Support functions for analysis of Drosophila connectomes especially the FAFB-FlyWire whole brain
https://natverse.org/fafbseg/
GNU General Public License v3.0
6 stars 4 forks source link

consider caching expensive flywire id lookups #67

Open jefferis opened 3 years ago

jefferis commented 3 years ago

rootid->supervoxels

this can be cached essentially permanently since rootids mutate whenever the mapping changes

however they are quite large, so the object store could quickly start taking a lot of space. For example this fairly large neuron

> system.time(fl <- flywire_leaves('720575940619073968'), gcFirst = F)
   user  system elapsed 
  0.278   0.074   4.431 

generates ~ 270000 svids occupying ~ 20 Mb as character vector. This could be reduced to ~ 0.7 Mb by compression of binary data.

jefferis commented 3 years ago

Have taken a look at this. It makes sense to compress the data that is cached. Switching from character to compressed int64 reduces space by ~30x (10x for int64, 3x for compression). gzip is the best general purpose compression and adds about 5% penalty when there is a cache miss and is very fast decompressing. I found that brotli was actually a small improvement but only with a non-default quality=2; this took very slightly longer to decompress but compressed faster and smaller and I think on balance that makes it the winner. The brotli package is maintained by Jeroen Ooms, so although it is another dependency, I eventually opted to add it to suggests and make it the default if available.

For the data compression step specifically for flywire_leaves('720575940619073968', cache = TRUE) which typically takes about 4s:

  expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
  <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
1 brotli     18.4ms 19.4ms      50.0  693.54KB    0.251   199     1      3.98s <NULL> <Rpro… <bch…
2 gz         94.3ms 96.8ms      10.2    3.06MB    0.525    39     2      3.81s <NULL> <Rpro… <bch…

For, brotli(q=2) for :

bench::mark(brotli=flywire_leaves('720575940619073968', integer64 = TRUE, cache = TRUE, compression="brotli"), min_time = 5)

the cache read time was ~10 vs 15ms for gzip vs brotli, which was actually dwarfed by the character conversion time (~75ms) when character output was needed.

jefferis commented 3 years ago

The current default is 5000 items in the lru cache. I wonder if it would be wort making it smaller (1000?), but adding an option that could make it larger for expert use when there is plenty of memory available.

jefferis commented 3 years ago

Caching in action:

> kcsel=c("720575940623755722", "720575940609992371", "720575940625494549",
+ "720575940619442047", "720575940620517656", "720575940609793429",
+ "720575940617265029", "720575940631869024", "720575940637441955",
+ "720575940638892789")
> kcpreds=flywire_ntpred(kcsel)
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=32s  
> kcpreds2=flywire_ntpred(kcsel)
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s  
jefferis commented 3 years ago

Should maybe consider cacheing to disk. Likely still very fast with a good backend and probably quite useful since these mappings are permanent. Also maybe it's better not to risk filling memory. cachem may be a good option now since it is the default for memoise (although not 100% certain it goes back to R 3.5).

jefferis commented 3 years ago

cachem installs ok on linux under R 3.5 and the cache_layered option seems like it would be a nice fit because when keys are set , they are set in all layers (i.e. on disk as well as memory) so this should give a persistent store on disk as well as a fast store in memory.

jefferis commented 3 years ago

So far we have not been caching supervoxel id to root id lookups since they can change. However we could do this as follows:

  1. store svid -> rootid mapping using svid as key
  2. in future, check if the svid exists in the key list
  3. IF YES, 3a. fetch the stored rootid 3b. and then check if that root id has been edited using flywire_islatest 3c. if it has been edited, drop the entry from the key value store and add the svid to the list that need to be queried remotely 3d. note that in fact all keys having that value should be dropped.
  4. IF NO 4a. run remote query 4b. store the result in the key-value store
jefferis commented 3 years ago

Performance considerations:

  1. we want to persist the mapping on disk
  2. many (100s thousands) keys need to be mapped to a short value. Therefore we don't want to use a single object per cached value.
  3. we need to be able to drop keys that have a given value so we actually need fast bidirectional lookup
  4. we must store keys / values as bit64
  5. it would be better if we could use the

This feels like the standard memoise approach isn't going to work and we are going to need a standard database or specialised kv backend. Looking through the R ecosystem, these are what I can find:

jefferis commented 3 years ago

It might make more sense to update invalidated rootids with a signalling value 0 rather than drop them since that might cause a lot of churn and there is a strong chance that we will want to update all the supervoxel ids for an invalidated root id.

That leaves redux/redis which looks somewhat involved and thor which is potentially simpler, but unclear if LMDB supports the kind of reciprocal key/value lookups that we need.

All told, I wonder if a doubly indexed sqlite table might be the simplest option for this use case. The svids should be unique so that could be the primary key of the table.