scijava / scijava-common

A plugin framework and application container with built-in extensibility mechanism :electric_plug:
BSD 2-Clause "Simplified" License
87 stars 52 forks source link

Add LRU cache service #142

Open hinerm opened 9 years ago

hinerm commented 9 years ago

in https://github.com/scijava/scijava-common/issues/141 we stated the need for caching but not the method by which we would cache.

In the long term we should have a generalized LRU cache that we can use to enforce memory constraints but allow caching of items that end up being used frequently.

API considerations

Potential design: give the cache an object and the way to re-generate the object when needed... cache stores both, returns an object that then accesses the object and knows how to re-create it if needed. This seems maybe too dangerous though..?

ctrueden commented 9 years ago

Instead of "held in memory for a period of time" it would be easier to code a feature like "held in memory until the thing is retrieved at least X times."

IdealOutage commented 9 years ago

I did a little research on this matter and in my opinion the easiest way to implement a simple LRU Cache would be to use a LinkedHashMap, even the JavaDoc states:

This kind of map is well-suited to building LRU caches.

The only thing we have to do is to overwrite the removeEldestEntry to eliminate the oldest entry when the size of the map is bigger than a previously specified maximum size.

    @Override
    protected boolean removeEldestEntry(final Map.Entry<A, B> eldest) {
        return size() > maxSize;
    }

This cache wouldn't be concurrent, but this could be fixed using Collections.synchronizedMap. 90% of the other tips for implementing a cache in Java were "Use Google Guavas CacheBuilder"

But this still leaves the design decision open. Should the cache create the object if it isn't cached anymore, or should the programmer create the object himself if it isn't cached anymore? The first one would be more convenient, but in my opinion object creation isn't a task for a cache.

What do you think?

dietzc commented 9 years ago

I think this is not an "or". These are two different cache-types in my opinion. The first cache is a cache we would typically use in KNIME: "Hey, is the object still there? No? Ok, no worries we will take care about it". The second cache is what @hinerm does with the SCIFIOCellImg. So both are actually caches.

hinerm commented 9 years ago

90% of the other tips for implementing a cache in Java were "Use Google Guavas CacheBuilder"

For a cache data structure, CacheBuilder looks great; it has soft/weak refs and LRU eviction.

@ctrueden may have thoughts here but my assumption is that we won't want to include a guava dependency at the SJC level and would encapsulate the dependency in another (caching or guava-related) component. But SJC would host the CacheService interface.

I do want to clarify that this issue is about building a cache _Service_ and not a cache data structure. The service acts as a centralized bridge to a caching data structure(s). My thought was just to generalize the SCIFIOCellCache model to a pluggable caching system, where cache entries (the plugins) can choose how their object is stored (e.g. in a backup disk cache with SCIFIOCellImg) and know how to re-create their object if it has been evicted (whether by LRU or lost to GC).

dietzc commented 9 years ago

All right, if we can define a nice Service interface, then we can have our own implementation, based on whatever, of a LRUCacheService. From my point of view there should be a pure Cache which doesn't know how to recreate the object and maybe another (SmartCacheService or however you want to call it), which uses the pure Cache and additionally know how to recreate objects. Like that, we cover both use-cases without having redundant implementations.

Does this make sense?

hinerm commented 9 years ago

Well, multiple implementations of a given Service will override each other. The service can't know how to create objects, but Handler plugins to the service can. We could make a default low-priority plugin backed by CacheBuilder that returns null if the requested object's not found.

To me I think it's an advantage to keep everything in a centralized cache instead of having one smart and one not. Otherwise you would have to check each cache separately, and it becomes ambiguous what to evict when hitting memory constraints - right?

Quite possible that I'm misunderstanding something fundamental that's obvious to you and this design is bad. If so, please let me know. Or if it sounds like the cache service I'm describing won't meet your needs, can you explain your use case?

dietzc commented 9 years ago

Well, multiple implementations of a given Service will override each other.

Of course they will. What I suggest are two different interfaces for two different concepts. The first one is a Cache which doesn't know how to re-create the input, the second one is a Cache which knows hows to recreate the input. And if one wants, she can implement the second one wrapping the first one. Does this make sense?

I just don't see one unified Cache solution, because already just the two of us have different requirements to such a Cache (the two options described above).

IdealOutage commented 9 years ago

But why do we need two interfaces? To have a simple example, the two Caches should both have a get and a put, the rest is just implementation details.

public LRUCache extends Cache {
  private Map<K, V> cache;

  @Override
  public V get(K k){
    if(map.contains(k)) return map.get(k);
    return null;
  }
}
public SmartCache extends Cache {

  private LRUCache lruCache;
  private CreationStrategy creationStrategy;  

  @Override
  public V get(K k){
    V v = lruCache.get(k);
    if(v != null) return v;

    // otherwise create v from scratch
    return creationStrategy.create(k);
  }
}
hinerm commented 9 years ago

And if one wants, she can implement the second one wrapping the first one. Does this make sense?

Makes sense but I think can be simplified and more extensible if we use plugins to decide how things are stored/created.

I made a branch with a _very_ bare-bones CacheService to try and illustrate what I planned for this ticket.

The idea is that if something knows how to do fancy things with a given object to cache, it can. Even if not, everything is in a central cache.

Then you have plugins for individual data types which can decide if and how to recreate cached items.

Also, the CachePlugin could actually just be a ValueLoader so that the guava cache framework takes care of re-creating the value if needed.

Let me know if any of that is unclear, and in what ways it wouldn't meet your cache requirements.

ctrueden commented 9 years ago

my assumption is that we won't want to include a guava dependency at the SJC level and would encapsulate the dependency in another (caching or guava-related) component.

I am increasingly ambivalent. Guava brings a whole lot of richness to the table, and avoids reinventing many wheels. For example, did you know Guava has its own event bus?

For the time being, isolating the dependency elsewhere is probably the best move, to avoid disruption. But there may come a time when the dependency for SJC as a whole makes sense.

IdealOutage commented 9 years ago

@hinerm

I see one big problem with this idea, the CachePlugin will be evicted if the Cache is full. For example if we put("test", new CachePlugin("test")); into our Cache and want to retrieve the value later it could happen that cache.getIfPresent(key); will give us null if our value was evicted and then we can't recreate the value. We need to know what kind of Object we store for a given key, either we use get(String key, Class<V> valueClass) then we would be able to create a CachePlugin from the given class or we could make the CacheService generic so that it can only store one kind of object (would that even be possible? Can we create a generic CacheService if we use context.getService(CacheService.class)?) or something completely different.

A completely different topic, I would advice against is using weak values, since that would defeat the purpose of the cache imho. WeakReferences will be garbage collected as soon as the GC has the chance to do so, even if there is still room in our cache and we have plenty of free memory the objects will be evicted. I would suggest using soft values, since SoftReferences will only be garbage collected if we run out of memory or using a predefined fixed limit.

hinerm commented 9 years ago

@DanielSeebacher

I see one big problem with this idea, the CachePlugin will be evicted if the Cache is full.

Yes I agree, sorry for not fleshing out my example further. If CachePlugins are ValueLoaders we may bypass the wrapping and storing of CachePlugins completely. As-is, you're right that the Guava Cache should probably have strong refs tied to the CachePlugins - but different plugin implementations could still choose to keep weak or soft references to their underlying object.

we could make the CacheService generic so that it can only store one kind of object (would that even be possible?

This is possible - it is basically what I did with the SCIFIOCellCache. I don't think it's useful for a central cache service though. I completely agree that we would want an option for strong return typing.

since SoftReferences will only be garbage collected if we run out of memory

This isn't quite my understanding of SoftReferences. SoftReferences are guaranteed to be cleared before OutOfMemoryError is thrown - but as soon as a reference becomes softly reachable the garbage collector can do what it want.

I think the most robust option would be not to force weak or soft references, but either let the CachePlugin decide, or lean on a Weigher implementation that evicts based on memory footprint, for example.

IdealOutage commented 9 years ago

This isn't quite my understanding of SoftReferences. SoftReferences are guaranteed to be cleared before OutOfMemoryError is thrown - but as soon as a reference becomes softly reachable the garbage collector can do what it want.

@hinerm Yep you're right, we should really rely on something deterministic. Now we just need to create an interface we all can agree upon. Sadly I can't push to SciJava, so I have to show my design here, I would love some feedback.

I would keep the base interface relatively simple, e.g.

public interface CacheService extends SciJavaService {
    /**
     * Stores the given object in the cache.
     * 
     * @param key
     *            A key.
     * @param value
     *            A value.
     */
    <K, V> void put(K key, V value);

    /**
     * @param key
     *            A key
     * @return The cached object, or null if the object is not in the cache.
     */
    <K, V> V get(K key);

    /**
     * 
     * @param key
     *            A key
     * @param valueLoader
     *            A value loader which will be used if null is returned for the
     *            given key.
     * @return The cached object, or if the object is not in the cache the
     *         result of the value loader.
     * @throws ExecutionException 
     */
    <K, V> V get(K key, Callable<V> valueLoader) throws ExecutionException;
}

And the DefaultCacheService could also be kept relatively simple:

@Plugin(type = Service.class)
public class DefaultCacheService extends AbstractService implements
        CacheService {

    private Cache<Object, Object> cache;

    @Override
    public void initialize() {
        cache = CacheBuilder.newBuilder().maximumSize(100).build();
    }

    @Override
    public <K, V> void put(K key, V value) {
        cache.put(key, value);
    }

    @SuppressWarnings("unchecked")
    @Override
    public <K, V> V get(K key) {
        return (V) cache.getIfPresent(key);
    }

    @SuppressWarnings("unchecked")
    @Override
    public <K, V> V get(K key, Callable<V> valueLoader)
            throws ExecutionException {
        return (V) cache.get(key, valueLoader);
    }
}

With that we already have an extremely powerful and versatile Cache. We can use any kind of object as a key (as long as it implements equals() and hashcode()). I already tested it and it works really well, here is a working JUnit Test as an example.

/**
 * Test if different objects can be used as keys and values.
 */
@Test
public void testStorageAndRetrieval() {
    cacheService.put("test 1", 1);
    cacheService.put(2, "test 2");
    cacheService.put(new Double(3), "test 3");
    cacheService.put("test 4", new Double(4));

    int result1 = cacheService.get("test 1");
    assertNotNull("Get <String, Integer> not null", result1);
    assertEquals("Get <String, Integer> value", 1, result1);

    String result2 = cacheService.get(2);
    assertNotNull("Get <Integer, String> not null", result2);
    assertEquals("Get <Integer, String> value", "test 2", result2);

    String result3 = cacheService.get(new Double(3));
    assertNotNull("Get <Double, String> not null", result3);
    assertEquals("Get <Double, String> value", "test 3", result3);

    double result4 = cacheService.get("test 4");
    assertNotNull("Get <String, Double> not null", result4);
    assertEquals("Get <String, Double> value", 4d, result4,
            Double.MIN_VALUE);
}
ctrueden commented 9 years ago

Sadly I can't push to SciJava, so I have to show my design here, I would love some feedback.

If that is an obstacle, you can fork the repo of interest and push a topic branch and start a PR for discussion.

But discussing here is OK too. Thanks guys for pushing this forward!

hinerm commented 9 years ago

@tpietzsch and I are talking about this at Janelia

hinerm commented 8 years ago

Steps so that @dietzc can use caching in ops tomorrow:

Success.

ctrueden commented 8 years ago

Personally, I like scijava-cache more than scijava-caching...

hinerm commented 8 years ago

@dietzc is https://github.com/scijava/scijava-common/commit/5893d52da3927c7fdcd394589cc6630aba85e32e ok?