mastodon-sc / mastodon

Mastodon – a large-scale tracking and track-editing framework for large, multi-view images.
BSD 2-Clause "Simplified" License

Feature serialization and incremental updates. #104

Closed tinevez closed 5 years ago

tinevez commented 5 years ago

Dear Tobias,

Here is my latest attempt at getting the feature serialization right. For several of the projects in which Mastodon is used and that I could witness or pilot, it turned out to be an important part of the scientific workflows we want to address.

I tried to put this PR in a shape that can be recycled for the documentation or the Materials and Methods of a future paper. Yet, in its current shape it is addressed to you.

Feature serialization and incremental updates.

Mastodon's central ambition is to harness the possibly very large data generated when analyzing large images. The Mastodon Feature framework offers numerical (and non-numerical) data defined for the objects of the Mastodon-app model, which can be defined and created by third-party developers. It should follow Mastodon's ambition and facilitate harnessing large data too.

Feature serialization.

Some features can take a very long time to compute on large models, e.g. the spot Gaussian-filtered intensity. Roughly speaking, a batch of 1000 spots takes about a second to compute. The time invested in computing these feature values should not be lost when the model is reloaded in a later session, so the Mastodon-app must offer serialization of feature values along with the model.

I remember that in my first attempt you did not like that the feature computer or the feature classes themselves had information and/or methods related to serialization. To design the current solution I simply adapted the design we already have for feature computers.

Feature serialisers.

The central interface for classes that can de/serialise a feature is org.mastodon.feature.io.FeatureSerializer. It is typed with the feature type and feature target type (vertex or edge):

public interface FeatureSerializer< F extends Feature< O >, O > extends SciJavaPlugin

And it is a SciJavaPlugin because, of course, we want feature serializers to be discoverable by a specialized service, like feature computers are. The interface defines 3 methods:

public FeatureSpec< F, O > getFeatureSpec();

because we want to know the spec of the feature we can de/serialize. There are also de/serialization methods based on object streams:

public void serialize( F feature, ObjectToFileIdMap< O > idmap, ObjectOutputStream oos );

public F deserialize( final FileIdToObjectMap< O > idmap, final RefCollection< O > pool, ObjectInputStream ois );

Note that the deserialize method returns the exact feature type, and not a super-class. We want serializers to produce an exact feature of the right class. So we will need one feature serializer for every feature we define. This means we are not able to write generic serializers for generic features.

Here is how a serializer looks like for a simple feature like SpotNLinksFeature:

@Plugin( type = SpotNLinksFeatureSerializer.class )
public class SpotNLinksFeatureSerializer implements FeatureSerializer< SpotNLinksFeature, Spot >
{

    @Override
    public Spec getFeatureSpec()
    {
        return SpotNLinksFeature.SPEC;
    }

    @Override
    public void serialize( final SpotNLinksFeature feature, final ObjectToFileIdMap< Spot > idmap, final ObjectOutputStream oos ) throws IOException
    {
        final IntPropertyMapSerializer< Spot > propertyMapSerializer = new IntPropertyMapSerializer<>( feature.map );
        propertyMapSerializer.writePropertyMap( idmap, oos );
    }

    @Override
    public SpotNLinksFeature deserialize( final FileIdToObjectMap< Spot > idmap, final RefCollection< Spot > pool, final ObjectInputStream ois ) throws IOException, ClassNotFoundException
    {
        final IntPropertyMap< Spot > map = new IntPropertyMap<>( pool, -1 );
        final IntPropertyMapSerializer< Spot > propertyMapSerializer = new IntPropertyMapSerializer<>( map );
        propertyMapSerializer.readPropertyMap( idmap, ois );
        return new SpotNLinksFeature( map );
    }
}

Nothing special; we reuse the property map serializers you made for the model serialization. Note however that the serializer must have access to said property map for serialization (the final field map has default visibility) and to a constructor that accepts a property map for deserialization (also with default visibility). Note the @Plugin annotation; this is how the feature serializer service will pick it up.

This way we resemble the FeatureComputer framework a lot. The cool thing is that a feature does not have to know it is serializable. And I could make most of the Mastodon-app features serializable without modifying the feature classes. A feature with a computer and a serializer now looks like this in Eclipse:

(Screenshot: SpotNLinksEclipseDeclaration — the feature, its computer and its serializer in Eclipse.)

The feature serializer service.

Again, we emulate what we have for the feature computation, but simpler. There is a FeatureSerializationService interface that has a default implementation in DefaultFeatureSerializationService. Both of them are generic in terms of target object class.

The interface defines a single method

public FeatureSerializer< ?, ? > getFeatureSerializerFor( final FeatureSpec< ?, ? > spec );

that returns a feature serializer for a given feature specification. That's it.
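For instance, a serializer could be looked up like this (a minimal sketch; in the Mastodon-app the existing SciJava Context would be used rather than a new one):

// Minimal usage sketch, not code from this PR.
final Context context = new Context();
final FeatureSerializationService serializationService = context.getService( FeatureSerializationService.class );
final FeatureSerializer< ?, ? > serializer = serializationService.getFeatureSerializerFor( SpotNLinksFeature.SPEC );

If no serializer is registered for a given spec, the feature is simply skipped (this is how unknown features are handled when loading a project, see below).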

Feature serialization in Mastodon-app.

Now, let's orchestrate serialization and deserialization of features for the Mastodon-app, along with the Model serialization.

Serialization.

I modified the ProjectWriter interface in the MamutProject class so that it now has a new method:

OutputStream getFeatureOutputStream( String featureKey ) throws IOException;

This method should return a new output stream for the feature with the specified key. In practice, both the folder and the zip versions of the writer create a features folder and store the feature data in files named after the feature key, with .raw appended. For instance, after serialization you will find the following in a .mastodon zip file or in a project folder:

$ ls -1
    features/
    model.raw
    project.xml
    tags.raw
$ ls -1 features/
    Link displacement.raw
    Link velocity.raw
    Spot N links.raw
    Spot gaussian-filtered intensity.raw
    Spot track ID.raw
    Track N spots.raw

The actual serialization logic happens in the class MamutRawFeatureModelIO and requires, among other arguments, the SciJava context and the GraphToFileIdMap< Spot, Link > (see below).

Because of these arguments, this serialization method is called in the ProjectManager class, in the saveProject(File projectRoot) method.

Also, because we need the GraphToFileIdMap< Spot, Link >, I changed the model.saveRaw( writer ) method to return it instead of void. I could not find a way to call the feature serialization logic directly in this method, mainly because we need the SciJava Context, which is not a field of the Model class. Another possibility would be to pass the FeatureSerializationService to the saveRaw() method.
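To make the orchestration concrete, here is a rough sketch of how a single feature could be written through the ProjectWriter. The variable names projectWriter, feature and spotIdMap are made up for the example; the actual code lives in MamutRawFeatureModelIO:

// Hedged sketch: write the "Spot N links" feature values through the ProjectWriter.
// 'feature' is the SpotNLinksFeature instance from the feature model, 'spotIdMap' an
// ObjectToFileIdMap< Spot > obtained from the GraphToFileIdMap returned by model.saveRaw().
final SpotNLinksFeatureSerializer serializer = new SpotNLinksFeatureSerializer();
try ( final ObjectOutputStream oos = new ObjectOutputStream( projectWriter.getFeatureOutputStream( "Spot N links" ) ) )
{
    serializer.serialize( feature, spotIdMap, oos );
}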

Deserialization.

The deserialization happens in a similar way, but involves extra logic.

First we need to know what feature to deserialize. So the ProjectReader interface has a new method

Collection< String > getFeatureKeys();

that returns the keys of the features saved in the project. From each of these keys, the ProjectReader can generate an input stream with the method

InputStream getFeatureInputStream( String featureKey ) throws IOException;

The deserialization logic also happens in the MamutRawFeatureModelIO class and is also called from the ProjectManager class. Here is how (a minimal sketch follows the list):

  1. First we get a FeatureSerializationService.
  2. From the ProjectReader, we get the list of feature keys to deserialize.
  3. Then we get a FeatureSpecsService to retrieve feature specs from feature keys.
  4. For each of the feature keys in the project, we try to get a feature spec.
  5. If we can, we try to get a FeatureSerializer for the spec.
  6. If we can, we get the target class of the feature. Depending on whether it is a Spot feature or an Edge feature, we pass the correct FileIdToObjectMap< O > instance to the serializer.
  7. It returns the deserialized instance of the right class, that we declare in the FeatureModel.

That's it.
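For spot features, the whole loop could look roughly like this. It is a sketch with assumed variable names (reader is the ProjectReader, specsService a FeatureSpecsService assumed to map keys to specs, serializationService the FeatureSerializationService, spotIdMap the FileIdToObjectMap< Spot > obtained when deserializing the model), not the literal code of MamutRawFeatureModelIO:

// Hedged sketch of the deserialization loop, for spot features only.
for ( final String key : reader.getFeatureKeys() )
{
    final FeatureSpec< ?, ? > spec = specsService.getSpec( key );
    if ( null == spec || !Spot.class.equals( spec.getTargetClass() ) )
        continue; // Unknown feature, or not a spot feature: skip it here.

    @SuppressWarnings( "unchecked" )
    final FeatureSerializer< ?, Spot > serializer =
            ( FeatureSerializer< ?, Spot > ) serializationService.getFeatureSerializerFor( spec );
    if ( null == serializer )
        continue; // No serializer registered for this feature.

    try ( final ObjectInputStream ois = new ObjectInputStream( reader.getFeatureInputStream( key ) ) )
    {
        // Deserialize with the vertex id-map, then declare the feature in the feature model.
        final Feature< Spot > feature = serializer.deserialize( spotIdMap, model.getGraph().vertices(), ois );
        model.getFeatureModel().declareFeature( feature );
    }
}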

Example.

The file org.mastodon.mamut.feature.SerializeFeatureExample in src/test/java gives an example of serialization / deserialization of features.

Incremental updates during feature computation.

Serializing features is a good first step towards conveniently harnessing large data. Thanks to serialization, the time spent on feature computation is not lost when we save and reload the data. However, it is not enough on its own.

Mastodon is not limited to being a viewer of large data; it allows editing the model at any scale. You can run the detection and linking algorithms and create a large amount of data. But you can also edit single spots, for instance change the position of one spot within a model made of several million of them.

The possibility to edit single vertices or edges - or point-wise editing - in Mastodon creates a challenge for feature computation. Contrary to TrackMate, the Mastodon-app does not keep the features in sync with point-wise editing. Feature computation must be triggered manually by the user. This is a choice we made based on our experience with TrackMate, which becomes much less responsive when editing large models. Therefore, as soon as we make an edit, the feature values become invalid until the user recomputes them. This is fine, but if the model is very large, we will then need to spend a large amount of time on feature computation, while possibly only a small number of objects have changed. The feature incremental update mechanism aims at solving this problem.

To rely on incremental updates, a feature computer has to declare a dependency on a special feature that returns the collection of vertices or edges that changed since the last time the feature was computed. This way, it can process only these objects and not the full model.

I give some details of the incremental update mechanism below, in reverse order of how things work:

  1. First, how it is used in feature computers that want to use incremental updates.
  2. A word of caution about using incremental feature updates with features that can be serialized.

These first two paragraphs are enough for developers who want to implement their own feature computer based on incremental updates. The following two give information about how the incremental update mechanism works in the Mastodon-app.

  3. How the update stack objects are built when the user edits the model and triggers feature computation.

  4. How the Mastodon-app wires listeners for model changes to the update stacks to build them properly.

Using incremental updates in a feature computer.

Roughly speaking, a feature computer that supports incremental updates looks like the following. It has a dependency on a special input, declared with the SciJava @Parameter annotation:

@Parameter
private SpotUpdateStack stack;

This update stack contains the collection of spots that changed since the last feature computation. A similar class exists for links. To get the changes relevant for our feature, we call:

Update< Spot > changes = stack.changesFor( FEATURE_SPEC );

where FEATURE_SPEC is the feature specification object. If the value of changes is null, then the feature values must be recomputed for the full model. Otherwise, the changes can be retrieved and used for computation. The Update class has two main public methods:

public RefSet< O > get();

public RefSet< O > getNeighbors();

The get() method returns the collection of objects that were modified (created, moved or edited). The getNeighbors() method returns the collection of the neighbors of the objects that were modified. For instance, if you move a spot, the spot itself ends up in the modified collection of the spot update stack, and its incoming and outgoing links end up in the neighbor collection of the link update stack.

It is up to the feature computer to decide how to use these collections to recompute values. For instance, a generic feature computer that uses the incremental computation mechanism for spots could work like this:

@Override
public void run()
{
  // Query the changes since this feature was last computed.
  final Update< Spot > changes = stack.changesFor( SPEC );
  // If null, the feature must be recomputed for the full model.
  final Iterable< Spot > vertices = ( null == changes )
    ? model.getGraph().vertices()
    : changes.get();

  // Compute values only for the spots that need it.
  for ( final Spot s : vertices )
  {
    final double val = valueFor( s ); // hypothetical per-spot computation
    output.map.set( s, val );
  }
}

The SpotGaussFilteredIntensityFeatureComputer is an example of such a feature computer (its logic is a bit more complicated because it sorts the spots per frame before computation).

Incremental update and feature serialization.

Note that feature computers that use incremental feature updates must always operate on the same feature instance. Otherwise, the feature exposed in the feature model would only contain values for the last incremental update.

So special care must be taken when using incremental updates on features that can be serialized. The feature deserialization will produce a new instance of the feature, and the feature computer #createOutput() method can do so too, resulting in a conflict. For this reason, it is wise, in the feature computer #createOutput() method, to check whether an instance of the desired feature already exists in the feature model.

For instance in the SpotGaussFilteredIntensityFeatureComputer we have:

@Override
public void createOutput()
{
    if ( null == output )
    {
        // Try to get it from the FeatureModel, if we deserialized a model.
        final Feature< ? > feature = model.getFeatureModel().getFeature( SpotGaussFilteredIntensityFeature.SPEC );
        if ( null != feature )
        {
            output = ( SpotGaussFilteredIntensityFeature ) feature;
            return;
        }

        // Otherwise create a new one.
        // ...
        output = new SpotGaussFilteredIntensityFeature( means, stds );
    }
}

The same caution must be applied to features that are not computed (with a FeatureComputer) but are updated elsewhere, in other processes. For instance the DetectionQualityFeature (in mastodon-tracking) keeps track of the quality value of the spots created by a SpotDetectorOp. It is serializable and therefore has a static method that works similarly:

public static final DetectionQualityFeature getOrRegister( final FeatureModel featureModel, final RefPool< Spot > pool )
{
  final DetectionQualityFeature feature = new DetectionQualityFeature( pool );
  final DetectionQualityFeature retrieved = ( DetectionQualityFeature ) featureModel.getFeature( feature.getSpec() );
  if ( null == retrieved )
  {
    featureModel.declareFeature( feature );
    return feature;
  }
  return retrieved;
}

Building an incremental update stack.

In this paragraph we explain how the SpotUpdateStack and LinkUpdateStack are built as feature computation happens. We assume that they are wired to listeners that update them with model changes, and we will describe how it is done in the next paragraph.

A difficulty in returning the right changes for a feature computer is that, in the Mastodon-app, the user is free to select which features they want to have computed. So the feature model can be up-to-date for some features and not for others. So when we call

Update< Spot > changes = stack.changesFor( FEATURE_SPEC );

in the feature computer, the update stack must return the collection of spots that were changed or added in the model since the last time the feature with specs FEATURE_SPEC was calculated. For instance, consider a model made of only two spots s1 and s2, for which two features named A and B can be computed, both using the incremental update mechanism:

(Figure: building incremental updates.)

The update stack is the core component of the incremental update mechanism. It is made of a stack of update items. Each update item works like a map with a collection of feature specs as key, and, as value, a collection of graph objects (vertices for the SpotUpdateStack) that were modified or added since those features were calculated. In reality it is more of a Pair than a Map, but I use this vocabulary here.
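Conceptually, one update item could be pictured like this (the class and field names are made up for the illustration; the actual class is UpdateState):

// Conceptual sketch of one update item, not the actual UpdateState code.
class UpdateItemSketch< O >
{
    // The specs of the features that were just computed when this item was pushed (the "key").
    Set< FeatureSpec< ?, O > > computedFeatures = new HashSet<>();

    // The objects modified (and their neighbors) since this item was pushed (the "value").
    Update< O > changes;
}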

In the example described in the drawing above, at t0 the features are not computed yet. The update stack is initialized with a single item with an empty key and an empty value.

At t1, the user triggers a computation of both features A and B. Because none of the feature specs for A and B can be found in the update stack, the #changesFor( FEATURE_SPEC ) method returns null, which triggers a computation of the feature values for all the spots of the model.

At t2, once the computation is finished, a new update item is pushed on the stack. It is initialized with an empty collection as value, and the specs of the features that were calculated (A and B) are stored as its key.

At t3 the user moves the spot s1. Because listeners are wired to the update stack, s1 is added to the value collection of the top item in the stack, that is, the one with the A and B specs as key that was created after the computation of A and B. Both features A and B are now marked as not up-to-date.

At t4 the user triggers the computation of feature A only. Since the feature computer for A uses incremental feature updates, it queries the changes for its feature. The call to #changesFor( A ) walks down the stack from the top, accumulating the value collections of the items it traverses, and stops at the first item whose key contains the A spec. Here, the top item has A in its key and s1 in its value, so the changes for A contain only s1 and the feature is recomputed for s1 only.

At t5, after computation, a new update item is added to the update stack. Since we computed only A, its key contains only A specs. As before, the item is initialized with an empty collection as value. A is marked up-to-date.

At t6 the user moves the spot s2. As before, it is added to the value collection of the top item in the stack. This time, it is the one with A specs as key.

Now at t7 the user wants to compute feature B, which has not been up-to-date since t2. Again, the feature computer for B uses incremental feature computation. The call to #changesFor( B ) walks down the stack: the top item (key A) does not contain B, so its value (s2) is collected; the next item down has B in its key, so its value (s1) is collected and the walk stops. The changes for B therefore contain s1 and s2, and B is recomputed for these two spots only.

After this computation (t8) a new update item is pushed to the update stack, with B specs as key.

And it goes on like this. If, after the steps exemplified here, the user recomputed all features, the changes for B would be empty, and the changes for A would be built by iterating down to the second item in the stack, which contains only s2.

The stack itself has a limited capacity. It can store 10 update items; after that, the oldest items are discarded. This results in triggering a full computation for 'forgotten' feature updates.
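To summarize the walk described above, #changesFor( spec ) could be sketched like this. The empty Update constructor, mergedWith() and the UpdateItemSketch class from the previous sketch are assumptions for the illustration, not the exact UpdateStack API:

// Hedged sketch of the changesFor() walk over the stack of update items.
public Update< O > changesFor( final FeatureSpec< ?, O > spec )
{
    Update< O > accumulated = new Update<>(); // assumed empty-update constructor
    // Walk from the most recent item down to the oldest one.
    for ( final UpdateItemSketch< O > item : stack )
    {
        // Collect everything that was modified after this item was pushed.
        accumulated = accumulated.mergedWith( item.changes );
        // The spec was up to date when this item was pushed: stop here.
        if ( item.computedFeatures.contains( spec ) )
            return accumulated;
    }
    // The feature was never computed, or its updates were forgotten: full recomputation.
    return null;
}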

Registering the incremental update in feature computation.

The following describes how we provide the changes to the feature computers. We describe it specifically for the Mastodon-app (Spot and Link).

When the MamutFeatureComputerService receives the Model instance to operate on, first the update stacks are created:

FeatureModel featureModel = model.getFeatureModel();
SpotUpdateStack spotUpdates = SpotUpdateStack.getOrCreate( featureModel, graph.vertices() );
LinkUpdateStack linkUpdates = LinkUpdateStack.getOrCreate( featureModel, graph.edges() );

They are created via static methods getOrCreate(), and we will explain why later.

We then add several listeners to the graph:

ModelGraph graph = model.getGraph();
graph.addGraphListener( GraphFeatureUpdateListeners.graphListener( spotUpdates, linkUpdates, graph.vertexRef() ) );
// Listen to changes in spot properties.
SpotPool spotPool = ( SpotPool ) graph.vertices().getRefPool();
PropertyChangeListener< Spot > vertexPropertyListener = GraphFeatureUpdateListeners.vertexPropertyListener( spotUpdates, linkUpdates );
spotPool.covarianceProperty().addPropertyChangeListener( vertexPropertyListener );
spotPool.positionProperty().addPropertyChangeListener( vertexPropertyListener );

These listeners are defined in GraphFeatureUpdateListeners and consist of listeners that feed changes of the graph to the two update stacks. For instance, the one that listens to vertex properties is as follows:

private static final class MyVertexPropertyChangeListener< V extends Vertex< E >, E extends Edge< V > > implements PropertyChangeListener< V >
{

    private final UpdateStack< V > vertexUpdates;

    private final UpdateStack< E > edgeUpdates;

    public MyVertexPropertyChangeListener( final UpdateStack< V > vertexUpdates, final UpdateStack< E > edgeUpdates )
    {
        this.vertexUpdates = vertexUpdates;
        this.edgeUpdates = edgeUpdates;
    }

    @Override
    public void propertyChanged( final V v )
    {
        vertexUpdates.addModified( v );
        for ( final E e : v.edges() )
            edgeUpdates.addNeighbor( e );
    }
}

This ensures that the two update stacks will receive the changes.

Also, the update stack class that is the super class for the spot and link update stacks implements Feature< O >:

public abstract class UpdateStack< O > implements Feature< O >

This will be important for serialization, as we will see below. It also has the advantage that we do not have to do anything special to provide it to feature computers. Since it is a feature, it will be provided to feature computers that declare it as a dependency, like any other feature.

Serialization of the incremental update state.

The incremental update mechanism leads to new programming challenges along with feature serialization.

Indeed, the feature computation is triggered by the user on demand. So the feature values might not be up-to-date with the model when we serialize them. In such cases, the features have to be recomputed for the whole model after reloading. This voids the advantage brought by incremental feature computation: we lose the benefit of recomputing features only for the spots and links that have been modified, and the time spent on computing is lost after saving the model.

The solution to this is evident: we have to serialize the update stacks along with the model and the feature values. This is what is done currently when the model is saved. We give here details about how this happens.

Update stacks objects are features.

SpotUpdateStack and LinkUpdateStack both inherit from UpdateStack< O >, which implements Feature< O >. So these objects are features. They have however 0 feature projections, cannot be used in a feature color mode, and are not displayed in the data table. They are features because this makes it very convenient to serialize them along with the other features and to hand them to feature computers as dependencies: both have matching serializers that inherit from UpdateStackSerializer< F, O >.

We could give them meaningful projections. For instance - and that would be helpful for debugging - we could have one projection per update item in the stack that returns an int stating whether a spot or link is marked as changed, as the neighbor of a changed object, or as not changed. But for now, they are stowaway features of the feature model.

Serialization of update stack objects.

Because they are features, we just need to provide an implementation of FeatureSerializer for them, which will be handled automatically by the feature serialization service described in the first section of this document.

There exists for convenience an abstract class UpdateStackSerializer:

public abstract class UpdateStackSerializer< F extends UpdateStack< O >, O > implements FeatureSerializer< F, O >

It handles the serialization of any subclass. The UpdateStack has a single field to serialize: the stack of update items itself, which is serialized item by item, in order.

A project saved with update stacks looks like this on disk:

$ ls -1 features/
    Link displacement.raw
    Link velocity.raw
    Spot N links.raw
    Spot gaussian-filtered intensity.raw
    Spot track ID.raw
    Track N spots.raw
    Update stack Link.raw
    Update stack Spot.raw

Deserialization of update stack objects.

Deserialization happens in reverse, but because concrete serializers have to return a new instance of the right UpdateStack implementation class, UpdateStackSerializer is abstract and instead offers a method:

protected SizedDeque< UpdateState< O > > deserializeStack(...)

which returns a new stack of update items that can be used to instantiate the right class.
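For instance, the deserialize method of a concrete serializer could look roughly like this. The arguments passed to deserializeStack and the SpotUpdateStack constructor shown here are assumptions made for the illustration:

// Hedged sketch of a concrete serializer reusing the abstract helper.
@Override
public SpotUpdateStack deserialize( final FileIdToObjectMap< Spot > idmap, final RefCollection< Spot > pool, final ObjectInputStream ois ) throws IOException, ClassNotFoundException
{
    // Read the stack of update items back, then rebuild the concrete update stack class.
    final SizedDeque< UpdateState< Spot > > stack = deserializeStack( idmap, pool, ois ); // assumed arguments
    return new SpotUpdateStack( pool, stack ); // assumed constructor
}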

In the update items we use FeatureSpecs as keys. Notice that we do not serialize nor deserialize the true class of each FeatureSpec, but a generic one made of the right fields (key, info, multiplicity, …). Since incremental updates only use the #equals() method of FeatureSpec, and since it is based on some of its fields, this is OK. But because this is only true for incremental updates, the FeatureSpec serialization only has package visibility.

JUnit tests.

In org.mastodon.feature.update of src/test/java there are two JUnit tests that serialize a model with pending changes for incremental feature computation, and test for proper computation after reloading the model.

tinevez commented 5 years ago

Superseded by #106