Revised Design for LocalModel

@davideast FYI

As of this SHA https://github.com/jeffbcross/offline-github/commit/ce4b68874028929b18723a73866ea9c7763e4757, the CacheModel is a factory that accepts any number of instantiated "sources" when constructing, and returns a queryable model that will read and write to the underlying sources, leaving it up to the consumer of the model to specify if some changes should only be persisted to one source, or if changes should be guaranteed to one source before propagating to another source.

Per some discussion with @IgorMinar, here is how we will change the API. These APIs may change when building a more general solution, but will provide the features we need for this Github client application. These are just my thoughts, some not very thoroughly-considered, presented here for discussion.

Call the model LocalModel instead of CacheModel
A developer should be able to write to LocalModel objects without thinking about how changes are managed between underlying sources.
Instances of LocalModel map 1:1 with a Lovefield Schema. Many instances for the same schema can co-exist for different parts of an application. (No change from existing implementation)
LocalModel will still represent a collection of "flat" objects
The LocalModel will use Lovefield implicitly, not requiring it to be passed in to the factory as an argument.
The LocalModel will optionally accept one additional source, which should be treated as a source of truth. In this application, the source will primarily be a GithubSource, communicating with the Github Rest API.
Writes to the LocalModel will be optimistic, not requiring confirmation from source of truth, primarily to support offline writes. If writes are verifiably unable to be saved, the local model should be updated with the representation from the server. (This will require a more robust design for a general implementation)
LocalModel should be observable, and changes to a model locally should update any other components observing the model.
There should be a standard query language, expressed as a JavaScript object (referred to as LMQuery. TBD, but something like this for querying open Github issues: { state: 'open', $limit: 10, $skip: 20 }
The LocalModel should provide a static synchronize method, which will write all pending changes to the server, and fetch updates from sources of truth. The method should return an observable which contains all pending changes from cache and source of truth. This means that the LocalModel must maintain a persisted registry of LocalModel instances (and their respective sources of truth), which must be persisted beyond memory, so that synchronization can happen when the page is re-opened (or in the background).
The LocalModel's queue of pending writes should contain the payload of the write, the source to handle the write, and the parameters necessary to configure the source for the write.
Synchronization will be triggered by something outside of LocalModel (i.e. a ServiceWorker)
The LovefieldDB will have an additional table of pending writes, which should enable LocalModel to decorate JavaScript representations of models with metadata about the synchronization state of a given model. The schema would be something like:

# name of table associated with model
table_ref: string
# this model's stringified value of the schema's primary key
pk: string
# i.e. githubSource
source_of_truth: string
# JSON stringified array of arguments to be applied to source factory
source_args: string
# JSON stringified query
query: string

The source interface should be as follows:

function SomeAPISource(proprietaryArgs) {}
SomeAPISource.prototype = {
  subscribe: function(query:LMQuery) {
    /** 
     * should subscribe to local model changes, and optionally subscribe to changes from server
     * in the case of the GithubAPISource, it will install a WebHook for some types, and will subscribe
     * to a firebase that will accept payloads from Github (i.e. issue comments)
    **/
    return observable;
  },
  update: function(LMQuery, modifications) {
    /**
     * Will apply update to all objects that match the query
     * "modifications" is an object that will contain only the properties/values to update
     **/

    //promise does not get resolved with value, merely used to indicate success or failure
    return promise;
  },
  insert: function(newModel) {
    //promise does not get resolved with value, merely used to indicate success or failure
    return promise;
  }
}

Few comments on the proposed design.

+1 on having LocalModel abstract away the interaction between source of truth and local copies.
"LocalModel should be observable". This can be implemented by having each LocalModel instance observing an entire table as follows var q = db.select().from(localModelTable); db.observe(q, onChangeCallback);
"There should be a standard query language". I understand that this is needed for the clients of LocalModel (not direct clients of Lovefield), but Lovefield already provides an expressive language (see SelectContext class and SelectBuilder helper). In other words maybe clients of LocalModel can directly pass Lovefield query objects through LocalModel's API instead of custom JS objects (and have them translated to Lovefield queries)? This would spare an extra level of abstraction.
Having an extra table of pending writes might also be redundant. Assuming that each Table schema is updated to have a "sync" boolean column, changes in sync state of a table could be detected as follows:

var q = db. select(). from(localModelTable). where(localModelTable.sync.eq(false)); db.observe(q, function(changeRecord) { // detect if rows.length == 0, means that the entire table was synced. }); This might not be enough though if other data is needed (is sync in progress? how many times it failed, etc).

Regarding existing offline-github demo, I noticed something weird. After all my data was synced, the Issues table, was populated such that objects are not flat, for example in IndexedDB inspector I see that "user" instead of holding the user foreign key, it holds the entire user object, same for "assignee". This seems like an error to me and will most likely lead to other errors if queries on those attributes are performed.

Also, we now have a new data type supported "object" which will allow you to store a non-flat object as an attribute via Lovefield (but it can't be used for any indices/predicates), see https://github.com/google/lovefield/commit/bc20a69bada61b2ae13767dee6546d15c60ac113. Mentioning this because I noticed this line https://github.com/jeffbcross/offline-github/blob/master/lovefield/github.yml#L220.

Lovefield already provides an expressive language

Seems more that it would create a leaky abstraction to do this, rather than remove a level of abstraction, since controllers would now need to work directly with Lovefield. It would also require a little more work to parse the query for the non-lovefield source. Do you agree?

Having an extra table of pending writes might also be redundant.

I agree, I don't really like the idea of another table, and I don't think so much data should need to be stored to be able to push the change. Though, I think for now it's reasonable to not think about service workers triggering the synchronization without an app being open, since AFAIK background synchronization isn't yet supported in any service worker implementation.

To that end, perhaps each LocalModel instance should be 1:1 with a Lovefield database instance, and a single optional source of truth. E.g. I would have a service called "Github" that would return a LocalModel instance. The service would expose a method like Github.getCollection('Issues') which would return a collection type that would provide the crud methods. When the app starts up, it should call Github.synchronize() in the run block, which should synchronize all collections in the github database, starting with pushing local writes, then retrieving updates.

LocalModel should immediately attempt to synchronize individual changes to the source of truth as they are written, and should employ a configurable retry mechanism using exponential backoff, allowing setting the number of attempts before giving up until the next manual synchronization.

# enumerable state such as 0=no attempt,1=pending,2=synced,3=failed
_sync_state: number
# enumerable such as 0=create,1=read,2=update,3=delete
_sync_operation: number
# http error code
_sync_error_code: number

What do you think?

Seems more that it would create a leaky abstraction to do this, rather than remove a level of abstraction, since controllers would now need to work directly with Lovefield. It would also require a little more work to parse the query for the non-lovefield source. Do you agree?

Yes and no. Controllers would work with Lovefield's schema and query building layer (which is the outer-most layer), but not with the rest of Lovefield (for example a Controller would not call exec() on a Lovefield query object). I agree that there is some leakage here, but it is sort of contained. Regarding parsing the query for non-Lovefield sources it can be addressed in various ways. The most obvious (but probably not the cheaper one) is to utilize Lovefield's toSql() method, and have non-Lovefield sources parse the SQL string for that query. Alternatively, parsing could be avoided completely, given that lf.query.SelectContext class represents structured data (it has "from", "where", "orderBy" fields etc), so some translation would be needed but I would not qualify it as "parsing".

In any case, the more I think about this, I don't see a clear winner (new query language vs re-use Lovefield's schema and query building), so I will step back for a sec and question why a "standard query language" is the way to go here?

How about adding declarative methods where needed? For example there will be a LocalModel interface and various implementations (IssuesLocalModel, RepositoriesLocalModel etc). Once a LocalModel has been retrieved, it could be down-casted to the appropriate type, say IssuesLocalModel (admittedly not very elegant design-wise), which would expose methods that might be specific only to the "Issue" table (for example getAllOpenIssues()), but there would be no "standardized query language" exposed from LocalModel interface.

Regarding the "Github" service and the "remote" Collection instances, back-off strategies, SGTM.

The thinking behind a query object would be something that captures the lowest common denominator between Lovefield and other sources. Ie maybe Lovefield supports something that another source might not support out-of-the-box (like joins). Eventually, beyond this project, the LocalModel design should be able to support other local cache sources with different querying semantics.

I think the model-specific methods at this point are something that can be handled in an application-specific layer, and then can be considered for inclusion in the LocalModel API later.

I think we're in enough agreement that I could move forward with an implementation, don't you think @freshp86 and @davideast?

Yes, the plan sounds good to me, feel free to move forward.

As discussed in some emails and chats between myself and @freshp86, I've been experimenting with a less abstracted approach to the problem. There is only one view of the application, where a user can input an owner/repo pair, and see the issues for that view, paginated. When a new owner/repo is entered, the controller queries lovefield directly (no LocalModel abstraction) for issues for that repo, then tells a synchronization web worker to fetch all of the other issues for that repo. The web worker reports back when the total count changes so the UI can update the number of pages available.

Despite significant performance gains of running the synchronization in a web worker (faster execution, as well as less work in UI thread), there are drawbacks to having different lovefield instances in the ui thread and a web worker. Most notably, lovefield caches a lot of information to memory, and doesn't know that the instance in another thread has been inserting data and should therefore update.

So it's apparent that in the absence of changing Lovefield to be able to manually invalidate objects, the application needs a single database instance. Then the question becomes where should the instance live: in a web worker, or in the ui thread?

To keep the querying and database initialization in a separate thread from the ui thread would be ideal from a user experience perspective, since operations range from 100s-1000s of milliseconds. But the chief bottleneck of delegating tasks to Web Workers is marshaling (copying) objects back and forth between the worker thread and other threads. There is also a developer ergonomics issue of having to create a query DSL that can delegate constructing real lovefield queries to the lovefield worker.

So in this application, what is the cost of posting messages of database results to and from the worker? Issues are the main object in this app, and issues are usually queried in increments of 30, and inserted in increments of <100. 30 issues (not wrapped as lovefield rows) take about 64k of memory, so 100 would theoretically be 211. According to this benchmark, comparing copying of 1MB of data and 32MB of data, posting a message of 1MB takes about 3-5ms in FF and Chrome, and about 20ms on mobile safari. The comparison of the two times makes the cost seem somewhat linear, which would mean that marshaling 64k of a list of issues would be in the sub-millisecond range, and 211k about 1ms. It would be nice to measure this further in this application.

So the cost of worker message posting for this application seems very reasonable. And from measuring the time of inserting 100 rows of data (110ms), for about 100 inserts for projects like angular/angular.js (10k+ issues), moving the insertion into a separate thread can shift 11,000ms away from the UI thread.

Additionally, the cost of fetching 30 issues from lovefield in the UI thread or a worker comes to about 315ms currently in Chrome. To shift this to a web worker would shift all of that time off of the main thread, and would only cost about 0.22ms.

In addition to runtime considerations, there's also a consideration for how long it takes to generate a database instance when the application is bootstrapping. I measured in the main thread of my application, it takes about 1240ms to generate the instance (with about 11,000 issues in it), and about 870ms in a worker to generate an instance of the same database. Though the measurement isn't perfect, it's consistent with other speed increases I've observed from moving operations to a worker. For example, I noticed a significant increase in synchronization speed when the whole process was moved to a worker, despite moving to a slower network). In addition to speeding up the operation, it's also work that could impede the user experience if left in the UI thread.

Anyway, that's all just to say that I'm going to experiment with running all lovefield operations within a web worker, and am going to measure the performance of database operations and message passing operations.

jeffbcross / offline-github

Revised Design for LocalModel #4