rwynn / gtm

gtm (go tail mongo) is a MongoDB event listener
MIT License
146 stars 33 forks source link

Missing op.Data when alterations to same ns + id happen within same Timestamp Epoch second #9

Closed zph closed 7 years ago

zph commented 7 years ago

Using this sha: 6cf4c1bdfabb42016672ab20352a9bad40e68598

We observed that occasionally data ops would come through GTM channel that had an empty op.Data.

We traced it down the circumstances where two operations in oplog happened in very close proximity Timestamp(1489513643, 3) and Timestamp(1489513643, 5).

The code responsible was here: https://github.com/rwynn/gtm/blob/6cf4c1bdfabb42016672ab20352a9bad40e68598/gtm.go#L197 where it was assumed that there's a 1-1 mapping between results from Mongo and this.Entries. By assigning based on Mongo results and keying off the id, it skipped the operation with Timestamp(1489513643, 3).

It is fixed in the newest commit: Here: https://github.com/rwynn/gtm/blob/master/gtm.go#L218-L222 where the byId[resultId] is now a slice instead of a single *Op.

So once I upgraded moresql to e400ebf9f055864f277fcb1471f851a4c112b603, that issue resolved.

At that point, I needed to tweak moresql to make a copy of the op as it comes from gtm's tail channel, otherwise there are concurrent map read/write errors. It might be worth considering sending new copies of op.Data for each result so that this is solved in GTM instead of implemented in client libs depending on GTM.

for _, o := range mapped {
    data := make(map[string]interface{})
    // Create copy in case of multiple ops using same mongo result
    for k, v := range result {
        data[k] = v
    }
    o.Data = data
}

That proposed change is represented here: https://github.com/rwynn/gtm/pull/8.

Feel free to close this issue once you read through. Since it's mostly fixed in master, I wanted to leave a troubleshooting record for other folks who might come along.

rwynn commented 7 years ago

@zph, thanks for adding this. At some point I realized that it would be possible for 2 or more updates on the same _id to be included in a single batch. My original concern was that the Data would be nil on all but one of them. I will also fix the concurrency issue where the Data is shared in memory between the ops.

rwynn commented 7 years ago

@zph, recently I added an optimization to gtm with regard to updates. Previously, all updates (when not using UpdateDataAsDelta) would go through the fetch path where the corresponding data would be retrieved and set on the op. Now, gtm distinguishes between different types of updates. Some updates replace the entire document and thus the oplog gets a full copy of the data. This is the case that was optimized to not cause a fetch but rather to simply use the data in the oplog. Other updates that use $set, $unset, $addToSet, etc are recorded in the oplog as $set and $unset deltas. In this case gtm will still use the fetch approach to get the full data.

In summary, updates like

db.test.update({_id: 1}, {foo: 2})

do less work now in gtm, where updates like

db.test.update({_id: 1}, {$set: {foo: 2}});

do the same amount of work as before this commit.