Closed zph closed 7 years ago
@zph, thanks for adding this. At some point I realized that it would be possible for two or more updates on the same _id to be included in a single batch. My original concern was that the Data would be nil on all but one of them. I will also fix the concurrency issue where the Data is shared in memory between the ops.
@zph, recently I added an optimization to gtm with regard to updates. Previously, all updates (when not using UpdateDataAsDelta) would go through the fetch path, where the corresponding document would be retrieved and set on the op. Now, gtm distinguishes between different types of updates. Some updates replace the entire document, and thus the oplog gets a full copy of the data. This is the case that was optimized: it no longer causes a fetch but instead simply uses the data in the oplog. Other updates that use $set, $unset, $addToSet, etc. are recorded in the oplog as $set and $unset deltas. In this case gtm still uses the fetch approach to get the full document.
In summary, updates like

db.test.update({_id: 1}, {foo: 2})

now do less work in gtm, while updates like

db.test.update({_id: 1}, {$set: {foo: 2}});

do the same amount of work as before this commit.
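The distinction hinges on what the oplog update document looks like. A minimal sketch of the idea, assuming a hypothetical isReplacement helper (gtm's actual internal check may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// isReplacement reports whether an oplog update document carries a full
// copy of the target document (a replacement-style update) rather than
// operator deltas like $set/$unset. Hypothetical helper for illustration.
func isReplacement(o map[string]interface{}) bool {
	for k := range o {
		if strings.HasPrefix(k, "$") {
			return false // operator update: oplog only has deltas, fetch needed
		}
	}
	return true // full document is in the oplog: no fetch needed
}

func main() {
	// db.test.update({_id: 1}, {foo: 2}) -> full document in the oplog
	replace := map[string]interface{}{"_id": 1, "foo": 2}
	// db.test.update({_id: 1}, {$set: {foo: 2}}) -> deltas only
	delta := map[string]interface{}{"$set": map[string]interface{}{"foo": 2}}

	fmt.Println(isReplacement(replace)) // true: use oplog data directly
	fmt.Println(isReplacement(delta))   // false: fall back to the fetch path
}
```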
Using this sha: 6cf4c1bdfabb42016672ab20352a9bad40e68598
We observed that occasionally data ops would come through the gtm channel with an empty op.Data. We traced it to circumstances where two operations on the same document appeared in the oplog in very close proximity: Timestamp(1489513643, 3) and Timestamp(1489513643, 5).
The code responsible was here: https://github.com/rwynn/gtm/blob/6cf4c1bdfabb42016672ab20352a9bad40e68598/gtm.go#L197, where it was assumed that there is a 1-1 mapping between results from Mongo and this.Entries. By assigning based on Mongo results and keying off the id, it skipped the operation with Timestamp(1489513643, 3).
It is fixed in the newest commit, here: https://github.com/rwynn/gtm/blob/master/gtm.go#L218-L222, where byId[resultId] is now a slice instead of a single *Op. Once I upgraded moresql to e400ebf9f055864f277fcb1471f851a4c112b603, that issue was resolved.
At that point, I needed to tweak moresql to make a copy of the op as it comes off gtm's tail channel; otherwise there were concurrent map read/write errors. It might be worth considering sending new copies of op.Data for each result so that this is solved in gtm itself rather than in client libraries that depend on it. That proposed change is represented here: https://github.com/rwynn/gtm/pull/8.
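For anyone hitting the same concurrent map read/write errors before that change lands, a minimal sketch of the client-side workaround, using a hypothetical copyData helper (note this is a shallow copy; nested maps would still be shared, so a deep copy may be needed depending on how the data is mutated):

```go
package main

import "fmt"

// copyData returns a shallow copy of an op's Data map so each consumer
// works with its own map and concurrent access to the original is avoided.
// Hypothetical helper; nested values are still shared references.
func copyData(data map[string]interface{}) map[string]interface{} {
	if data == nil {
		return nil
	}
	out := make(map[string]interface{}, len(data))
	for k, v := range data {
		out[k] = v
	}
	return out
}

func main() {
	// Data as shared between ops in the same batch.
	shared := map[string]interface{}{"foo": 2}

	// Copy before handing off to a goroutine or mutating locally.
	mine := copyData(shared)
	mine["foo"] = 3

	fmt.Println(shared["foo"], mine["foo"]) // 2 3
}
```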
Feel free to close this issue once you've read through. Since it's mostly fixed in master, I wanted to leave a troubleshooting record for other folks who might come along.