matrix-org / dendrite

Dendrite is a second-generation Matrix homeserver written in Go!
https://matrix-org.github.io/dendrite/
Apache License 2.0
5.75k stars 675 forks source link

State key NID mappings can be missing #2094

Open kegsay opened 2 years ago

kegsay commented 2 years ago

`error="found 33449 users but only have state key nids for 33447 of them"``

Caused by roomserver code in storage: JoinedUsersSetInRooms:

func (d *Database) JoinedUsersSetInRooms(ctx context.Context, roomIDs []string) (map[string]int, error) {
    roomNIDs, err := d.RoomsTable.BulkSelectRoomNIDs(ctx, roomIDs)
    if err != nil {
        return nil, err
    }
    userNIDToCount, err := d.MembershipTable.SelectJoinedUsersSetForRooms(ctx, roomNIDs)
    if err != nil {
        return nil, err
    }
    stateKeyNIDs := make([]types.EventStateKeyNID, len(userNIDToCount))
    i := 0
    for nid := range userNIDToCount {
        stateKeyNIDs[i] = nid
        i++
    }
    nidToUserID, err := d.EventStateKeysTable.BulkSelectEventStateKey(ctx, stateKeyNIDs)
    if err != nil {
        return nil, err
    }
    if len(nidToUserID) != len(userNIDToCount) {
        return nil, fmt.Errorf("found %d users but only have state key nids for %d of them", len(userNIDToCount), len(nidToUserID))
    }

It shouldn't be possible to have NIDs in the membership table but not in the state keys table, but apparently it is.

kegsay commented 2 years ago

So the place where InsertMembership is called is in membershipUpdaterTxn which is called just after targetUserNID, err = d.assignStateKeyNID(ctx, txn, targetUserID) when NewMembershipUpdater is made. However, assignStateKeyNID hits a cache first:

func (d *Database) assignStateKeyNID(
    ctx context.Context, txn *sql.Tx, eventStateKey string,
) (types.EventStateKeyNID, error) {
    if eventStateKeyNID, ok := d.Cache.GetRoomServerStateKeyNID(eventStateKey); ok {
        return eventStateKeyNID, nil
    }
    // Check if we already have a numeric ID in the database.
    eventStateKeyNID, err := d.EventStateKeysTable.SelectEventStateKeyNID(ctx, txn, eventStateKey)
    if err == sql.ErrNoRows {
        // We don't have a numeric ID so insert one into the database.
        eventStateKeyNID, err = d.EventStateKeysTable.InsertEventStateKeyNID(ctx, txn, eventStateKey)
        if err == sql.ErrNoRows {
            // We raced with another insert so run the select again.
            eventStateKeyNID, err = d.EventStateKeysTable.SelectEventStateKeyNID(ctx, txn, eventStateKey)
        }
    }
    if err == nil {
        d.Cache.StoreRoomServerStateKeyNID(eventStateKey, eventStateKeyNID)
    }
    return eventStateKeyNID, err
}

If it was possible for it to be in the cache but not the database, this would cause this error. Given this is a rare condition to hit, I wouldn't be surprised if the race logic here doesn't work as intended. If that were the case then the flow could be: