python-discord / metricity

Advanced metric collection for the Python Discord server
MIT License
42 stars 17 forks source link

Users table has all users set to in_guild=False on user upsert #157

Closed jchristgit closed 1 day ago

jchristgit commented 2 weeks ago

When metricity starts up and synchronizes the users, all users temporarily have their in_guild field set to false.

From looking at the code I am currently unsure how this can happen. As far as I understand, the complete upsert is run in a transaction that is committed at the end of each individual insert.

Demonstration as taken during a metricity restart (whilst User upsert: messages can be seen in the logs):

metricity=# SELECT in_guild, COUNT(*) FROM users GROUP BY in_guild;  -- During sync
 in_guild | count
----------+--------
 f        | 884763
(1 row)

metricity=# SELECT in_guild, COUNT(*) FROM users GROUP BY in_guild;  -- After sync
 in_guild | count
----------+--------
 f        | 494638
 t        | 390125
(2 rows)
wookie184 commented 2 weeks ago

Looks like it's currently written to set all to not in the guild, then set those in the guild to true after https://github.com/python-discord/metricity/blob/e68d83de9bb3027085c4f72e3da4699eefdc1546/metricity/exts/event_listeners/startup_sync.py#L39

jchristgit commented 2 weeks ago

ahhh, I missed that.

Could we maybe do the whole update operation in a transaction or would that bust some limits with regards to PostgreSQL's transaction memory?

Maybe alternatively we could collect the IDs of all users where we set it to true and then do a single (batch of) updates after the initial sync to set in_guild=False for anyone who's not in that set.

ChrisLovering commented 2 weeks ago

https://github.com/python-discord/metricity/pull/159 seems like a way we could fix this. I haven't been able to test that postgres is ok about serialising that many strings in a single query though.

ChrisLovering commented 2 weeks ago

sqlalchemy.exc.InterfaceError: (sqlalchemy.dialects.postgresql.asyncpg.InterfaceError) <class 'asyncpg.exceptions._base.InterfaceError'>: the number of query arguments cannot exceed 32767 will need to relook at this approach

ChrisLovering commented 2 weeks ago

My only thoughts on how to make the above PR work would be to create a TEMPORARY table with ON COMMIT DROP and somehow bulk insert all in-guild users into it.

Can't look into how to achieve that with SQLAlchemy right now though.

wookie184 commented 2 weeks ago

It might be easiest to select all in guild users from the database and then do db_in_guild - guild.members and mark them as not in guild with another query.

It's not quite as nice as doing it all on the database but in practice I don't think it makes a difference for our case.

jchristgit commented 1 week ago

It might be easiest to select all in guild users from the database and then do db_in_guild - guild.members and mark them as not in guild with another query.

Won't this run into the same parameter issue?

I have an alternative idea, although it might be a bit crazy:

First we sort the list of user IDs to update ascending. Then we move the data in chunks, with special handling for the first and last chunk to incorporate an update for anything outside our known range as well.

Pseudocode (unsure of the exact SQLAlchemy syntax):

async with async_session() as sess:
    total = len(in_guild_user_id_chunks)
    for idx, chunk in enumerate(in_guild_user_id_chunks):
        not_user_in_chunk = ~models.User.id.in_(chunk)
        user_below_upper_bound = models.User.id <= max(chunk)
        user_above_lower_bound = models.User.id >= min(chunk)

        if idx == 0:
            # Update the complete lower end we are aware of
            condition = (user_below_upper_bound & not_user_in_chunk)
        elif idx == total - 1:
            # Update the complete upper end we are aware of
            condition = (user_above_lower_bound & not_user_in_chunk)
        else:
            # Just update this batch
            condition = (user_below_upper_bound & user_above_lower_bound & not_user_in_chunk)

        await sess.execute(update(models.User).where(condition).values(in_guild=False))
        await sess.commit()

Things to consider:

wookie184 commented 1 week ago

My suggestion was to do the comparison against guild.members on the Python side rather than the database site, so basically what we currently do but only change the users we need to. Something like this:

        async with async_session() as sess:
            in_guild_users = await sess.execute(select(models.User).filter_by(in_guild=True)).scalars().all()
            guild_member_ids = {member.id for member in guild.members}

            for user in in_guild_users:
                if user.id not in guild_member_ids:
                    user.in_guild = False

            await sess.commit()
jchristgit commented 22 hours ago

Can confirm it works as expected now, no more dips are visible in statistics.

Thanks for the fix!

users