sopel-irc / sopel

:robot::speech_balloon: An easy-to-use and highly extensible IRC Bot framework. Formerly Willie.
https://sopel.chat
Other
950 stars 405 forks source link

Race Condition With Database Writes (`set_nick_value()`) #2479

Open ghost opened 1 year ago

ghost commented 1 year ago

Description

If two commands try to write a nick value (set_nick_value()) for the a nickname at the same time, before that nickname exists in the database, you will end up with multiple nick_ids for one nicknames.

Reproduction steps

  1. Install Sopel (doesn't matter if stable or from git).
  2. Create a basic enough default.cfg so that Sopel will connect to your desired server and channel.
  3. Add test.py (see below) to your /plugins subfolder
  4. Start Sopel
  5. Type any message into chat and you will either get an error in chat or in your console (or both!).
  6. Investigate your database and you will see in the nicknames table that one nick has two nick_ids associated with it.

test.py:

from sopel import plugin

@plugin.rule(r".*")
def test_break(bot, trigger):
    bot.db.set_nick_value(trigger.nick, 'test_value', 0)

Expected behavior

There should only be one nick_id created for one nickname.

Relevant logs

[2023-06-19 16:26:44,218] sopel.bot            ERROR    - Unexpected MultipleResultsFound (Multiple rows were found when one or none was required) from test at 2023-06-19 21:26:44.218692+00:00. Message was: test
Traceback (most recent call last):
  File "/home/test/bots/testbot/lib/python3.10/site-packages/sopel/bot.py", line 703, in call_rule
    rule.execute(sopel, trigger)
  File "/home/test/bots/testbot/lib/python3.10/site-packages/sopel/plugins/rules.py", line 1267, in execute
    exit_code = self._handler(bot, trigger)
  File "/home/test/bots/testbot/sopel/plugins/test.py", line 6, in test_break
    bot.db.set_nick_value(trigger.nick, 'test_value', 0)
  File "/home/test/bots/testbot/lib/python3.10/site-packages/sopel/db.py", line 466, in set_nick_value
    nick_id = self.get_nick_id(nick, create=True)
  File "/home/test/bots/testbot/lib/python3.10/site-packages/sopel/db.py", line 378, in get_nick_id
    ).scalar_one_or_none()
  File "/home/test/bots/testbot/lib/python3.10/site-packages/sqlalchemy/engine/result.py", line 1225, in scalar_one_or_none
    return self._only_one_row(
  File "/home/test/bots/testbot/lib/python3.10/site-packages/sqlalchemy/engine/result.py", line 614, in _only_one_row
    raise exc.MultipleResultsFound(
sqlalchemy.exc.MultipleResultsFound: Multiple rows were found when one or none was required
[2023-06-19 16:26:44,219] sopel.modules.seen   ERROR    - Unable to save seen, database error: Multiple rows were found when one or none was required

Notes

Make sure you're testing with with a fresh database or using nicks your bot has never seen before.

Sopel version

v8.0.0.dev0

Installation method

pip install

Python version

3.10.12

Operating system

No response

IRCd

No response

Relevant plugins

seen.py

RustyBower commented 1 year ago

I'm not able to repro this with the latest master and python 3.10.6

RustyBower commented 1 year ago

Ah. After completely blowing away the database, I can repro this.

ghost commented 1 year ago

@RustyBower: you're testing either with a fresh database or a nick your bot's never seen before?

He is now. :P

dgw commented 1 year ago

Might be reasonable for "this is a new nick I haven't seen before" to be handled somewhere in core before plugins ever have a chance to trigger, maybe a @plugin.priority('high') + @plugin.unblockable handler in coretasks. There's already a JOIN handler that would be a natural place to do this, and it's already intended to run before any plugins get their mitts on the trigger.

(Yes, it'd still be possible for plugins to intentionally attempt to have their code run before coretasks—we even have a documentation tip about handling events before Sopel does. But making it intentional, not accidental, would be a big step all the same.)

SnoopJ commented 1 year ago

I can reproduce this on both 3.10 and 3.7. This is a TOCTOU issue in set_nick_value().

A user plugin should be able to avoid this issue in 7.1.9 or against development (7bcb942) by adding @plugin.thread(False) to their handlers, which forces the transaction onto the same thread (main) as the seen plugin. @wkeiuluf this may be useful to you as an immediate workaround, but the bug is I think a legitimate issue in Sopel.

maybe a @plugin.priority('high') + @plugin.unblockable handler in coretasks

I'm a -1 on this since it means doing work in the core that may be serving a plugin that isn't even enabled. I imagine most Sopel instances do enable seen, but it feels too much like coupling between a plugin and the core. FWIW, I tried altering the seen plugin to run with "high" priority and it did not affect the behavior of the reproduction, so I'm not even sure if this approach will work, philosophical matters aside.

If we declare the slug column to be unique=True, we can at least force an IntegrityError in this collision case instead writing bad data to the DB, but I'm not sure what the rest of a patch that does this would look like, it wouldn't be very polite of us to issue an error to the user in this case, but I'm not sure that just retrying the transaction again is a good idea either. Still messing around with it…

dgw commented 1 year ago

I'm a -1 on this since it means doing work in the core that may be serving a plugin that isn't even enabled. I imagine most Sopel instances do enable seen, but it feels too much like coupling between a plugin and the core.

Think I was too brief before. I don't mean "a nick I haven't seen before" in the context of seen the plugin, but rather a nick that doesn't exist in the database yet. Sopel already adds to bot.users (and the lists for each channel) when it sees a JOIN, and it wouldn't be unreasonable to ping the database with (spitballing here) bot.db.get_nick_id(new_nick, create=true) so the database also is "aware of" the new nick, just as the bot object in memory is.

Maybe it would eliminate this particular race/TOCTOU, or maybe not—but it also seems unnecessarily laissez-faire to wait until someone calls bot.db.set_nick_value() to generate an ID. On the other hand, we don't want the DB to fill with useless rows for every temporary nickname that joins after a netsplit… Handling the issue directly does seem like the smart move, long term.

_(All that said, the database schema really needs a revision to combine the nick_ids and nicknames tables. The way it's implemented now is silly, and we know it—but I think it's a holdover from some quirk of SQLite autoincrement or something… not sure exactly, it predates my involvement. Making a schema change without first building out a migration path just seems "too YOLO", though.)_

SnoopJ commented 1 year ago

If we declare the slug column to be unique=True, we can at least force an IntegrityError in this collision case instead writing bad data to the DB, but I'm not sure what the rest of a patch that does this would look like

I'm a little more sure now what this solution would look like. Making the slug unique causes the IntegrityError to occur earlier than I had realized, when calling get_nick_id() which mutates the DB (yuck). The patch below tries to detect this condition and backs off the attempted creation of a new entry in the Nicknames table, and I can't reproduce the bug with this patch applied.

Click for patch diff

```diff diff --git a/sopel/db.py b/sopel/db.py index 2763495f..191d6d1c 100644 --- a/sopel/db.py +++ b/sopel/db.py @@ -24,7 +24,7 @@ import typing from sqlalchemy import Column, create_engine, ForeignKey, Integer, String from sqlalchemy.engine.url import make_url, URL -from sqlalchemy.exc import OperationalError +from sqlalchemy.exc import OperationalError, IntegrityError from sqlalchemy.orm import declarative_base, scoped_session, sessionmaker from sqlalchemy.sql import delete, func, select, update @@ -68,7 +68,7 @@ class Nicknames(BASE): __tablename__ = 'nicknames' __table_args__ = MYSQL_TABLE_ARGS nick_id = Column(Integer, ForeignKey('nick_ids.nick_id'), primary_key=True) - slug = Column(String(255), primary_key=True) + slug = Column(String(255), primary_key=True, unique=True) canonical = Column(String(255)) @@ -397,14 +397,26 @@ class SopelDB: session.add(nick_id) session.commit() - # Create a new Nickname - nickname = Nicknames( - nick_id=nick_id.nick_id, - slug=slug, - canonical=nick, - ) - session.add(nickname) - session.commit() + try: + # Create a new Nickname + nickname = Nicknames( + nick_id=nick_id.nick_id, + slug=slug, + canonical=nick, + ) + session.add(nickname) + session.commit() + except IntegrityError as exc: + if exc.args and "UNIQUE constraint failed: nicknames.slug" in exc.args[0]: + LOGGER.warning("Data race detected when calling get_nick_id()") + # another thread beat us to creating this entry, rollback and query again + session.rollback() + nickname = session.execute( + select(Nicknames).where(Nicknames.slug == slug) + ).scalar_one_or_none() + else: + # something else went wrong, re-raise + raise ```

However, this patch is incomplete, because a new nick_id is still generated by the earlier transaction in get_nick_id(). Below are the contents of my test bot's DB after a test with this patch applied, note that there are only two nicknames, but four nick_ids:

sqlite> SELECT * FROM nicknames;
2|snoopj|SnoopJ
3|terribot|terribot
sqlite> SELECT * FROM nick_ids;
1
2
3
4

It sounds like either the core task or DB revision that @dgw proposes would avoid this problem, with the core task neatly avoiding this issue altogether. I would be in favor of promptly populating the table before any plugins (including official ones) have the option to cause havoc, but I'm not familiar enough with these Sopel guts to know what that looks like.

ghost commented 8 months ago

I don't want to further slow 8.0.0 down, but could we add this to the 8.1.0 milestone?

dgw commented 8 months ago

Since get_nick_id() is the problematic code path, I'm sure we could cut the frequency of this issue down to (almost?) nothing using a Lock object and a well-placed with nick_id_lock: context manager.

And in the future, once we sort out the path toward proper database migrations (#1784), some simplified version of the ridiculous three-table schema we have for nicknames can be applied to channels, so they too can benefit from case-insensitivity. But I'm getting way ahead of this issue, huh?