Here's a rough draft of a SQLite-based bucket (closes #37). FIFO behavior is done with an auto-incrementing index (on write) and `ORDER BY` (on read).

Would you prefer for this to be in its own module (and maybe put `RedisBucket` in its own module as well), or keep everything in the `bucket` module?
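For anyone skimming, here's a minimal sketch of that approach; the table and column names (`bucket`, `idx`, `item`) are placeholders, not the PR's actual schema:

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# An auto-incrementing integer key records insertion order on write
connection.execute(
    "CREATE TABLE bucket (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)
connection.execute("INSERT INTO bucket (item) VALUES (?)", (123.45,))

# Reading with ORDER BY on that key yields FIFO order
row = connection.execute(
    "SELECT idx, item FROM bucket ORDER BY idx LIMIT 1"
).fetchone()
connection.execute("DELETE FROM bucket WHERE idx = ?", (row[0],))
connection.commit()
```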
I think the current change looks fine. It deserves to be in its own module.
This should be using prepared statements. While it may appear that all of the data is controlled explicitly by local code, it is feasible for the database to be accessed by other processes. There's almost never a reason not to use prepared statements, and this is no exception.
As for making `get` more efficient, I'm not sure how much more efficient it could become. The documentation indicates that you cannot simply delete a given number of rows without a special compilation flag, so selection and deletion must be done in separate steps. The only possible SQL speedup I could see would be making one transaction instead of two (easily done with a context manager), which is surely insignificant unless you're trying to rate limit thousands of outbound requests per second.
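For illustration, that single-transaction version might look something like this (a sketch against the placeholder schema above, not the PR's code):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE bucket (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)
connection.executemany("INSERT INTO bucket (item) VALUES (?)", [(1.0,), (2.0,)])

number = 1  # how many items to dequeue

# Using the connection as a context manager wraps both statements in a
# single transaction: commit on success, rollback on exception
with connection:
    keys = [
        row[0]
        for row in connection.execute(
            "SELECT idx FROM bucket ORDER BY idx LIMIT ?", (number,)
        )
    ]
    connection.executemany(
        "DELETE FROM bucket WHERE idx = ?", [(key,) for key in keys]
    )
```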
The only exception I could see would be to use a `RETURNING` clause on insertion and keep all the keys in memory as well, essentially using the database as a persistent duplicate of an in-memory queue. This would eliminate the need for the first SELECT statement, but I'm not entirely sure it would be worth it. Atomic operations would be very difficult and the speedup would be minimal.
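A rough sketch of that idea, assuming SQLite >= 3.35 (where `RETURNING` was added) and the same placeholder schema:

```python
import sqlite3
from collections import deque

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE bucket (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)
keys = deque()  # in-memory mirror of the row keys

# RETURNING hands back the generated key without a separate SELECT
cursor = connection.execute(
    "INSERT INTO bucket (item) VALUES (?) RETURNING idx", (1.0,)
)
keys.append(cursor.fetchone()[0])

# get() can then delete by remembered key directly
connection.execute("DELETE FROM bucket WHERE idx = ?", (keys.popleft(),))
```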
I agree that the room for optimization with `get` is minimal here, and probably not worth it.
> There's almost never a reason not to use prepared statements

That's news to me. Can you elaborate? Python's `sqlite3` API doesn't expose any functions for working with them directly, and only uses them internally for parametrized queries. Depending on connection settings, SQLite is already safe to use from multiple threads or processes, with the caveat that it internally locks and queues up concurrent write operations.
> > There's almost never a reason not to use prepared statements
>
> That's news to me. Can you elaborate? Python's `sqlite3` API doesn't expose any functions for working with them directly, and only uses them internally for parametrized queries. Depending on connection settings, SQLite is already safe to use from multiple threads or processes, with the caveat that it internally locks and queues up concurrent write operations.

No, you are correct; I meant parameterization. That was a typo on my part. Oops! In any case, there's almost never a reason not to use parameterized queries.
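To make the terminology concrete, here's the difference in `sqlite3` (placeholder schema again):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE bucket (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)
item = 123.45

# Parameterized: the driver binds the value; it is never spliced into the SQL
connection.execute("INSERT INTO bucket (item) VALUES (?)", (item,))

# String formatting: splices the value into the statement text, which is
# where injection problems come from if the value is untrusted
connection.execute(f"INSERT INTO bucket (item) VALUES ({item})")
```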
What are your thoughts about the size speedup? It would be fairly easy to implement; I'm just not sure which option is better. Querying the database every time would be a little more robust if something else is simultaneously using the same database for some reason and messing with the bucket, but it still isn't atomic, so unexpected issues there would just be less common, not prevented.
I think the only tradeoff would be implementation complexity versus speed. Perhaps setting up a speed test would be worth it?
Anyway, I think the implementation would be something like:

```python
def __init__(self, maxsize=0, identity: str = None, path: str = None, **kwargs):
    ...
    # Cache the row count once, then keep it updated on put/get
    self._size = self.connection.execute(f"SELECT COUNT(*) FROM {self.table}").fetchone()[0]

def size(self) -> int:
    return self._size

def put(self, item: float) -> int:
    if self.size() < self.maxsize:
        ...
        self._size += 1
        return 1
    return 0

def get(self, number: int = 1) -> float:
    ...
    self._size -= len(keys)
    return len(keys)
```
> What are your thoughts about the size speedup?

Based on some quick tests, `SELECT COUNT` is faster than I thought, with an average time of 0.006 milliseconds on my machine for a table with 100,000 items. So I'm leaning towards just using `SELECT COUNT` for now, unless a scenario comes up where it somehow runs slower than the intended request rate. Does that sound reasonable?
> > What are your thoughts about the size speedup?
>
> Based on some quick tests, `SELECT COUNT` is faster than I thought, with an average time of 0.006 milliseconds on my machine for a table with 100,000 items. So I'm leaning towards just using `SELECT COUNT` for now, unless a scenario comes up where it somehow runs slower than the intended request rate. Does that sound reasonable?
I average about 0.001 milliseconds for a 1,000-row table and 2.2 milliseconds for a 1,000,000-row table. I expect this is fine for this use case.
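For anyone who wants to reproduce the measurement, a quick sketch (schema and row count are illustrative):

```python
import sqlite3
import timeit

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE bucket (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)
connection.executemany(
    "INSERT INTO bucket (item) VALUES (?)", ((float(i),) for i in range(1_000_000))
)

elapsed = timeit.timeit(
    lambda: connection.execute("SELECT COUNT(*) FROM bucket").fetchone()[0],
    number=1000,
)
# Total seconds over 1,000 runs == average milliseconds per run
print(f"{elapsed:.3f} ms per SELECT COUNT(*)")
```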
Just finished up the tests and docs. Thanks for reviewing!
@vutran1710 I think this is ready to go, whenever you have time to take a look.
Actually... the table name should probably be parameterized, as it is user input.

Edit: there does not seem to be functionality for this. I'll research it.
That did cross my mind, and some ORM/DBAPI libraries have features for sanitizing arbitrary input for SQL statements, but I don't know of an easy way to do that with just the stdlib. Let me know if you find one, or you could add that in a separate PR if you'd like.
I don't think SQL injection is worth worrying about here. Worst case is a user might pick a bucket name that's not a valid table name, get an exception, and have to pick a different name. Or are there other potential problems here that you can think of?
Okay, there really is no supported way of doing this. Unfortunately, our table names are user input and are even more susceptible to issues than most other areas.

Possible strategies:

- Document that `identity` is used in queries without being checked ("check it yourself" in the docs).
- Hash the identity: a table name like `pyrate_22596363b3de40b0` is not very clean, and there is technically a possibility of collisions, though that could be ignored with a long enough hash.
- Verify that `identity` is alphanumeric, or replace non-compliant characters with underscores, and throw a `ValueError` if verification fails. Replacement could cause collisions in niche use cases where multiple instances are running on the same SQL server and have weird identities, e.g. "hello#world" and "hello&world" (see the sketch below).
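A minimal sketch of the verification option; the helper name is hypothetical:

```python
import re

def validate_identity(identity: str) -> str:
    """Hypothetical helper: allow only ASCII alphanumerics and underscores"""
    if not re.fullmatch(r"\w+", identity, flags=re.ASCII):
        raise ValueError(f"Invalid identity for table name: {identity!r}")
    return identity

print(validate_identity("hello_world"))  # hello_world
# validate_identity("hello#world")       # raises ValueError
```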
s, e.g. "hello#world" and "hello&world" For SQLite, if the table name is surrounded by quotes or brackets (which I'll add shortly) there are almost no naming restrictions except that it can't start with sqlite_
. So I'm not sure this is a problem.
```python
import sqlite3

connection = sqlite3.connect('test.db')

# Surprisingly works
connection.execute('CREATE TABLE "somehow_valid⁉️ 🙃 🤷" (value)')
connection.execute('CREATE TABLE "; DROP TABLE test; --" (value)')

# Raises an error
connection.execute('CREATE TABLE "sqlite_something" (value)')
```
> Actually... the table name should probably be parameterized, as it is user input. Edit: there does not seem to be functionality for this. I'll research it.
>
> That did cross my mind, and some ORM/DBAPI libraries have features for sanitizing arbitrary input for SQL statements, but I don't know of an easy way to do that with just the stdlib. Let me know if you find one, or you could add that in a separate PR if you'd like.
>
> I don't think SQL injection is worth worrying about here. Worst case is a user might pick a bucket name that's not a valid table name, get an exception, and have to pick a different name. Or are there other potential problems here that you can think of?
Sure, there are other potential problems, but I also wouldn't consider that a non-issue. Let's take a contrived example: someone makes a proxy scraper service for a website, and unfortunately, each user's username is used as the identity for their table. A malicious actor decides she wants to attack the proxy scraper, so she sets her username to `table_name (my new table definition) --` and suddenly she can make the table whatever she wants. She could, for example, add a bunch of stored generated columns designed to be needlessly expensive on storage and/or CPU and ruin other users' speed. She could remove the auto-incrementing index, which would cause issues we can't guarantee are handled by the library user. Perhaps she could use a stored generated blob column that takes up unbounded space; that would be hard to debug. She could do many things. I don't immediately see how, but this could possibly open up further attacks.

We don't know how `PyrateLimiter` will be used, and even if we did, once you can see an attack vector, there are often more attacks using it than you can reasonably expect to consider.
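Here's a contrived, self-contained illustration of that username trick, assuming (hypothetically) that the table name is spliced into the statement unquoted; all names here are made up:

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Hypothetical malicious identity: supplies its own column list,
# then comments out the intended table definition
identity = "bucket (attacker_defined_column) --"

# Unsanitized interpolation: still one valid statement, but the attacker
# now controls the schema
connection.execute(
    f"CREATE TABLE {identity} (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)

# Only the attacker's column exists; the intended columns are gone
print(connection.execute("PRAGMA table_info(bucket)").fetchall())
```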
Honestly, the more I think about it, the less I hate the hashing system. The only possible case where having non-descriptive table names is an issue is if some other program is accessing our bucket. I'm not sure why that would ever happen, but if someone needed that for whatever reason, they can get the names programmatically with `.table`, so... whatever?
Any of the guaranteed hash algorithms would work, but using sha1 it would be as simple as:

```python
import hashlib

class SQLiteBucket(AbstractBucket):
    def __init__(self, maxsize=0, identity: str = None, path: str = None, **kwargs):
        ...
        # Derive a stable, SQL-safe table name from the user-provided identity
        self._table = "someprefix_" + hashlib.sha1(identity.encode()).hexdigest()
```
> Honestly, the more I think about it, the less I hate the hashing system.

Yeah, that looks reasonable to me too. I just added that.
Closed due to inactivity
@vutran1710 Which changes do you think need to be implemented before merging this? This is complete enough for my own purposes, and I imagine 90% of other use cases. The other suggestions above could be separate issues/PRs if anyone really needs them.
> @vutran1710 Which changes do you think need to be implemented before merging this? This is complete enough for my own purposes, and I imagine 90% of other use cases. The other suggestions above could be separate issues/PRs if anyone really needs them.

oh so that means its done already? I thought you guys were having an unfinished conversation 😂
> > @vutran1710 Which changes do you think need to be implemented before merging this? This is complete enough for my own purposes, and I imagine 90% of other use cases. The other suggestions above could be separate issues/PRs if anyone really needs them.
>
> oh so that means its done already? I thought you guys were having an unfinished conversation 😂

@JWCook anyway im happy to review it, but unfortunately you have to resolve the conflict first now.
Sure, would you like to reopen this PR, or should I create a new one?
> Sure, would you like to reopen this PR, or should I create a new one?

up to you. im fine with either.
Alright, rebased and continued in #59.