Here's a rough draft of a SQLite-based bucket (closes #37). FIFO behavior is done with an auto-incrementing index (on write) and `ORDER BY` (on read).

Would you prefer for this to be in its own module (and maybe put `RedisBucket` in its own module as well), or keep everything in the `bucket` module?
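For anyone skimming, here's a minimal sketch of that approach; the table and column names (`bucket`, `idx`, `item`) are placeholders, not the PR's actual schema:

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# An auto-incrementing integer key records insertion order on write
connection.execute(
    "CREATE TABLE bucket (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)
connection.execute("INSERT INTO bucket (item) VALUES (?)", (123.45,))

# Reading with ORDER BY on that key yields FIFO order
row = connection.execute(
    "SELECT idx, item FROM bucket ORDER BY idx LIMIT 1"
).fetchone()
connection.execute("DELETE FROM bucket WHERE idx = ?", (row[0],))
connection.commit()
```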
I think the current change looks fine. It deserves to be in its own module.
This should be using prepared statements. While it may appear that all of the data is controlled explicitly by local code, it is feasible for the database to be accessed by other processes. There's almost never a reason not to use prepared statements, and this is no exception.
As for making `get` more efficient, I'm not sure how much more efficient it could become. The documentation indicates that you cannot simply delete a given number of rows without a special compilation flag, so selection and deletion must be done in separate steps. The only possible SQL speedup I could see would be making one transaction instead of two (easily done with a context manager), which is surely insignificant unless you're trying to rate limit thousands of outbound requests per second.
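For illustration, that single-transaction version might look something like this (a sketch against the placeholder schema above, not the PR's code):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE bucket (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)
connection.executemany("INSERT INTO bucket (item) VALUES (?)", [(1.0,), (2.0,)])

number = 1  # how many items to dequeue

# Using the connection as a context manager wraps both statements in a
# single transaction: commit on success, rollback on exception
with connection:
    keys = [
        row[0]
        for row in connection.execute(
            "SELECT idx FROM bucket ORDER BY idx LIMIT ?", (number,)
        )
    ]
    connection.executemany(
        "DELETE FROM bucket WHERE idx = ?", [(key,) for key in keys]
    )
```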
The only exception I could see would be to use a `RETURNING` clause on insertion and keep all the keys in memory as well, essentially using the database as a persistent duplicate of an in-memory queue. This would eliminate the need for the first SELECT statement, but I'm not entirely sure it would be worth it. Atomic operations would be very difficult and the speedup would be minimal.
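A rough sketch of that idea, assuming SQLite >= 3.35 (where `RETURNING` was added) and the same placeholder schema:

```python
import sqlite3
from collections import deque

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE bucket (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)
keys = deque()  # in-memory mirror of the row keys

# RETURNING hands back the generated key without a separate SELECT
cursor = connection.execute(
    "INSERT INTO bucket (item) VALUES (?) RETURNING idx", (1.0,)
)
keys.append(cursor.fetchone()[0])

# get() can then delete by remembered key directly
connection.execute("DELETE FROM bucket WHERE idx = ?", (keys.popleft(),))
```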
I agree that the room for optimization with `get` is minimal here, and probably not worth it.
> There's almost never a reason not to use prepared statements

That's news to me. Can you elaborate? Python's `sqlite3` API doesn't expose any functions for working with them directly, and only uses them internally for parametrized queries. Depending on connection settings, SQLite is already safe to use from multiple threads or processes, with the caveat that it internally locks and queues up concurrent write operations.
> > There's almost never a reason not to use prepared statements
>
> That's news to me. Can you elaborate? Python's `sqlite3` API doesn't expose any functions for working with them directly, and only uses them internally for parametrized queries. Depending on connection settings, SQLite is already safe to use from multiple threads or processes, with the caveat that it internally locks and queues up concurrent write operations.

No, you are correct; I meant parameterization. That was a typo on my part. Oops! In any case, there's almost never a reason not to use parameterized queries.
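To make the terminology concrete, here's the difference in `sqlite3` (placeholder schema again):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE bucket (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)
item = 123.45

# Parameterized: the driver binds the value; it is never spliced into the SQL
connection.execute("INSERT INTO bucket (item) VALUES (?)", (item,))

# String formatting: splices the value into the statement text, which is
# where injection problems come from if the value is untrusted
connection.execute(f"INSERT INTO bucket (item) VALUES ({item})")
```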
What are your thoughts about the size speedup? It would be fairly easy to implement; I'm just not sure which option is better. Querying the database every time would be a little more robust if something else is simultaneously using the same database for some reason and messing with the bucket, but it still isn't atomic, so unexpected issues there would just be less common, not prevented.
I think the only tradeoff would be implementation complexity versus speed. Perhaps setting up a speed test would be worth it?
Anyway, I think the implementation would be something like:

```python
def __init__(self, maxsize=0, identity: str = None, path: str = None, **kwargs):
    ...
    # Cache the row count once, then keep it updated on put/get
    self._size = self.connection.execute(f"SELECT COUNT(*) FROM {self.table}").fetchone()[0]

def size(self) -> int:
    return self._size

def put(self, item: float) -> int:
    if self.size() < self.maxsize:
        ...
        self._size += 1
        return 1
    return 0

def get(self, number: int = 1) -> float:
    ...
    self._size -= len(keys)
    return len(keys)
```
> What are your thoughts about the size speedup?

Based on some quick tests, `SELECT COUNT` is faster than I thought, with an average time of 0.006 milliseconds on my machine for a table with 100,000 items. So I'm leaning towards just using `SELECT COUNT` for now, unless a scenario comes up where it somehow runs slower than the intended request rate. Does that sound reasonable?
> > What are your thoughts about the size speedup?
>
> Based on some quick tests, `SELECT COUNT` is faster than I thought, with an average time of 0.006 milliseconds on my machine for a table with 100,000 items. So I'm leaning towards just using `SELECT COUNT` for now, unless a scenario comes up where it somehow runs slower than the intended request rate. Does that sound reasonable?
I average about 0.001 milliseconds for a 1,000-row table and 2.2 milliseconds for a 1,000,000-row table. I expect this is fine for this use case.
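For anyone who wants to reproduce the measurement, a quick sketch (schema and row count are illustrative):

```python
import sqlite3
import timeit

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE bucket (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)
connection.executemany(
    "INSERT INTO bucket (item) VALUES (?)", ((float(i),) for i in range(1_000_000))
)

elapsed = timeit.timeit(
    lambda: connection.execute("SELECT COUNT(*) FROM bucket").fetchone()[0],
    number=1000,
)
# Total seconds over 1,000 runs == average milliseconds per run
print(f"{elapsed:.3f} ms per SELECT COUNT(*)")
```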
Just finished up the tests and docs. Thanks for reviewing!
@vutran1710 I think this is ready to go, whenever you have time to take a look.
Actually... the table name should probably be parameterized, as it is user input.

Edit: there does not seem to be functionality for this. I'll research it.
That did cross my mind, and some ORM/DBAPI libraries have features for sanitizing arbitrary input for SQL statements, but I don't know of an easy way to do that with just the stdlib. Let me know if you find one, or you could add that in a separate PR if you'd like.
I don't think SQL injection is worth worrying about here. Worst case is a user might pick a bucket name that's not a valid table name, get an exception, and have to pick a different name. Or are there other potential problems here that you can think of?
Okay, there really is no supported way of doing this. Unfortunately, our table names are user input and are even more susceptible to issues than most other areas.

Possible strategies:

- Document that `identity` is used in queries without being checked ("check it yourself" in the docs).
- Hash the identity: a table name like `pyrate_22596363b3de40b0` is not very clean, and there is technically a possibility of collisions, though that could be ignored with a long enough hash.
- Verify that `identity` is alphanumeric, or replace non-compliant characters with underscores, and throw a `ValueError` if verification fails. Replacement could cause collisions in niche use cases where multiple instances are running on the same SQL server and have weird identities, e.g. "hello#world" and "hello&world" (see the sketch below).
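A minimal sketch of the verification option; the helper name is hypothetical:

```python
import re

def validate_identity(identity: str) -> str:
    """Hypothetical helper: allow only ASCII alphanumerics and underscores"""
    if not re.fullmatch(r"\w+", identity, flags=re.ASCII):
        raise ValueError(f"Invalid identity for table name: {identity!r}")
    return identity

print(validate_identity("hello_world"))  # hello_world
# validate_identity("hello#world")       # raises ValueError
```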
s, e.g. "hello#world" and "hello&world" For SQLite, if the table name is surrounded by quotes or brackets (which I'll add shortly) there are almost no naming restrictions except that it can't start with sqlite_
. So I'm not sure this is a problem.
```python
import sqlite3

connection = sqlite3.connect('test.db')

# Surprisingly works
connection.execute('CREATE TABLE "somehow_valid⁉️ 🙃 🤷" (value)')
connection.execute('CREATE TABLE "; DROP TABLE test; --" (value)')

# Raises an error
connection.execute('CREATE TABLE "sqlite_something" (value)')
```
> Actually... the table name should probably be parameterized, as it is user input. Edit: there does not seem to be functionality for this. I'll research it.
>
> That did cross my mind, and some ORM/DBAPI libraries have features for sanitizing arbitrary input for SQL statements, but I don't know of an easy way to do that with just the stdlib. Let me know if you find one, or you could add that in a separate PR if you'd like.
>
> I don't think SQL injection is worth worrying about here. Worst case is a user might pick a bucket name that's not a valid table name, get an exception, and have to pick a different name. Or are there other potential problems here that you can think of?
Sure, there are other potential problems, but I also wouldn't consider that a non-issue. Let's take a contrived example: someone makes a proxy scraper service for a website, and unfortunately, each user's username is used as the identity for their table. A malicious actor decides she wants to attack the proxy scraper, so she sets her username to `table_name (my new table definition) --` and suddenly she can make the table whatever she wants. She could, for example, add a bunch of stored generated columns designed to be needlessly expensive on storage and/or CPU and ruin other users' speed. She could remove the auto-incrementing index, which would cause issues we can't guarantee are handled by the library user. Perhaps she could use a stored generated blob column that takes up unbounded space; that would be hard to debug. She could do many things. I don't immediately see how, but this could possibly open up further attacks.

We don't know how `PyrateLimiter` will be used, and even if we did, once you can see an attack vector, there are often more attacks using it than you can reasonably expect to consider.
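Here's a contrived, self-contained illustration of that username trick, assuming (hypothetically) that the table name is spliced into the statement unquoted; all names here are made up:

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Hypothetical malicious identity: supplies its own column list,
# then comments out the intended table definition
identity = "bucket (attacker_defined_column) --"

# Unsanitized interpolation: still one valid statement, but the attacker
# now controls the schema
connection.execute(
    f"CREATE TABLE {identity} (idx INTEGER PRIMARY KEY AUTOINCREMENT, item REAL)"
)

# Only the attacker's column exists; the intended columns are gone
print(connection.execute("PRAGMA table_info(bucket)").fetchall())
```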
Honestly, the more I think about it, the less I hate the hashing system. The only possible case where having non-descriptive table names is an issue is if some other program is accessing our bucket. I'm not sure why that would ever happen, but if someone needed that for whatever reason, they can get the names programmatically with `.table`, so... whatever?
Any of the guaranteed hash algorithms would work, but using sha1 it would be as simple as:

```python
import hashlib

class SQLiteBucket(AbstractBucket):
    def __init__(self, maxsize=0, identity: str = None, path: str = None, **kwargs):
        ...
        # Derive a stable, SQL-safe table name from the user-provided identity
        self._table = "someprefix_" + hashlib.sha1(identity.encode()).hexdigest()
```
> Honestly, the more I think about it, the less I hate the hashing system.

Yeah, that looks reasonable to me too. I just added that.
Closed due to inactivity
@vutran1710 Which changes do you think need to be implemented before merging this? This is complete enough for my own purposes, and I imagine 90% of other use cases. The other suggestions above could be separate issues/PRs if anyone really needs them.
> @vutran1710 Which changes do you think need to be implemented before merging this? This is complete enough for my own purposes, and I imagine 90% of other use cases. The other suggestions above could be separate issues/PRs if anyone really needs them.

oh so that means its done already? I thought you guys were having an unfinished conversation 😂
> > @vutran1710 Which changes do you think need to be implemented before merging this? This is complete enough for my own purposes, and I imagine 90% of other use cases. The other suggestions above could be separate issues/PRs if anyone really needs them.
>
> oh so that means its done already? I thought you guys were having an unfinished conversation 😂

@JWCook anyway im happy to review it, but unfortunately you have to resolve the conflict first now.
Sure, would you like to reopen this PR, or should I create a new one?
> Sure, would you like to reopen this PR, or should I create a new one?

up to you. im fine with either.
Alright, rebased and continued in #59.