Closed candrsn closed 3 years ago
Hello 👋
Thanks for the PR
However, I think I'm fairly anti this supporting non-versioned buckets... I pretty much want it to fail if it's unversioned, so I know to activate versioning on it. (In fact in doing a bit of testing a day or two ago, it failed because I had forgotten to switch on versioning, and I was very pleased that it did)
I'm also torn on having the version in the code: what's the benefit of this?
After thinking further, I'm still happy with sqlite-s3-query not working with unversioned buckets.
I am one of the user who do not want to enable versioning. Not sure why you think to cut off users who do not want to enable versioning.
Hi @Wrufesh,
It's to keep the scope of this project very focused - my aim is for it to be good at querying SQLite files on versioned buckets.
It's not a free thing to have code for more cases. The code will be more complex, and so there will costs. Firstly, a cost in terms of maintenance time to keep this "good" - for example lots of tests would have to run both on versioned and unversioned buckets. And secondly, a cost in terms of risk - having this behaviour in the code runs the risk of wanting versioning, but accidentally running on unversioned buckets, which then risks queries failing in strange ways if they run during the replace of the database object (or even worse, succeeding with strange results). The "REPEATABLE READ" behaviour that's mentioned in the README - this depends on the bucket being versioned. These costs are essentially incurred by myself and other users that do want to run with versioning.
Not sure if it's helpful to say, or even maybe a bit... condescending (and if so I apologise), but I know it's frustrating to be on the side that's deemed out of scope of a project: I have been there myself. The one thing I can emphasise is that the code is very permissibly licensed with its MIT license. You are free to add this behaviour into a fork and use the fork on unversioned buckets in virtually any project.
That all being said...
I am one of the user who do not want to enable versioning
I am curious as to why you don't want to enable versioning?
Michal
Hello @michalc,
Thank you for your reply.
To be honest I have a limited understanding of how it works. I believe that we are creating virtual file system and making range request based on actions performed on vfs.
I was able to work with un-versioned buckets by making this small change. #84
Please take a moment to take a look and let me know if it is acceptable.
I do not want to enable versioning because I want to replace the file every time it is written to s3 with same filename. This is so because I am using s3 as a backend. The sqlite db written is latest index of some dataset being uploaded on my system.
Rupesh
I do not want to enable versioning because I want to replace the file every time it is written to s3 with same filename.
Ah so this is exactly the case when enforcing versioning is useful, since it allows this without risk:
Without versioning, at step 3, sqlite will see a corrupted database because it finishes the query on one that is different to the one it started on. With versioning, step 3 will use the same version as it used in step 1, and so the query will complete successfully.
Nothing in versioning prevents you from always starting queries on the most recent version of an object at the time if that's what you want. In fact, that's exactly how I use sqlite-s3-query.
I was able to work with un-versioned buckets by making this small change. https://github.com/michalc/sqlite-s3-query/pull/84 Please take a moment to take a look and let me know if it is acceptable.
I'm going to say no for the reasons above
Thank you for your response.
I have limited s3 resource, and I do not want to save different versions of objects. This is the case which only occurs when update happens, and in my case the update of the sqlite file is rare.
I have limited s3 resource, and I do not want to save different versions of objects.
Ah I didn't consider this. Then would a lifecycle policy in S3 be acceptable? I suspect the shortest time is 1 day, so there would be at most 1 day of both versions existing at the same time. So something like this configured in the console for the bucket?
I'm also wondering, how big are the objects that we're talking about?
my case the update of the sqlite file is rare.
Ah and also, how rare is expected? Once a day/month/year?
I am using minio object store on private cloud. In my case s3 is used as backend. The created objects are always associated with some kind of batch job. Batch job can be versioned; output objects from versioned jobs are always constant.
In my use case there is no use of bucket versioning.
So I don't really understand I have to admit. But can I push on:
In my use case there is no use of bucket versioning.
Can you expand on why you can't enable it? I'm not sure if I'm missing something, but there shouldn't be a code impact - nothing needs to care about versioning if it doesn't want to, and cost impact I would imagine is relatively minimal in most cases - especially if you seem to have a non trivial setup with a few moving parts already.
And for background/context, are you looking for sqlite-s3-query to query minio, or S3?
I am using sqlite-s3-query with minio as it supports s3 api.
Currently I am using the fork of sqlite-S3-query which doesn't require versioning.
I can ask the provider to turn on the versioning for me. But I want to be be free to choose as s3 object storage has given us freedom to choose whether to enable versioning or not.
S3 provider could force the user to use versioning but it choose to give freedom to it's user.
Similarly I propose to give here freedom as well.
That is just my personal opinion but I understand your concerns. 🙂
I've pondered a bit more, and I'm going to stick with requiring versioning
It sounds like there are options - from what I can tell, nothing in keeping this library for versioned buckets prevents you from achieving your aims.
This PR fixes a problm when accessing data stored in an unversioned S3 bucket.
There is one other change. moving the version info into the code. (I can retract that part if it is not wanted)
This url demonstrates the problem https://fema-cap-imagery.s3.us-east-1.amazonaws.com/Support/cap_index.gpkg
A sample query might be with sqlite_s3_query.sqlite_s3_query("https://fema-cap-imagery.s3.us-east-1.amazonaws.com/Support/cap_index.gpkg") as query: with query("""SELECT fid, Id, ImageEventName FROM images LIMIT 4 OFFSET 1000""") as (cols, rows): ......