move versioning to the code and handle unversioned S3 Buckets

candrsn commented 3 years ago

This PR fixes a problem when accessing data stored in an unversioned S3 bucket.

There is one other change: moving the version info into the code. (I can retract that part if it is not wanted.)

This URL demonstrates the problem: https://fema-cap-imagery.s3.us-east-1.amazonaws.com/Support/cap_index.gpkg

A sample query might be:

    import sqlite_s3_query

    with sqlite_s3_query.sqlite_s3_query("https://fema-cap-imagery.s3.us-east-1.amazonaws.com/Support/cap_index.gpkg") as query:
        with query("""SELECT fid, Id, ImageEventName FROM images LIMIT 4 OFFSET 1000""") as (cols, rows):
            ......

michalc commented 3 years ago

Hello 👋

Thanks for the PR

However, I think I'm fairly anti this supporting non-versioned buckets... I pretty much want it to fail if it's unversioned, so I know to activate versioning on it. (In fact in doing a bit of testing a day or two ago, it failed because I had forgotten to switch on versioning, and I was very pleased that it did)

I'm also torn on having the version in the code: what's the benefit of this?

michalc commented 3 years ago

After thinking further, I'm still happy with sqlite-s3-query not working with unversioned buckets.

Wrufesh commented 1 year ago

I am one of the users who do not want to enable versioning. I'm not sure why you would cut off users who do not want to enable versioning.

michalc commented 1 year ago

Hi @Wrufesh,

It's to keep the scope of this project very focused - my aim is for it to be good at querying SQLite files on versioned buckets.

It's not a free thing to have code for more cases. The code will be more complex, and so there will be costs. Firstly, a cost in terms of maintenance time to keep this "good" - for example, lots of tests would have to run on both versioned and unversioned buckets. And secondly, a cost in terms of risk - having this behaviour in the code runs the risk of wanting versioning, but accidentally running on unversioned buckets, which then risks queries failing in strange ways if they run during the replacement of the database object (or even worse, succeeding with strange results). The "REPEATABLE READ" behaviour that's mentioned in the README depends on the bucket being versioned. These costs are essentially incurred by myself and other users that do want to run with versioning.

Not sure if it's helpful to say, or even maybe a bit... condescending (and if so I apologise), but I know it's frustrating to be on the side that's deemed out of scope of a project: I have been there myself. The one thing I can emphasise is that the code is very permissively licensed with its MIT license. You are free to add this behaviour into a fork and use the fork on unversioned buckets in virtually any project.

That all being said...

I am one of the users who do not want to enable versioning

I am curious as to why you don't want to enable versioning?

Michal

Wrufesh commented 1 year ago

Hello @michalc,

Thank you for your reply.

To be honest, I have a limited understanding of how it works. I believe that we are creating a virtual file system (VFS) and making range requests based on the actions performed on that VFS.
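As a rough illustration of that idea only (this is not the library's actual code, and the function name `range_header` is hypothetical): when SQLite asks the virtual file system for some number of bytes at a given offset, the read can be translated into an HTTP `Range` header along these lines.

```python
def range_header(offset: int, amount: int) -> dict:
    """Build an HTTP Range header for reading `amount` bytes starting at `offset`.

    Hypothetical helper for illustration; not part of sqlite-s3-query's API.
    """
    if offset < 0 or amount <= 0:
        raise ValueError("offset must be >= 0 and amount must be > 0")
    # HTTP byte ranges are inclusive at both ends, hence the "- 1".
    return {"range": f"bytes={offset}-{offset + amount - 1}"}


# Reading the second 4096-byte page of a database file:
print(range_header(4096, 4096))
```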

I was able to work with un-versioned buckets by making this small change. #84

Please take a moment to take a look and let me know if it is acceptable.

I do not want to enable versioning because I want to replace the file every time it is written to S3 with the same filename. This is because I am using S3 as a backend. The SQLite database that is written is the latest index of some dataset being uploaded to my system.

Rupesh

michalc commented 1 year ago

I do not want to enable versioning because I want to replace the file every time it is written to S3 with the same filename.

Ah so this is exactly the case when enforcing versioning is useful, since it allows this without risk:

1. A query starts on the object
2. The object is replaced
3. The query continues on the object

Without versioning, at step 3, sqlite will see a corrupted database because it finishes the query on one that is different to the one it started on. With versioning, step 3 will use the same version as it used in step 1, and so the query will complete successfully.
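The three steps can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: `VersionedBucket` and `read_in_two_steps` are hypothetical names, not part of sqlite-s3-query or any S3 client, and real S3 behaviour has more detail than this models.

```python
import uuid


class VersionedBucket:
    """Toy in-memory stand-in for a versioned bucket (not a real S3 client)."""

    def __init__(self):
        self._versions = {}  # (key, version_id) -> bytes
        self._latest = {}    # key -> most recent version_id

    def put(self, key, body):
        version_id = uuid.uuid4().hex
        self._versions[(key, version_id)] = body
        self._latest[key] = version_id
        return version_id

    def head(self, key):
        # Returns the version that is current right now.
        return self._latest[key]

    def get_range(self, key, start, end, version_id=None):
        # With no version_id, read whatever is current (unversioned behaviour).
        if version_id is None:
            version_id = self._latest[key]
        return self._versions[(key, version_id)][start:end]


def read_in_two_steps(bucket, key, pin_version):
    # Step 1: the query starts, optionally pinning the current version.
    version_id = bucket.head(key) if pin_version else None
    first = bucket.get_range(key, 0, 4, version_id)
    # Step 2: the object is replaced while the query is in progress.
    bucket.put(key, b"BBBBBBBB")
    # Step 3: the query continues on the object.
    second = bucket.get_range(key, 4, 8, version_id)
    return first + second


pinned = VersionedBucket()
pinned.put("db.sqlite", b"AAAAAAAA")
print(read_in_two_steps(pinned, "db.sqlite", pin_version=True))   # b'AAAAAAAA'

unpinned = VersionedBucket()
unpinned.put("db.sqlite", b"AAAAAAAA")
print(read_in_two_steps(unpinned, "db.sqlite", pin_version=False))  # b'AAAABBBB'
```

With the version pinned, both reads come from the same object; without it, the result mixes bytes from two different files, which is the kind of inconsistency SQLite would see as corruption.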

Nothing in versioning prevents you from always starting queries on the most recent version of an object at the time if that's what you want. In fact, that's exactly how I use sqlite-s3-query.

I was able to work with un-versioned buckets by making this small change. https://github.com/michalc/sqlite-s3-query/pull/84 Please take a moment to take a look and let me know if it is acceptable.

I'm going to say no for the reasons above

Wrufesh commented 1 year ago

Thank you for your response.

I have limited S3 resources, and I do not want to save different versions of objects. This case only occurs when an update happens, and in my case updates to the SQLite file are rare.

michalc commented 1 year ago

I have limited S3 resources, and I do not want to save different versions of objects.

Ah I didn't consider this. Then would a lifecycle policy in S3 be acceptable? I suspect the shortest time is 1 day, so there would be at most 1 day of both versions existing at the same time. So something like this configured in the console for the bucket?
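For reference, such a rule can also be expressed outside the console. The following is a minimal sketch, assuming boto3 is used to apply it; the helper name `noncurrent_expiry_config`, the rule ID, and the bucket name in the comment are made up for illustration, and whether MinIO or another S3-compatible store honours the same rule would need checking.

```python
def noncurrent_expiry_config(days: int = 1) -> dict:
    """Build an S3 lifecycle configuration that expires noncurrent versions.

    Hypothetical helper: returns the structure that boto3's
    put_bucket_lifecycle_configuration expects as LifecycleConfiguration.
    """
    if days < 1:
        raise ValueError("days must be a positive whole number")
    return {
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",  # arbitrary label
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: apply to every object
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


config = noncurrent_expiry_config(days=1)
print(config)

# Applying it would look roughly like this (requires boto3 and credentials;
# "my-bucket" is a placeholder):
#
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket",
#       LifecycleConfiguration=config,
#   )
```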

michalc commented 1 year ago

I'm also wondering, how big are the objects that we're talking about?

michalc commented 1 year ago

my case the update of the sqlite file is rare.

Ah and also, how rare is expected? Once a day/month/year?

Wrufesh commented 1 year ago

I am using a MinIO object store on a private cloud. In my case S3 is used as a backend. The created objects are always associated with some kind of batch job. Batch jobs can be versioned; output objects from versioned jobs are always constant.

In my use case there is no use of bucket versioning.

michalc commented 1 year ago

So I don't really understand, I have to admit. But can I push on:

In my use case there is no use of bucket versioning.

Can you expand on why you can't enable it? I'm not sure if I'm missing something, but there shouldn't be a code impact - nothing needs to care about versioning if it doesn't want to - and the cost impact, I would imagine, is relatively minimal in most cases, especially as you seem to have a non-trivial setup with a few moving parts already.

And for background/context, are you looking for sqlite-s3-query to query minio, or S3?

Wrufesh commented 1 year ago

I am using sqlite-s3-query with MinIO, as it supports the S3 API.

Currently I am using a fork of sqlite-s3-query which doesn't require versioning.

I can ask the provider to turn on versioning for me. But I want to be free to choose, as S3 object storage has given us the freedom to choose whether to enable versioning or not.

The S3 provider could force users to use versioning, but it chooses to give that freedom to its users.

Similarly, I propose giving that freedom here as well.

That is just my personal opinion but I understand your concerns. 🙂

michalc commented 1 year ago

I've pondered a bit more, and I'm going to stick with requiring versioning.

You can make a fork as you have
Or, it sounds like in your specific case, you can probably enable versioning if you need to
Or, I've just realised that there is another library, https://github.com/litements/s3sqlite, that I don't think requires versioning

It sounds like there are options - from what I can tell, nothing in keeping this library for versioned buckets prevents you from achieving your aims.

michalc / sqlite-s3-query

move versioning to the code and handle unversioned S3 Buckets #10