tarantool / crud

Easy access to data stored in vshard cluster
BSD 2-Clause "Simplified" License

Add batch insert/upsert/insert_objects/upsert_objects #193

Closed no1seman closed 2 years ago

no1seman commented 3 years ago

When loading a large volume of cold data, it would be great to be able to insert/upsert data in batches (lists of tuples/objects). These functions will be called directly from cartridge-java or other clients.

olegrok commented 3 years ago

Related to https://github.com/tarantool/vshard/issues/176

Totktonada commented 3 years ago

@unera Please clarify: what is the priority of this feature?

unera commented 3 years ago

both inserts.

no1seman commented 3 years ago

I should say that this task is not as easy as it may seem. First of all, @olegrok mentioned some additional functionality we need in vshard; we also need to support this feature in some popular language connectors: Java/Go/Python ... The main problem is how to report succeeded and failed operations from the batch to the caller. It seems this feature needs additional triage.

Totktonada commented 3 years ago

We can implement batching without cluster-wide consistency guarantees for now (those require 2PC, distributed transactions or something of this kind), but with detailed error reporting. Is there a need for such an intermediate step?
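To illustrate why such batching gives no cluster-wide guarantees, here is a minimal Python sketch of splitting a batch into per-storage groups. Everything in it is an assumption made for illustration: `BUCKET_COUNT`, the crc32-based `bucket_id`, and the modulo mapping from bucket to storage are hypothetical stand-ins, not the actual vshard hashing or routing.

```python
from collections import defaultdict
from zlib import crc32

BUCKET_COUNT = 3000  # assumed value, for illustration only


def bucket_id(key):
    """Hypothetical bucket calculation; the real vshard hash may differ."""
    return crc32(str(key).encode("utf-8")) % BUCKET_COUNT + 1


def split_by_storage(tuples, storage_names):
    """Group tuples by the storage that would own them.

    The first field of each tuple is treated as the sharding key.
    The bucket-to-storage mapping is a plain modulo, which is an
    assumption of this sketch.
    """
    groups = defaultdict(list)
    for tup in tuples:
        name = storage_names[bucket_id(tup[0]) % len(storage_names)]
        groups[name].append(tup)
    return dict(groups)
```

Each group would then be sent to its storage independently, so a failure on one storage does not roll back what another storage has already applied. That is the gap that 2PC or distributed transactions would close.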

Totktonada commented 3 years ago

@dsharonov agreed on that and highlighted that we should return an array of errors.

denesterov commented 3 years ago

There is another use case, much more common (and, I think, important) than bulk operations.

If you have a slightly complex data structure, not just Key/Document, you are in trouble with crud.

For example, we have a customer record in one space and the customer's orders in a second space, both sharded identically. Let's say we need to close one order and create another in one move, or just store a customer record with all its orders at once, without interleaving with other changes or read operations.

CRUD cannot do this.

akudiyar commented 2 years ago

@denesterov

For example, we have a customer record in one space and the customer's orders in a second space, both sharded identically. Let's say we need to close one order and create another in one move, or just store a customer record with all its orders at once, without interleaving with other changes or read operations.

I have a proposal for a function registration API as a basis for implementing such cases: tarantool/cartridge#1799

Totktonada commented 2 years ago

Well, it seems we missed the moment to discuss the task. I should clarify what we're going to implement here.

All failed or unperformed operations will be reported in an array of errors. The presence of errors means failure or partial success. Each error will contain the tuple (or object, or upsert operation) it relates to. So it'll be possible to proceed after a partial success (say, retry the failed operations).
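The reporting contract described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `insert_many`, `failed_tuples`, the dict standing in for a space, and the `err`/`tuple` field names are hypothetical and are not the crud API.

```python
def insert_many(space, tuples):
    """Insert tuples into `space` (a dict: primary key -> tuple).

    Returns a list of errors. An empty list means full success,
    a non-empty list means failure or partial success.
    """
    errors = []
    for tup in tuples:
        key = tup[0]  # first field is treated as the primary key
        if key in space:
            # Each error carries the tuple it relates to.
            errors.append({"err": f"duplicate key {key!r}", "tuple": tup})
        else:
            space[key] = tup
    return errors


def failed_tuples(errors):
    """Collect the tuples of failed operations, e.g. to retry them."""
    return [e["tuple"] for e in errors]


space = {2: (2, "old")}
errors = insert_many(space, [(1, "a"), (2, "b"), (3, "c")])
to_retry = failed_tuples(errors)  # only the tuple with key 2 is left
```

Because every error keeps its original tuple, the caller can rebuild a smaller batch from the errors alone, without tracking which positions of the original batch succeeded.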

We'll add options to control behaviour of operations on storages:

(Maybe we'll change option names and defaults. This message is to give the idea.)

Insert and replace are enough to implement importing data into an empty sharded Tarantool cluster.

We didn't take into account the CDC usage scenario, which requires JDBC-style batching with the ability to mix different operations in a batch. We can work on it in a separate issue.

Sorry if you expected a wider scope of work.