mariadb-corporation / mariadb-connector-nodejs

MariaDB Connector/Node.js is used to connect applications developed on Node.js to MariaDB and MySQL databases. MariaDB Connector/Node.js is LGPL licensed.
GNU Lesser General Public License v2.1

batch bulk insert gaps #263

Open Gabilabrebi opened 7 months ago

Gabilabrebi commented 7 months ago

When using batch to bulk insert, gaps in AUTO_INCREMENT IDs are created once you reach 50+ inserts, even with innodb_autoinc_lock_mode set to 1 (consecutive).

The MySQL documentation says:

With innodb_autoinc_lock_mode set to 0 (“traditional”) or 1 (“consecutive”), the auto-increment values generated by any given statement are consecutive, without gaps, because the table-level AUTO-INC lock is held until the end of the statement, and only one such statement can execute at a time.

So I'm not sure that's intended. Since insertId only returns one ID, I think it's very important not to have gaps in IDs so we can deduce the inserted ones.
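For reference, a minimal sketch of the behavior described, assuming an open connection conn and an illustrative table t with an AUTO_INCREMENT id column (by default, batch returns a single result object, so only the first generated ID is reported):

    // Minimal repro sketch; the table name `t` is illustrative.
    const res = await conn.batch('INSERT INTO t(val) VALUES (?)', [['a'], ['b'], ['c']]);
    console.log(res.insertId); // first generated ID only; the others must be deduced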

rusher commented 7 months ago

Actually, I'm not sure reconstructing the insert IDs is the way to go. If IDs are needed, it's better to disable bulk by setting the option bulk to false. This permits retrieving all IDs, even if it's slower (still faster than one normal INSERT per parameter set, because of pipelining). In the near future, this will be permitted by https://jira.mariadb.org/browse/MDEV-30366
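For illustration, a minimal sketch of disabling bulk at the connection level (the connection values and the table are illustrative):

    const mariadb = require('mariadb');

    async function insertRows(rows) {
      // bulk: false makes batch() pipeline individual INSERTs instead of
      // using the bulk protocol: slower, but per-row results are available
      const conn = await mariadb.createConnection({
        host: 'localhost',
        user: 'app',
        password: 'secret',
        database: 'test',
        bulk: false
      });
      try {
        return await conn.batch('INSERT INTO t(val) VALUES (?)', rows);
      } finally {
        await conn.end();
      }
    }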

crushton-qg commented 3 months ago

@rusher Hi, I am currently investigating an issue in our DB where we have gaps in inserted auto-increment IDs when using the batch function. We have managed to recreate the bug twice; I have attached an image showing that we ran two separate batch queries with the same 5 rows of data, and on both occasions the same ID gaps occurred at the same point.

I just want to check whether this is a bug in the batch function itself, or whether I need to investigate this further inside our DB / code config.

[Screenshot 2024-04-17 at 16 00 31: the two batch runs showing the same ID gaps]
crushton-qg commented 3 months ago

@rusher Forgot to add that in the table in the image above, we currently have about 400k records but the highest ID is over 500k, meaning about a 20% difference between the row count and the IDs.
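As a quick way to measure this kind of drift (a sketch, assuming an open connection conn; the table and column names are illustrative):

    // Compare the row count with the highest AUTO_INCREMENT value.
    const [stats] = await conn.query('SELECT COUNT(*) AS row_count, MAX(id) AS max_id FROM orders_contents');
    console.log(stats); // e.g. { row_count: 400000n, max_id: 500000n }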

crushton-qg commented 3 months ago

@Gabilabrebi did you manage to find a fix / resolution for this?

Gabilabrebi commented 3 months ago

@crushton-qg Yeah, I just stopped using batch.

    for (let i = 0; i < order.content.length; i++) {
      const x = order.content[i];
      contentInserts.push("(DEFAULT, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 0)");
      contentParameters.push(place, orderId, i, x.item, x.version, x.comment, x.quantity, x.price, x.a, x.b, x.c);
    }

    db.query(`INSERT INTO orders_contents VALUES ${contentInserts.join(",")}`, contentParameters);

Something like this basically achieves the same thing for me, but without any ID issues. It's still one query, and performance is pretty much the same as far as I tested.

crushton-qg commented 3 months ago

@Gabilabrebi Sorry to be a pain, but can you provide a full code sample, please? I have tried to replicate what you have, but I cannot get it to work. I am struggling to follow your sample where you provide multiple (DEFAULT, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 0) groups in your query string.

I have an array of arrays, each inner array containing the row data. I have changed the table names and columns to generic names/data.

I have tried adapting my query to:

            var contentInserts = [], contentParameters = [];
            for (const row of row_array)
            {
                contentInserts.push("(?,?,?,?,?,?,?,?,?)");
                contentParameters.push(row);
            }
            var dlo_rule_log_insert = await conn.query(`INSERT INTO table VALUES ${contentInserts.join(",")}`, contentParameters);

but I am getting a syntax error every time:

    {
      "name": "SqlError",
      "sqlMessage": "Parameter at position 3619 is not set",
      "sql": "INSERT INTO table VALUES (?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?),(...",
      "fatal": false,
      "errno": 45016,
      "sqlState": "HY000",
      "code": "ER_MISSING_PARAMETER"
    }

Gabilabrebi commented 3 months ago

@crushton-qg I guess you just need to change contentParameters.push(row); to contentParameters.push(...row);
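For clarity, the corrected loop from the sample above then becomes (same assumed generic table with nine columns):

    var contentInserts = [], contentParameters = [];
    for (const row of row_array) {
      contentInserts.push("(?,?,?,?,?,?,?,?,?)");
      // spread the row so each ? placeholder receives exactly one parameter
      contentParameters.push(...row);
    }
    var result = await conn.query(
      `INSERT INTO table VALUES ${contentInserts.join(",")}`,
      contentParameters
    );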

crushton-qg commented 3 months ago

@Gabilabrebi That was it, brilliant, thanks for the help!

Gabilabrebi commented 1 month ago

@rusher Does using bulk = false today ensure there are no gaps between IDs?

Is there any documentation on this option and what it does exactly?

I'm trying to use .batch again, but I still need contiguous IDs.

Regards

rusher commented 1 month ago

This is a little more complex than that. I wouldn't rely on IDs without gaps; servers ensure uniqueness, not strict "sequencing".

Servers have auto_increment_increment, defaulting to 1, which indicates the step between generated IDs. For a single server, each INSERT command increments the ID by one. This is not always the case: with Galera servers (multi-master), for example, each server has an increment offset, which avoids having to synchronize the current ID value on each INSERT command. For example, with 3 Galera servers (increment of 3): 2 INSERTs on server A (offset 1), 3 on server B (offset 2), and 1 on server C (offset 3). The IDs created on server A will be 1 and 4; the IDs created on server B will be 2, 5, and 8; the ID created on server C will be 3. So you'll have the resulting IDs 1, 2, 3, 4, 5, 8, and once synchronized, the next ID on server A will be 10.
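As a quick check, these settings can be inspected from the connector (a sketch, assuming an open connection conn; the values shown are illustrative for a 3-node Galera cluster):

    // Inspect the system variables that control ID spacing.
    const rows = await conn.query("SHOW VARIABLES LIKE 'auto_increment%'");
    console.log(rows);
    // e.g. [ { Variable_name: 'auto_increment_increment', Value: '3' },
    //        { Variable_name: 'auto_increment_offset', Value: '1' } ]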

The bulk option corresponds to a real batch command. I'm not completely sure how this is implemented for bulk; the server folks would be better placed to help with the details, but what I suppose is that, in order to deal with multiple connections inserting into the same table, the server can reserve a block of IDs when using bulk, and that might result in gaps. For example, the server might reserve a block of 128 IDs while the bulk finally results in only 100 inserts, and during that time other INSERTs might have been done. This is all for the sake of not blocking other connections when there is no need.

So yes, if using only one server, no Galera, and no bulk, you will normally not have any gaps, but I would clearly not rely on that.

Gabilabrebi commented 1 month ago

@rusher Thank you for your answer.

In my case I have tables A, B, and C. I insert many rows into A, many rows for each A into B with a reference to A.id, and many rows for each B into C with a reference to B.id.

It's currently 3 queries: one for A, one for B, and one for C, but yes, I rely on contiguous IDs.

If I shouldn't rely on contiguous IDs, do I have to make tons of single queries in loops to ensure the proper relations? Or is there a better alternative?

rusher commented 1 month ago

There are various solutions, but all of them rely on knowing the real insert IDs. For example, you can use a sequence to handle the IDs yourself; another solution is simply to take the auto-increment IDs into account.
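For the sequence approach, a minimal sketch, assuming MariaDB 10.3+ (which supports CREATE SEQUENCE) and an open connection conn; the table and sequence names are illustrative:

    // Reserve IDs from a sequence ourselves, so child rows can reference
    // them without depending on AUTO_INCREMENT values.
    await conn.query('CREATE SEQUENCE IF NOT EXISTS a_seq');

    const values = [['foo'], ['bar']];
    const withIds = [];
    for (const [val] of values) {
      const [{ id }] = await conn.query('SELECT NEXTVAL(a_seq) AS id');
      withIds.push([id, val]);
    }
    await conn.batch('INSERT INTO A(id, val) VALUES (?, ?)', withIds);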

For batch, the fullResult option permits retrieving all the IDs. Here is an example (it could be simplified; I just wrote it quickly):

    await shareConn.query('DROP TABLE IF EXISTS B');
    await shareConn.query('DROP TABLE IF EXISTS A');

    const jsonVal = {
      Alan: ['Al'],
      Archibald: ['Archie', 'Baldo'],
      Benjamin: ['Ben', 'Bennie', 'Benjy', 'Jamie']
    };

    await shareConn.query('CREATE TABLE A(id int not null primary key auto_increment, val varchar(20))');
    await shareConn.query(
      'create table B (id int not null primary key auto_increment, nickname varchar(20), id_refA int, foreign key (id_refA) references A(id))'
    );

    const keys = Object.keys(jsonVal).map((x) => [x]);
    //keys : [ [ 'Alan' ], [ 'Archibald' ], [ 'Benjamin' ] ]

    const res = await shareConn.batch({ sql: 'INSERT INTO A(val) value (?)', fullResult: true }, keys);
    // res= [
    //   OkPacket { affectedRows: 1, insertId: 1n, warningStatus: 0 },
    //   OkPacket { affectedRows: 1, insertId: 2n, warningStatus: 0 },
    //   OkPacket { affectedRows: 1, insertId: 3n, warningStatus: 0 }
    // ]
    const aliases = keys
      .map((x, idx) => {
        const key = x[0];
        return jsonVal[key].map((alias) => [alias, res[idx].insertId]);
      })
      .flat();
    console.log(aliases);
    // aliases = [
    //   [ 'Al', 1n ],
    //   [ 'Archie', 2n ],
    //   [ 'Baldo', 2n ],
    //   [ 'Ben', 3n ],
    //   [ 'Bennie', 3n ],
    //   [ 'Benjy', 3n ],
    //   [ 'Jamie', 3n ]
    // ]

    const res2 = await shareConn.batch(
      { sql: 'INSERT INTO B(nickname, id_refA) value (?, ?)', fullResult: true },
      aliases
    );

Gabilabrebi commented 1 month ago

Oh, I didn't know about this fullResult: true option.

This is exactly what I was looking for. With this, there's no more need for contiguous IDs. I just add the option when I want the IDs, and I can still benefit from the full speed of batch when I don't need them.

Thanks a ton!