pacman82 / odbc2parquet

A command line tool to query an ODBC data source and write the result into a parquet file.

Negative row group ordinal: TryFromIntError(()) #652

Open andreypanchenko opened 13 hours ago

andreypanchenko commented 13 hours ago

Hello Markus,

I actively use your CLI, but unfortunately I ran into this:

thread 'main' panicked at /home/redacted/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-53.0.0/src/file/writer.rs:369:10:
Negative row group ordinal: TryFromIntError(())
stack backtrace:
   0:     0x5cf28330138a - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h1b9dad2a88e955ff
   1:     0x5cf28310a51b - core::fmt::write::h4b5a1270214bc4a7
   2:     0x5cf2832d4d02 - std::io::Write::write_fmt::hd04af345a50c312d
   3:     0x5cf283302584 - std::panicking::default_hook::{{closure}}::h96ab15e9936be7ed
   4:     0x5cf283302ebb - std::panicking::rust_panic_with_hook::hfe205f6954b2c97b
   5:     0x5cf2833028b5 - std::panicking::begin_panic_handler::{{closure}}::h6cb44b3a50f28c44
   6:     0x5cf283302819 - std::sys::backtrace::__rust_end_short_backtrace::hf1c1f2a92799bb0e
   7:     0x5cf283302804 - rust_begin_unwind
   8:     0x5cf28306f092 - core::panicking::panic_fmt::h3d8fc78294164da7
   9:     0x5cf28306f4f5 - core::result::unwrap_failed::hfa79a499befff387
  10:     0x5cf2831adef7 - parquet::file::writer::write_bloom_filters::he352109071214bb9
  11:     0x5cf2831abc5b - core::ops::function::FnOnce::call_once{{vtable.shim}}::h01774ca2bfef6332
  12:     0x5cf2831b3082 - parquet::file::writer::SerializedRowGroupWriter<W>::close::h95b1d12fdb6b3459
  13:     0x5cf28317364a - odbc2parquet::query::current_file::CurrentFile::write_row_group::haadd6da05f0d723c
  14:     0x5cf2831a75d3 - <odbc2parquet::query::parquet_writer::FileWriter as odbc2parquet::query::parquet_writer::ParquetOutput>::write_row_group::h546fc5674df32a7b
  15:     0x5cf2831aa160 - odbc2parquet::query::table_strategy::TableStrategy::block_cursor_to_parquet::h0a6a805f7e769197
  16:     0x5cf28318186c - odbc2parquet::main::h0ec0f3dfe8c71d2e
  17:     0x5cf2831b6ab3 - std::sys::backtrace::__rust_begin_short_backtrace::haafe2784505e331e
  18:     0x5cf283176659 - main
  19:     0x73ac2642a1ca - <unknown>
  20:     0x73ac2642a28b - __libc_start_main
  21:     0x5cf283077e45 - _start
  22:                0x0 - <unknown>
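
For context: the Parquet Thrift metadata stores each row group's ordinal as an i16, so a single file can only address 32,767 row groups, and the backtrace shows parquet-rs converting that ordinal back to an unsigned index inside write_bloom_filters. A minimal standalone sketch (plain Rust, not odbc2parquet or parquet-rs code) of how row group index 32,768 produces exactly this TryFromIntError:

    fn main() {
        // One past i16::MAX = 32_767 ...
        let row_group_index: usize = 32_768;
        // ... wraps negative when narrowed (Rust `as` casts truncate) ...
        let ordinal = row_group_index as i16; // -32_768
        // ... so converting back to an unsigned index fails, matching the
        // "Negative row group ordinal: TryFromIntError(())" panic above.
        let as_unsigned: Result<u16, _> = u16::try_from(ordinal);
        println!("ordinal = {ordinal}, as u16 = {as_unsigned:?}");
    }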

What I am trying to do

  sql_query="SELECT * FROM ${database}.${table} WHERE id BETWEEN ${start_id} AND ${end_id}"

  odbc2parquet query \
    --dsn "$dsn" \
    --batch-size-memory 3060Mib \
    --file-size-threshold 150Mib \
    "${table}_batch_${start_id}.parquet" \
    "$sql_query"

odbc2parquet 6.1.1

odbc2parquet list-drivers
mysql_unicode
        SETUP=/usr/lib/x86_64-linux-gnu/odbc/libmyodbc9S.so
        UsageCount=1
        DRIVER=/usr/lib/x86_64-linux-gnu/odbc/libmyodbc9w.so

mysql_ansi
        DRIVER=/usr/lib/x86_64-linux-gnu/odbc/libmyodbc9a.so
        SETUP=/usr/lib/x86_64-linux-gnu/odbc/libmyodbc9S.so
        UsageCount=1

Table schema

create table my_schema.my_table
(
    suid       int auto_increment primary key,
    user_id    int unsigned default 0 not null,
    type       smallint unsigned null,
    created_at timestamp default CURRENT_TIMESTAMP not null,
    settings   json null,
    constraint id_unique unique (suid)
)
    charset = utf8;

create index created_at_idx
    on my_schema.my_table (created_at);

create index type_idx
    on my_schema.my_table (type);

create index user_idx
    on my_schema.my_table (user_id);
andreypanchenko commented 12 hours ago

UPDATE: if I exclude the settings (json) field, everything works.
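
A hedged reading of why the json column matters: MySQL reports json over ODBC as a very long, effectively unbounded text type, so odbc2parquet has to reserve a large transfer buffer per row for it. That shrinks each fetched batch to a handful of rows, and with one row group written per batch the row group count explodes. Back-of-the-envelope sketch with assumed numbers, not actual odbc2parquet internals:

    fn main() {
        // --batch-size-memory 3060Mib from the invocation above.
        let budget: u64 = 3060 * 1024 * 1024;
        // The verbose log below shows only 15 rows per fetched batch, which
        // implies roughly this much buffer reserved per row, dominated by
        // the unbounded JSON column:
        let approx_bytes_per_row = budget / 15;
        println!("~{} MiB per row", approx_bytes_per_row / (1024 * 1024)); // ~204 MiB
    }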

pacman82 commented 5 hours ago

Hello @andreypanchenko ,

Sorry you encountered a bug. You could help me by logging with verbose output (add -vvv to the command line) and sharing the results with me.

Thanks

andreypanchenko commented 2 hours ago
2024-10-16T08:27:57+00:00 - INFO Fetched batch 32768 with 15 rows.
2024-10-16T08:27:57+00:00 - INFO Fetched 491520 rows in total.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 0 and name 'id'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 1 and name 'shopping_trip_id'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 2 and name 'user_id'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 3 and name 'transition_id'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 4 and name 'shop_id'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 5 and name 'event_type'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 6 and name 'extension_version'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 7 and name 'activation_status'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 8 and name 'activation_state'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 9 and name 'data'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 10 and name 'created_at'.
2024-10-16T08:27:57+00:00 - INFO Fetched batch 32769 with 15 rows.
2024-10-16T08:27:57+00:00 - INFO Fetched 491535 rows in total.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 0 and name 'id'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 1 and name 'shopping_trip_id'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 2 and name 'user_id'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 3 and name 'transition_id'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 4 and name 'shop_id'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 5 and name 'event_type'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 6 and name 'extension_version'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 7 and name 'activation_status'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 8 and name 'activation_state'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 9 and name 'data'.
2024-10-16T08:27:57+00:00 - DEBUG Writing column with index 10 and name 'created_at'.
thread 'main' panicked at /home/redacted/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-53.0.0/src/file/writer.rs:369:10:
Negative row group ordinal: TryFromIntError(())
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: parquet::file::writer::write_bloom_filters
   4: core::ops::function::FnOnce::call_once{{vtable.shim}}
   5: parquet::file::writer::SerializedRowGroupWriter<W>::close
   6: odbc2parquet::query::current_file::CurrentFile::write_row_group
   7: <odbc2parquet::query::parquet_writer::FileWriter as odbc2parquet::query::parquet_writer::ParquetOutput>::write_row_group
   8: odbc2parquet::query::table_strategy::TableStrategy::block_cursor_to_parquet
   9: odbc2parquet::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
/home/redacted/work/odbc_poc/eapi_gpt_id_file.sh: line 33: 1046689 Aborted                 (core dumped) odbc2parquet -vvv query --dsn "$dsn" --batch-size-memory 4072Mib --file-size-threshold 150Mib "${table}_batch_${start_id}.parquet" "$sql_query"
2024-10-16T08:27:58+00:00 - DEBUG ODBC Environment created.
2024-10-16T08:27:58+00:00 - DEBUG SQLAllocHandle allocated connection (Dbc) handle '0x5cefe635dd60'
2024-10-16T08:27:58+00:00 - INFO Database Management System Name: MySQL
/home/redacted/work/odbc_poc/eapi_gpt_id_file.sh: line 33: 1046680 Aborted                 (core dumped) odbc2parquet -vvv query --dsn "$dsn" --batch-size-memory 4072Mib --file-size-threshold 150Mib "${table}_batch_${start_id}.parquet" "$sql_query"
2024-10-16T08:27:59+00:00 - DEBUG ODBC Environment created.
2024-10-16T08:27:59+00:00 - DEBUG SQLAllocHandle allocated connection (Dbc) handle '0x56c72d37fd60'
2024-10-16T08:27:59+00:00 - INFO Database Management System Name: MySQL
/home/redacted/work/odbc_poc/eapi_gpt_id_file.sh: line 33: 1046676 Aborted                 (core dumped) odbc2parquet -vvv query --dsn "$dsn" --batch-size-memory 4072Mib --file-size-threshold 150Mib "${table}_batch_${start_id}.parquet" "$sql_query"
2024-10-16T08:28:00+00:00 - DEBUG ODBC Environment created.
2024-10-16T08:28:00+00:00 - DEBUG SQLAllocHandle allocated connection (Dbc) handle '0x59bab2445d60'
2024-10-16T08:28:00+00:00 - INFO Database Management System Name: MySQL
/home/redacted/work/odbc_poc/eapi_gpt_id_file.sh: line 33: 1046691 Aborted                 (core dumped) odbc2parquet -vvv query --dsn "$dsn" --batch-size-memory 4072Mib --file-size-threshold 150Mib "${table}_batch_${start_id}.parquet" "$sql_query"
2024-10-16T08:28:01+00:00 - DEBUG ODBC Environment created.
2024-10-16T08:28:01+00:00 - DEBUG SQLAllocHandle allocated connection (Dbc) handle '0x6419e610dd60'
2024-10-16T08:28:01+00:00 - INFO Database Management System Name: MySQL
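
Worth noting: the verbose log panics right after batch 32,769. If each fetched batch becomes one row group (as the per-batch "Writing column" lines suggest), the failing row group has zero-based index 32,768, one past i16::MAX. A quick standalone check of that arithmetic (an inference from the log, not project code):

    fn main() {
        let failing_batch = 32_769u32;   // last batch logged before the panic
        let ordinal = failing_batch - 1; // zero-based row group index
        assert_eq!(ordinal, i16::MAX as u32 + 1);
        assert!(i16::try_from(ordinal).is_err()); // the overflow behind the panic
    }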
andreypanchenko commented 2 hours ago

Short observation: if I decrease the number of rows selected to 250K, everything works; starting from 500K it starts failing.
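
That threshold is consistent with the row group ceiling sketched above: at the 15 rows per batch seen in the verbose log, 250K rows stays under 32,767 row groups while 500K rows exceeds it. A quick check (standalone Rust, assuming one row group per 15-row batch):

    fn main() {
        let rows_per_batch = 15u32;                    // from the verbose log
        let ok = 250_000u32.div_ceil(rows_per_batch);  // 16_667 row groups
        let bad = 500_000u32.div_ceil(rows_per_batch); // 33_334 row groups
        assert!(ok <= i16::MAX as u32);
        assert!(bad > i16::MAX as u32); // exceeds the i16 ordinal range
        println!("250K -> {ok} row groups, 500K -> {bad} row groups");
    }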