speedb-io / speedb

A RocksDB compliant high performance scalable embedded key-value store
https://www.speedb.io/
Apache License 2.0
921 stars 72 forks source link

Getting an iterator error on db_stress #802

Closed ofriedma closed 10 months ago

ofriedma commented 11 months ago

How to reproduce:

mkdir /tmp/rocksdb_crashtest_expectedb1cqsg8 run this command: ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_wbm_stalls=0 --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_bits=5.680994457101304 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_ttl=0 --compare_full_db_state_snapshot=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=lz4 --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --crash_test=1 --customopspercent=0 --data_block_index_type=0 --db=/tmp/rocksdb_crashtest_whiteboxxvsyhqno --db_write_buffer_size=0 --delpercent=17 --delrangepercent=5 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_speedb_features=1 --enable_thread_tracking=1 --expected_values_dir=/tmp/rocksdb_crashtest_expectedb1cqsg8 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=100000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=5 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --initiate_wbm_flushes=1 --iterpercent=17 --key_len_percent_dist=100 --kill_random_test=888887 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --long_running_snapshots=0 --manual_wal_flush_one_in=1000000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_background_jobs=4 --max_bytes_for_level_base=67108864 --max_key=102400 --max_key_len=1 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=5 --num_file_reads_for_auto_readahead=0 --num_iterations=60 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --pinning_policy=scoped --prefix_size=1 --prefixpercent=6 --prepopulate_block_cache=0 --preserve_internal_time_seconds=36000 --progress_reports=0 --read_fault_one_in=32 --readahead_size=524288 --readpercent=20 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --seed=2468986424 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --start_delay_percent=70 --stats_dump_period_sec=10 --subcompactions=2 --sync=0 --sync_fault_injection=0 --sync_wal_one_in=100000 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --total_ram_size=1073741824 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_dynamic_delay=1 --use_full_merge_v1=False --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_before_write=False --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35

run this command:

./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_wbm_stalls=0 --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_bits=2 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_ttl=0 --compare_full_db_state_snapshot=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=2097151 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --crash_test=1 --customopspercent=0 --data_block_index_type=0 --db=/tmp/rocksdb_crashtest_whiteboxxvsyhqno --db_write_buffer_size=1048576 --delpercent=28 --delrangepercent=4 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_speedb_features=1 --enable_thread_tracking=0 --expected_values_dir=/tmp/rocksdb_crashtest_expectedb1cqsg8 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=3 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --initiate_wbm_flushes=0 --iterpercent=3 --key_len_percent_dist=100 --kill_exclude_prefixes=WritableFileWriter::Append,WritableFileWriter::WriteBuffered --kill_random_test=88889 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_background_jobs=8 --max_bytes_for_level_base=67108864 --max_key=102400 --max_key_len=1 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=5 --num_file_reads_for_auto_readahead=2 --num_iterations=19 --open_files=500000 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --pinning_policy=default --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=32 --readahead_size=524288 --readpercent=30 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --seed=2468986424 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --start_delay_percent=0 --stats_dump_period_sec=600 --subcompactions=3 --sync=1 --sync_fault_injection=0 --sync_wal_one_in=100000 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --total_ram_size=536870912 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_dynamic_delay=1 --use_full_merge_v1=False --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_before_write=False --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35

You should get the error similar to this:

2023-12-28T14:37:27.4899757Z Expected state has key 0000000000005982, iterator is at key 0000000000005966 2023-12-28T14:37:27.4901103Z Column family: default, op_logs: S 000000000000597D N SFP 000000000000598F P S 0000000000005986 NPNPP 2023-12-28T14:37:27.4902028Z Verification failed :(

hilikspdb commented 11 months ago

1) run with the flag kill_random_test=0 this will prevent db_stress from killing the test and it will failed constantly 2) i used ./ldb scan --from="0x0000000000001080" --to="0x0000000000001100" --db=/tmp/rocksdb_crashtest_whiteboxxvsyhqno/ --hex --no_value to get a full list of all the keys in the range ... the expected key is in the database

hilikspdb commented 11 months ago

the relevant dump is: "0x0000000000001089 0x000000000000108A 0x000000000000108B 0x000000000000108C 0x000000000000108D 0x000000000000108F 0x0000000000001092 0x0000000000001093 0x0000000000001094 0x0000000000001097 0x0000000000001098 0x00000000000010A3 0x00000000000010A4 0x00000000000010A6"

based on the database dump, SFP 00000000000010A0 should return 0000000000001098 and now "NNPPPNNPPPP" (N=next P = Prev) should return 0000000000001093 (4 next and 7 prev = 3 prev ). need to check my logic and if items were added after the iterator started. @Yuval-Ariel where is the patch that enable output of a trace file ...

hilikspdb commented 11 months ago

Opening the database again pass the verification phase. hence the problem is probably on deciding which entries are part of the iterator.

Yuval-Ariel commented 11 months ago

@hilikspdb 6ba24411c5e530b4f9346f7fd913aad9e972efc3 its on by default and the trace files are stored in the expected_values_dir

hilikspdb commented 11 months ago

thanks , anyway the second test fail even when it starts on an empty database .

hilikspdb commented 10 months ago

I have simplified the test: rm -rf /tmp/rocksdb/; ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/tmp/rocksdb_crashtest_whiteboxxvsyhqno --expected_values_dir=/tmp/rocksdb_crashtest_expectedb1cqsg8 --destroy_db_initially=1 --key_len_percent_dist=100 --kill_random_test=0 --max_key=102400 --max_key_len=1 --memtablerep=skip_list --seed=2468986424 --verify_iterator_with_expected_state_one_in=5 --write_buffer_size=33554432 --iterpercent=3 --readpercent=30 --writepercent=35 --delpercent=28 --delrangepercent=4 --prefixpercent=0

this test pass ok with memtable=skip-list and in 2.6 while crushes in main with memtablerep=hash_spdb ... therefore we left we two options : bug in translation or a change in the behaviour of db_stress

hilikspdb commented 10 months ago

here is even simpler version: rm -rf /tmp/rocksdb/; ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/tmp/rocksdb_crashtest_whiteboxxvsyhqno --expected_values_dir=/tmp/rocksdb_crashtest_expectedb1cqsg8 --destroy_db_initially=1 --key_len_percent_dist=100 --kill_random_test=0 --max_key=102400 --max_key_len=1 --memtablerep=hash_spdb --enable_speedb_features=false --seed=2468986424 --verify_iterator_with_expected_state_one_in=5 --write_buffer_size=33554432 --iterpercent=3 --readpercent=30 --writepercent=67 --delpercent=0 --delrangepercent=0 --prefixpercent=0

ofriedma commented 10 months ago

This commit : 6213b6342948f18e7a17201e858bdcf9cccf1490 seems to introduce the issue

udi-speedb commented 10 months ago

This commit : 6213b63 seems to introduce the issue

@ofriedma - that is a commit that was made before version 2.3.0?

ayulas commented 10 months ago

this commit looks a force push commit that was done to the plugin which was remove later on. any way it looks that changes were done to my commit which i didnt do and forced push. also there is an issue with db stress and enable speedb features as i mentioned in the past since the shared options doesnt support hardening. currently there is a wrong way how the db stress handles it. i will attach a branch that handles both cases and you can proceed from there

ofriedma commented 10 months ago

For 2.7 the error does not appear, so it has been revealed somewhere between 2.7 and main.

udi-speedb commented 10 months ago

@ofriedma - Are you referring to the rebased branch or to the official 2.7? And please clarify: You have said that you have identified the commit that is causing this, but this commit was made before version 2.3. I am confused.

ofriedma commented 10 months ago

official 2.7, It appears that it happened probably after the rebase, before v2.7 no errors(and the commit was there).

erez-speedb commented 10 months ago

@ayulas fixs are at branch /ayula/fix_bug @ofriedma @hilikspdb pls review them

ofriedma commented 10 months ago

@erez-speedb This issue is not related to db_stress enable_speedb_features becuase we can reproduce it without this feature entirely.

hilikspdb commented 10 months ago

Here is a quick description of the bug and what can be done:

  1. fix db_bench so new iterator will also use lock of the region that it is doing seek on
  2. mimic skip-list behaviour add the last vector in the time of seek and use the sequence number to decide if data is in or not.
udi-speedb commented 10 months ago
  1. I suppose locking the region in db_stress defeats RocsDB PR's purpose of testing simultaneous reads and writes?
  2. Shouldn't db_stress have some mechanism of snapshotting the expected data at the time they are creating the iterator? Or, alternatively, mark data that is written as you describe above as "may be returned or may not be returned, both are acceptable".
  3. Does our hash_spdb support tailing iteration?

On Thu, Jan 18, 2024 at 11:05 AM hilikspdb @.***> wrote:

Here is a quick description of the bug and what can be done:

  • write starts at time t and finishes at t+n
  • iterators starts at time t+k where k < n
  • seek is done at time t+x
  • db_stress internal locking mechanism prevents the SEEK to start before the end of the write.
  • The skip list iterator uses the mutable copy and the iterator sequence number (so data that were added to the memtable after the new iterator is in)
  • vector and hash-spdb are using an immutable copy (so data is not added) NOTE: this is not a bug in the memtable the definition of what to do with data in flight is not clear . There are two options to fix it:

    1. fix db_bench so new iterator will also use lock of the region that it is doing seek on
    2. mimic skip-list behaviour add the last vector in the time of seek and use the sequence number to decide if data is in or not.

— Reply to this email directly, view it on GitHub https://github.com/speedb-io/speedb/issues/802#issuecomment-1898068645, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZKU2DG7H3LWDLBSI3T3L73YPDQUDAVCNFSM6AAAAABBM5KDQ6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJYGA3DQNRUGU . You are receiving this because you commented.Message ID: @.***>

hilikspdb commented 10 months ago
  1. As i said data that is written while the iterator is created should be in "unkown" state. The issue is that the lock is done during the first seek and not in the creation of the iterator (the lock is done on a range of data). this is a conceptual issue with the db_stress and the best place to fix it is there .
  2. Yes, this is the place to fix
  3. As far as i know there is no support for tailing iterators ..
ayulas commented 10 months ago

the issue is a result of the Refresh iterator that by definition is not suppose to be supported in the immutable iterator. (hash_spdb, vectors memtablerep) i added the supoorted flag to the memtablerep and if it is false we avoid doing refresh in no_batched_ops_stress test @hilikspdb @ofriedma pls review ayula/802-refresh-iter

ayulas commented 10 months ago

@ofriedma as we discussed i expose the allow refresh pls review cc249df6f20ef48b3aaae6d668e9d8fbb5a7c424