stoneatom / stonedb

StoneDB is an Open-Source MySQL HTAP and MySQL-Native DataBase for OLTP, Real-Time Analytics, a counterpart of MySQLHeatWave. (https://stonedb.io)
GNU General Public License v2.0

bug: ERROR 6 (HY000): An unknown system exception error caught. #1621

Open haitaoguan opened 1 year ago

haitaoguan commented 1 year ago

Have you read the Contributing Guidelines on issues?

Please confirm that this bug report does NOT already exist.

Describe the problem

###session1
mysql> create table ttt(id int,name varchar(5));
Query OK, 0 rows affected (0.02 sec)

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into ttt values(1,'AAA');
Query OK, 1 row affected (0.01 sec)

mysql> insert into ttt values(2,'BBB');
Query OK, 1 row affected (0.00 sec)

mysql> select * from ttt;
+------+------+
| id   | name |
+------+------+
|    1 | AAA  |
|    2 | BBB  |
+------+------+
2 rows in set (0.00 sec)

###session2
[root@test ~]# ps -ef|grep mysqld
mysql    19392  2626 22 15:02 ?        00:01:57 /opt/stonedb57/install//bin/mysqld --basedir=/opt/stonedb57/install
[root@test ~]# kill -9 19392

###session3
mysql> select * from ttt;
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> select * from ttt;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    12
Current database: db

ERROR 6 (HY000): An unknown system exception error caught.

Expected behavior

No response

How To Reproduce

No response

Environment

./mysqld Ver 5.7.36-StoneDB-v1.0.3 for Linux on x86_64 (build-) build information as follow: Repository address: https://github.com/stoneatom/stonedb.git:stonedb-5.7-dev Branch name: stonedb-5.7-dev Last commit ID: 31919be Last commit time: Date: Thu Apr 20 10:19:54 2023 +0800 Build time: Date: Sun Apr 23 12:03:01 CST 2023

Are you interested in submitting a PR to solve the problem?

haitaoguan commented 1 year ago

The tianmu engine does not support transactions. If an instance crashes, either commit or rollback should occur instead of reporting an error.

RingsC commented 1 year ago

The error messages reported in session 3 are listed below:

mysql> select * from ttt;
No connection. Trying to reconnect...
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/home/lihao/workshop/bin_ver1/tmp/mysql.sock' (111)
ERROR: 
Can't connect to the server

mysql> 

Then, after restarting the instance and re-running the query in session 2, we get the following message:

mysql> select * from ttt;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    2
Current database: test

ERROR 6 (HY000): An unknown system exception error caught.
mysql> select * from ttt;

RingsC commented 1 year ago

The call stack is listed below.

#0  Tianmu::system::TianmuFile::OpenReadOnly (this=0x7f2da6db1f10, file="./test/ttt.tianmu/columns/0/DATA") at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/system/file.cpp:67
#1  0x000055b3f133e02a in Tianmu::core::PackInt::PackInt (this=0x7f2a94925030, dpn=0x7f2a62008fa8, pc=..., s=0x7f2a94920d00) at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/data/pack_int.cpp:37
#2  0x000055b3f1141e49 in __gnu_cxx::new_allocator<Tianmu::core::PackInt>::construct<Tianmu::core::PackInt<Tianmu::core::DPN*&, Tianmu::core::ObjectId<(Tianmu::core::COORD_TYPE)0, 3, Tianmu::core::object_id_helper::empty> const&, Tianmu::core::ColumnShare*&> > (this=0x7f2da6db20ff, __p=0x7f2a94925030) at /usr/include/c++/9/ext/new_allocator.h:146
#3  0x000055b3f1140a86 in std::allocator_traits<std::allocator<Tianmu::core::PackInt> >::construct<Tianmu::core::PackInt<Tianmu::core::DPN*&, Tianmu::core::ObjectId<(Tianmu::core::COORD_TYPE)0, 3, Tianmu::core::object_id_helper::empty> const&, Tianmu::core::ColumnShare*&> > (__a=..., __p=0x7f2a94925030) at /usr/include/c++/9/bits/alloc_traits.h:483
#4  0x000055b3f113e329 in std::_Sp_counted_ptr_inplace<Tianmu::core::PackInt, std::allocator<Tianmu::core::PackInt>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<Tianmu::core::DPN*&, Tianmu::core::ObjectId<(Tianmu::core::COORD_TYPE)0, 3, Tianmu::core::object_id_helper::empty> const&, Tianmu::core::ColumnShare*&> (this=0x7f2a94925020, __a=...) at /usr/include/c++/9/bits/shared_ptr_base.h:548
#5  0x000055b3f113ad72 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<Tianmu::core::PackInt, std::allocator<Tianmu::core::PackInt>, Tianmu::core::DPN*&, Tianmu::core::ObjectId<(Tianmu::core::COORD_TYPE)0, 3, Tianmu::core::object_id_helper::empty> const&, Tianmu::core::ColumnShare*&> (this=0x7f2da6db2318, __p=@0x7f2da6db2310: 0x0, __a=...)
    at /usr/include/c++/9/bits/shared_ptr_base.h:679
#6  0x000055b3f11375da in std::__shared_ptr<Tianmu::core::PackInt, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<Tianmu::core::PackInt>, Tianmu::core::DPN*&, Tianmu::core::ObjectId<(Tianmu::core::COORD_TYPE)0, 3, Tianmu::core::object_id_helper::empty> const&, Tianmu::core::ColumnShare*&> (this=0x7f2da6db2310, __tag=...) at /usr/include/c++/9/bits/shared_ptr_base.h:1344
#7  0x000055b3f1134fa5 in std::shared_ptr<Tianmu::core::PackInt>::shared_ptr<std::allocator<Tianmu::core::PackInt>, Tianmu::core::DPN*&, Tianmu::core::ObjectId<(Tianmu::core::COORD_TYPE)0, 3, Tianmu::core::object_id_helper::empty> const&, Tianmu::core::ColumnShare*&> (this=0x7f2da6db2310, __tag=...) at /usr/include/c++/9/bits/shared_ptr.h:359
#8  0x000055b3f11329f9 in std::allocate_shared<Tianmu::core::PackInt, std::allocator<Tianmu::core::PackInt>, Tianmu::core::DPN*&, Tianmu::core::ObjectId<(Tianmu
#9  0x0000555fabed8e82 in Tianmu::core::TianmuAttr::Fetch (this=0x7fe73401ee00, pc=...) at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/vc/tianmu_attr.cpp:907
#10 0x0000555fabee5cb7 in Tianmu::core::DataCache::GetOrFetchObject<Tianmu::core::Pack, Tianmu::core::ObjectId<(Tianmu::core::COORD_TYPE)0, 3, Tianmu::core::object_id_helper::empty>, Tianmu::core::TianmuAttr> (this=0x555fb1209e20, coord_=..., fetcher_=0x7fe73401ee00) at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/core/data_cache.h:234
#11 0x0000555fabed8247 in Tianmu::core::TianmuAttr::LockPackForUse (this=0x7fe73401ee00, pn=0) at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/vc/tianmu_attr.cpp:824
#12 0x0000555fabdce56e in Tianmu::core::TianmuTable::LockPackForUse (this=0x7fe73401e150, attr=0, pack_no=0) at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/core/tianmu_table.cpp:512
#13 0x0000555fac0ed242 in Tianmu::core::VCPackGuardian::LockPackrowOnLockOneByThread (this=0x7fe734020588, mit=...) at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/data/pack_guardian.cpp:121
#14 0x0000555fac0ecda6 in Tianmu::core::VCPackGuardian::LockPackrow (this=0x7fe734020588, mit=...) at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/data/pack_guardian.cpp:63
#15 0x0000555fabd5ba08 in Tianmu::vcolumn::VirtualColumn::LockSourcePacks (this=0x7fe7340204b0, mit=...) at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/vc/virtual_column.h:45
#16 0x0000555fac1a0845 in Tianmu::core::ParameterizedFilter::FilterDeletedByTable (this=0x7fe7348fb7c0, rcTable=0x7fe73401e150, no_dims=@0x7fea11ccda50: 0, tableIndex=0)
    at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/core/parameterized_filter.cpp:1710
#17 0x0000555fac1a0ac5 in Tianmu::core::ParameterizedFilter::FilterDeletedForSelectAll (this=0x7fe7348fb7c0) at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/core/parameterized_filter.cpp:1737
#18 0x0000555fac19d4ff in Tianmu::core::ParameterizedFilter::UpdateMultiIndex (this=0x7fe7348fb7c0, count_only=false, limit=-1)
    at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/core/parameterized_filter.cpp:1121
#19 0x0000555fabd50d05 in Tianmu::core::Query::Preexecute (this=0x7fea11cce700, qu=..., sender=0x7fe734020420, display_now=true) at /home/lihao/workshop/stonedb-ver-1/storage/tianmu/core/query.cpp:797
(gdb) p file 
$12 = "./test/ttt.tianmu/columns/0/DATA"

However, listing the table's directory under the data directory shows that the file ./test/ttt.tianmu/columns/0/DATA does not exist on my server.

xxx@ubuntu:~/workshop/bin_ver1/data/test/ttt.tianmu/columns/0$ ls
DN  filters  META  v

Therefore, in this function, the file cannot be opened, and an exception is thrown.

int TianmuFile::OpenReadOnly(std::string const &file) {
  return Open(file, O_RDONLY | O_LARGEFILE | O_BINARY, tianmu_umask);
}

The root cause: after the instance was killed in the middle of the transaction started by begin, the table's data directory was deleted from disk, but the meta information about the table was not removed. The result is an inconsistency between the data and its metadata.
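The inconsistency described above could be detected before a query runs into it. The following is a minimal C++17 sketch, not StoneDB code: `ColumnsMissingData` is a hypothetical helper, and it assumes the on-disk layout seen in this thread (`<table>.tianmu/columns/<n>/` containing `META` and, once a pack has been saved, `DATA`). It also assumes that a `META` file without a `DATA` file is suspicious, which may not hold for columns that legitimately have nothing saved yet.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: returns the names of the column directories under
// <table_dir>/columns that contain a META file but no regular DATA file.
std::vector<std::string> ColumnsMissingData(const fs::path &table_dir) {
  assert(!table_dir.empty());
  std::vector<std::string> missing;
  const fs::path columns = table_dir / "columns";
  if (!fs::is_directory(columns)) {
    return missing;  // no columns directory, nothing to inspect
  }
  for (const auto &entry : fs::directory_iterator(columns)) {
    if (!entry.is_directory()) {
      continue;
    }
    const bool has_meta = fs::exists(entry.path() / "META");
    const bool has_data = fs::is_regular_file(entry.path() / "DATA");
    if (has_meta && !has_data) {
      missing.push_back(entry.path().filename().string());
    }
  }
  // directory_iterator order is unspecified, so sort for stable output.
  std::sort(missing.begin(), missing.end());
  return missing;
}
```

Called on the directory from the listing in this thread, for example `ColumnsMissingData("./test/ttt.tianmu")`, such a check would be expected to report column `0`, because that directory holds `DN`, `filters`, `META`, and `v` but no `DATA`.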

RingsC commented 1 year ago

First, we catch the exception and report a detailed error message. In the next stage, we fix the inconsistency between the data and its metadata.
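As an illustration of the first stage, here is a minimal sketch of an open call that turns a failure into an exception carrying the errno value, its text, and the file path. `OpenReadOnlyOrThrow` is a hypothetical name and this is not the Tianmu implementation; it only assumes a POSIX system and omits the `O_LARGEFILE`, `O_BINARY`, and `tianmu_umask` arguments used by the real `TianmuFile::OpenReadOnly`.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: opens a file read-only and returns its descriptor.
// On failure it throws an exception whose message includes the error code,
// the system's description of that code, and the path that was requested.
int OpenReadOnlyOrThrow(const std::string &file) {
  assert(!file.empty());
  const int fd = ::open(file.c_str(), O_RDONLY);
  if (fd < 0) {
    const int err = errno;  // save it before another call can overwrite it
    throw std::runtime_error("ErrorCode: " + std::to_string(err) + " - " +
                             std::strerror(err) + "[" + file + "]");
  }
  return fd;
}
```

A caller would wrap `OpenReadOnlyOrThrow("./test/ttt.tianmu/columns/0/DATA")` in a `try` block, pass `e.what()` on to the client, and close the returned descriptor with `::close` on success.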

mysql> select * from ttt;
ERROR 1 (HY000): An Tianmu Error system exception error caught. ErrorCode: 2 - No such file or directory[./test/ttt.tianmu/columns/0/DATA]
mysql> 

RingsC commented 1 year ago

There are several places where the data directory, which holds the saved data, is created.

In the insertion phase, there are several types of LoadSource, which determine whether or not the data is saved to a file immediately. For example:

TianmuAttr::LoadData () {
  // xxx (earlier part of the function elided)
  DPN &dpn = get_dpn(pi);
  if (current_txn_->LoadSource() == common::LoadSource::LS_File || dpn.numOfRecords == (1U << pss)) {
    Pack *pack = get_pack(pi);
    if (!dpn.Trivial()) {
      pack->Save();  // the call of interest: writes the pack to the DATA file
    }

    if (pack) {
      pack->Unlock();
    }
    core::Engine *eng = reinterpret_cast<core::Engine *>(tianmu_hton->data);
    assert(eng);

    eng->cache.DropObject(get_pc(pi));
    dpn.SetRefCount(0);
  }
}

pack->Save() saves the data into my_path/DATA. If the data is kept in memory instead of being written to the file, the corresponding DATA file is never created, so it does not exist after the instance is killed. Therefore, a "file not found" exception is thrown.
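The condition that decides whether a pack is written to DATA can be isolated as a small predicate. This is only a sketch under assumptions: `LoadSource` here is a hypothetical stand-in for `common::LoadSource` that models just the two values named in this thread, `ShouldSavePack` is an invented name, and the additional `!dpn.Trivial()` check from the excerpt is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for common::LoadSource; the real enum may have
// more values than the two mentioned in this thread.
enum class LoadSource { LS_File, LS_Direct };

// Mirrors the quoted condition: the pack is saved when the load comes from
// a file, or when the pack holds exactly 2^pss records (that is, it is full).
bool ShouldSavePack(LoadSource source, std::uint32_t num_records, std::uint32_t pss) {
  assert(pss < 32);  // 1U << pss is undefined for a shift of 32 or more
  return source == LoadSource::LS_File || num_records == (1U << pss);
}
```

With two inserted rows and a direct load, `ShouldSavePack(LoadSource::LS_Direct, 2, 16)` is false, which matches the observation that no DATA file exists for the two-row table after the instance is killed. The value 16 for `pss` is only an example.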

RingsC commented 1 year ago

This issue occurs only with tianmu_insert_delayed=0. With this configuration, the data is inserted directly into memory and is not saved to the DATA file.

int Engine::InsertRow(const std::string &table_path, [[maybe_unused]] Transaction *trans_, TABLE *table,
                      std::shared_ptr<TableShare> &share) {
  int ret = 0;
  try {
    if (tianmu_sysvar_insert_delayed && table->s->tmp_table == NO_TMP_TABLE) {
      if (tianmu_sysvar_enable_rowstore) {
        ret = InsertToDelta(table_path, share, table);
      } else {
        InsertDelayed(table_path, table);
      }
      tianmu_stat.delta_insert++;
    } else {
      current_txn_->SetLoadSource(common::LoadSource::LS_Direct);  // insert directly when tianmu_insert_delayed=0
      auto rct = current_txn_->GetTableByPath(table_path);
      ret = rct->Insert(table);
    }
    return ret;

This sets the load source to direct. In this mode, TianmuAttr::LoadData does not call pack->Save() to save the data to disk. Therefore, after the instance was killed, all the data was lost, and the DATA file had not been created at that moment.

xxx@ubuntu:~/workshop/bin_ver1/data/test/ttt.tianmu/columns/0$ ls
DN  filters  META  v

This directory does not contain any file named DATA.

RingsC commented 1 year ago

Transactions do not behave as expected.

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql>  insert into ttt values(1,'AAA');
Query OK, 1 row affected (0.01 sec)

mysql> insert into ttt values(2,'BBB');
Query OK, 1 row affected (0.00 sec)

mysql> select * from ttt;
+------+------+
| id   | name |
+------+------+
|    1 | AAA  |
|    2 | BBB  |
+------+------+
2 rows in set (0.00 sec)

mysql> rollback;
Query OK, 0 rows affected (0.00 sec)

mysql> quit
Bye
xxx@ubuntu:~/workshop/bin_ver1$ ./bin/mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.7.36-StoneDB-v1.0.3.2579bd4aa build-

Copyright (c) 2021, 2022 StoneAtom Group Holding Limited
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> use test; 
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select * from ttt; 
+------+------+
| id   | name |
+------+------+
|    1 | AAA  |
|    2 | BBB  |
+------+------+
2 rows in set (0.00 sec)

As the output above shows, the data still exists after the rollback command has been executed.

RingsC commented 1 year ago

Root cause: if tianmu_insert_delayed=0 is set, the data is not written to DATA but only to memory. That behavior is obsolete; now all data is written to the row store.

Now, the solution: Write all the data to DATA immediately.

After that change, it behaves as shown below.

mysql> use test;
No connection. Trying to reconnect...
Connection id:    2
Current database: *** NONE ***

Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select * from ttt; 
+------+------+
| id   | name |
+------+------+
|    1 | AAA  |
|    2 | BBB  |
|    2 | BBB  |
+------+------+
3 rows in set (0.00 sec)

We cannot guarantee that the write is atomic. If writing a large amount of data into DATA and syncing it to disk fails partway, it may lead to data inconsistency.
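For context on the atomicity concern, one common general technique (not something this thread says StoneDB uses) is to write to a temporary file, sync it, and then rename it over the target. The sketch below assumes a POSIX system; `AtomicWriteFile`, `FailAndCleanUp`, and the `.tmp` suffix are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <string>
#include <system_error>

#include <fcntl.h>
#include <unistd.h>

namespace {
// Closes fd if it is open, removes the temp file, and throws with the errno.
[[noreturn]] void FailAndCleanUp(int fd, const std::string &tmp, int err, const char *what) {
  if (fd >= 0) ::close(fd);
  ::unlink(tmp.c_str());
  throw std::system_error(err, std::generic_category(), what);
}
}  // namespace

// Writes bytes to path so that a reader sees either the previous file or the
// complete new one, relying on rename() being atomic within one filesystem.
void AtomicWriteFile(const std::string &path, const std::string &bytes) {
  assert(!path.empty());
  const std::string tmp = path + ".tmp";  // hypothetical temp-file naming
  const int fd = ::open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) FailAndCleanUp(-1, tmp, errno, "open");

  std::size_t done = 0;
  while (done < bytes.size()) {
    const ssize_t n = ::write(fd, bytes.data() + done, bytes.size() - done);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted before writing: retry
      FailAndCleanUp(fd, tmp, errno, "write");
    }
    done += static_cast<std::size_t>(n);
  }
  // Flush the file contents to disk before making the file visible.
  if (::fsync(fd) != 0) FailAndCleanUp(fd, tmp, errno, "fsync");
  if (::close(fd) != 0) FailAndCleanUp(-1, tmp, errno, "close");
  if (std::rename(tmp.c_str(), path.c_str()) != 0) FailAndCleanUp(-1, tmp, errno, "rename");
}
```

If the process is killed before the rename, the target file is untouched and only a leftover `.tmp` file needs cleaning up. The sketch does not sync the containing directory, which a durable implementation would also need to do, and it does not address keeping DATA consistent with the metadata files.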

RingsC commented 1 year ago

PR #1841 fixes this unexpected exception, but it causes the unexpected disk space usage reported in #1845. Weighing the priorities of the two issues, we revert PR #1841 first.