stoneatom / stonedb

StoneDB is an open-source, MySQL-native HTAP database for OLTP and real-time analytics, a counterpart of MySQL HeatWave. (https://stonedb.io)
GNU General Public License v2.0

bug: Handling an Unexpected Halt of a Replica #1611

Open adofsauron opened 1 year ago

adofsauron commented 1 year ago

Reference:

https://dev.mysql.com/doc/refman/8.0/en/replication-features-transaction-inconsistencies.html

adofsauron commented 1 year ago

ACK

adofsauron commented 1 year ago

Crash recovery generally relies on redundant logging performed while transactions are processed, typically in the form of undo logs and redo logs.


---
LOG
---
Log sequence number 0 52087
Log flushed up to   0 52087
Last checkpoint at  0 52087
0 pending log writes, 0 pending chkp writes
20 log i/o's done, 0.00 log i/o's/second
adofsauron commented 1 year ago

Applies the hashed log records to the page, if the page lsn is less than the lsn of a log record. This can be called when a buffer page has just been read in, or also for a page already in the buffer pool.
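
To make the LSN comparison concrete, here is a minimal sketch of the rule described above. `LogRecord`, `Page`, and `apply_records` are hypothetical, simplified stand-ins rather than InnoDB types; the sketch only tracks LSNs and does not modify page contents.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for InnoDB's recv_t and page_t.
struct LogRecord {
  std::uint64_t start_lsn;  // LSN at which the record begins
  std::uint64_t len;        // length of the record body in bytes
};

struct Page {
  std::uint64_t lsn;  // newest modification already present on the page
};

// Applies every record that is not older than the page as it was read in.
// Returns the number of records applied.
int apply_records(Page &page, const std::vector<LogRecord> &records) {
  const std::uint64_t page_lsn = page.lsn;  // LSN before recovery touched the page
  int applied = 0;
  for (const LogRecord &rec : records) {
    if (rec.start_lsn >= page_lsn) {
      // A real implementation would replay the record body here.
      page.lsn = rec.start_lsn + rec.len;  // stamp the page with the new LSN
      ++applied;
    }
  }
  assert(page.lsn >= page_lsn);  // recovery never moves a page backwards
  return applied;
}
```

Records whose start LSN is below the page LSN are skipped because the page on disk already contains their effect.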

adofsauron commented 1 year ago

Recovering from the point of interruption (the last checkpoint):


ibool recv_read_cp_info_for_backup(
    /*=========================*/
    /* out: TRUE if success */
    byte *hdr,        /* in: buffer containing the log group header */
    dulint *lsn,      /* out: checkpoint lsn */
    ulint *offset,    /* out: checkpoint offset in the log group */
    ulint *fsp_limit, /* out: fsp limit of space 0, 1000000000 if the
                    database is running with < version 3.23.50 of InnoDB */
    dulint *cp_no,    /* out: checkpoint number */
    dulint *first_header_lsn);
adofsauron commented 1 year ago

Recovers from a checkpoint. When this function returns, the database is able to start processing of new user transactions, but the function recv_recovery_from_checkpoint_finish should be called later to complete the recovery and free the resources used in it.

adofsauron commented 1 year ago

The recovery process by which a replica recovers from an unexpected halt varies depending on the configuration of the replica. The details of the recovery process are influenced by the chosen method of replication, whether the replica is single-threaded or multithreaded, and the setting of relevant system variables. The overall aim of the recovery process is to identify what transactions had already been applied on the replica's database before the unexpected halt occurred, and retrieve and apply the transactions that the replica missed following the unexpected halt.

adofsauron commented 1 year ago

For file position based replication, the recovery process needs an accurate replication SQL thread (applier) position showing the last transaction that was applied on the replica. Based on that position, the replication I/O thread (receiver) retrieves from the source's binary log all of the transactions that should be applied on the replica from that point on.

adofsauron commented 1 year ago

The recovery process fails if gaps in the sequence of transactions cannot be filled using the information in the relay log. For a single-threaded replica, the recovery process only needs to use the relay log if the relevant information is not available in the applier metadata repository.
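
As a rough illustration of what "gaps in the sequence of transactions" means, the sketch below finds missing sequence numbers and checks whether a relay log could fill them. The function names and the use of plain integers as transaction sequence numbers are assumptions made for this example; MySQL's actual bookkeeping is more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Returns the sequence numbers missing between the lowest and the highest
// applied transaction (hypothetical helper, not a MySQL API).
std::vector<std::uint64_t> find_gaps(const std::set<std::uint64_t> &applied) {
  std::vector<std::uint64_t> gaps;
  if (applied.empty()) return gaps;
  std::uint64_t expected = *applied.begin();
  for (std::uint64_t seq : applied) {
    assert(expected <= seq);  // std::set iterates in ascending order
    for (; expected < seq; ++expected) gaps.push_back(expected);
    expected = seq + 1;
  }
  return gaps;
}

// Recovery can proceed only if every gap is still available in the relay log.
bool gaps_fillable(const std::vector<std::uint64_t> &gaps,
                   const std::set<std::uint64_t> &relay_log) {
  return std::all_of(gaps.begin(), gaps.end(), [&](std::uint64_t seq) {
    return relay_log.count(seq) > 0;
  });
}
```

If `gaps_fillable` returns false in this model, it corresponds to the failure case described above.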

adofsauron commented 1 year ago

Each rollback segment maintains a segment header page, which is divided into 1024 slots (TRX_RSEG_N_SLOTS). Each slot corresponds to one undo log object, so InnoDB theoretically supports a maximum of 96 * 1024 concurrent ordinary transactions.

adofsauron commented 1 year ago

Read the first log file header to print a note if this is a recovery from a restored Hot Backup

adofsauron commented 1 year ago

Start reading the log groups from the checkpoint lsn up. The variable contiguous_lsn contains an lsn up to which the log is known to be contiguously written to all log groups.

adofsauron commented 1 year ago

When a logical backup explicitly opens a read view and runs for a long time, the slave_relay_log_info table cannot be purged, so its version chain grows long. When the backup then reaches the slave_relay_log_info table, reconstructing the old row version takes a long time. Meanwhile the replication thread, which needs to update slave_relay_log_info, is stuck waiting for the page latch. This can eventually cause a semaphore wait timeout, at which point the instance deliberately crashes itself.

adofsauron commented 1 year ago

When the instance recovers from a crash, unfinished transactions need to be recovered from the undo logs. A transaction in the active state is rolled back directly. A transaction in the prepared state is committed if its binlog has already been written; otherwise it is rolled back.
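
The decision rule above can be sketched as a small function. `RecoveredTrx`, `decide`, and the set of XIDs found in the binlog are hypothetical names for this example and do not correspond to real server structures.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

enum class TrxState { Active, Prepared };
enum class Action { Rollback, Commit };

// Hypothetical view of a transaction found in the undo logs after a crash.
struct RecoveredTrx {
  std::uint64_t xid;
  TrxState state;
};

// Prepared transactions whose XID is in the binlog commit; all others roll back.
Action decide(const RecoveredTrx &trx,
              const std::unordered_set<std::uint64_t> &binlog_xids) {
  if (trx.state == TrxState::Prepared && binlog_xids.count(trx.xid) > 0) {
    return Action::Commit;
  }
  return Action::Rollback;
}

// Resolves every recovered transaction in order.
std::vector<Action> resolve_all(const std::vector<RecoveredTrx> &trxs,
                                const std::unordered_set<std::uint64_t> &binlog_xids) {
  std::vector<Action> actions;
  actions.reserve(trxs.size());
  for (const RecoveredTrx &trx : trxs) actions.push_back(decide(trx, binlog_xids));
  assert(actions.size() == trxs.size());  // exactly one decision per transaction
  return actions;
}
```

Using the binlog as the arbiter keeps the storage engine and the binlog consistent: whatever reached the binlog may already have been sent to replicas, so it has to be committed locally as well.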

adofsauron commented 1 year ago

struct log_group_struct
{
  /* The following fields are protected by log_sys->mutex */
  ulint id;                /* log group id */
  ulint n_files;           /* number of files in the group */
  ulint file_size;         /* individual log file size in bytes,
                           including the log file header */
  ulint space_id;          /* file space which implements the log
                           group */
  ulint state;             /* LOG_GROUP_OK or
                           LOG_GROUP_CORRUPTED */
  dulint lsn;              /* lsn used to fix coordinates within
                           the log group */
  ulint lsn_offset;        /* the offset of the above lsn */
  ulint n_pending_writes;  /* number of currently pending flush
                          writes for this log group */
  byte **file_header_bufs; /* buffers for each file header in the
                          group */
  /*-----------------------------*/
  byte **archive_file_header_bufs; /* buffers for each file
                          header in the group */
  ulint archive_space_id;          /* file space which implements the log
                                  group archive */
  ulint archived_file_no;          /* file number corresponding to
                                  log_sys->archived_lsn */
  ulint archived_offset;           /* file offset corresponding to
                                   log_sys->archived_lsn, 0 if we have
                                   not yet written to the archive file
                                   number archived_file_no */
  ulint next_archived_file_no;     /* during an archive write,
                             until the write is completed, we
                             store the next value for
                             archived_file_no here: the write
                             completion function then sets the new
                             value to ..._file_no */
  ulint next_archived_offset;      /* like the preceding field */
  /*-----------------------------*/
  dulint scanned_lsn;   /* used only in recovery: recovery scan
                        succeeded up to this lsn in this log
                        group */
  byte *checkpoint_buf; /* checkpoint header is written from
                        this buffer to the group */
  UT_LIST_NODE_T(log_group_t)
  log_groups; /* list of log groups */
};
adofsauron commented 1 year ago

When crash recovery starts, the LSN recorded in the latest checkpoint is read, and the redo log is scanned from that LSN onward.
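
A minimal sketch of locating the replay start point, assuming the redo log is modeled as an in-memory vector sorted by LSN. `RedoRecord` and `first_record_to_replay` are hypothetical names; the real code reads log blocks from the log files instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory model of one redo record.
struct RedoRecord {
  std::uint64_t lsn;
};

// Returns the index of the first record at or after checkpoint_lsn.
// Everything before that index is already reflected in the data files.
std::size_t first_record_to_replay(const std::vector<RedoRecord> &log,
                                   std::uint64_t checkpoint_lsn) {
  const bool sorted = std::is_sorted(
      log.begin(), log.end(),
      [](const RedoRecord &a, const RedoRecord &b) { return a.lsn < b.lsn; });
  assert(sorted);  // the log is append-only, so LSNs never decrease
  (void)sorted;
  auto it = std::lower_bound(
      log.begin(), log.end(), checkpoint_lsn,
      [](const RedoRecord &rec, std::uint64_t lsn) { return rec.lsn < lsn; });
  return static_cast<std::size_t>(it - log.begin());
}
```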

adofsauron commented 1 year ago

While externally stored pages are being written, the free space in the redo log is checked every four pages. If the redo log space is insufficient, the checkpoint LSN is advanced.

adofsauron commented 1 year ago

If we are using the doublewrite method, we will check if there are half-written pages in data files, and restore them from the doublewrite buffer if possible

adofsauron commented 1 year ago

An MLOG_CHECKPOINT log record stores the checkpoint LSN. If the LSN stored in the record equals the checkpoint LSN recorded in the log header, the matching MLOG_CHECKPOINT record has been found, and the LSN scanned so far is saved in recv_sys->mlog_checkpoint_lsn.

adofsauron commented 1 year ago

Parsed log records are stored in a hash table keyed by a hash of the space id and page number. Records that modify the same page are chained together in a list.
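
The layout described above can be sketched with standard containers. `PageAddr`, `Rec`, and `RecvTable` are hypothetical names chosen for this example; InnoDB uses its own hash table and intrusive lists (see recv_addr_struct).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <unordered_map>

// Key of the hash table: one entry per (space id, page number).
struct PageAddr {
  std::uint32_t space;
  std::uint32_t page_no;
  bool operator==(const PageAddr &o) const {
    return space == o.space && page_no == o.page_no;
  }
};

struct PageAddrHash {
  std::size_t operator()(const PageAddr &a) const {
    // Fold both ids into one 64-bit value and hash that.
    return std::hash<std::uint64_t>()(
        (static_cast<std::uint64_t>(a.space) << 32) | a.page_no);
  }
};

struct Rec {
  std::uint64_t start_lsn;
};

class RecvTable {
 public:
  // Appends a record to the list of the page it modifies.
  void add(PageAddr addr, Rec rec) {
    std::list<Rec> &recs = table_[addr];
    assert(recs.empty() || recs.back().start_lsn <= rec.start_lsn);  // log order
    recs.push_back(rec);
  }

  // Returns the record list for a page, or nullptr if nothing is pending.
  const std::list<Rec> *find(PageAddr addr) const {
    auto it = table_.find(addr);
    return it == table_.end() ? nullptr : &it->second;
  }

  std::size_t pages() const { return table_.size(); }

 private:
  std::unordered_map<PageAddr, std::list<Rec>, PageAddrHash> table_;
};
```

Grouping by page means each page is read once during recovery and all of its pending changes are applied in one pass.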

adofsauron commented 1 year ago

struct recv_addr_struct
{
  ulint state;   /* RECV_NOT_PROCESSED, RECV_BEING_PROCESSED,
                 or RECV_PROCESSED */
  ulint space;   /* space id */
  ulint page_no; /* page number */
  UT_LIST_BASE_NODE_T(recv_t)
  rec_list; /* list of log records for this page */
  hash_node_t addr_hash;
};

/* Recovery system data structure */
typedef struct recv_sys_struct recv_sys_t;
struct recv_sys_struct
{
  mutex_t mutex; /* mutex protecting the fields apply_log_recs,
                 n_addrs, and the state field in each recv_addr
                 struct */
  ibool apply_log_recs;
  /* this is TRUE when log rec application to
  pages is allowed; this flag tells the
  i/o-handler if it should do log record
  application */
  ibool apply_batch_on;
  /* this is TRUE when a log rec application
  batch is running */
  dulint lsn; /* log sequence number */
  ulint last_log_buf_size;
  /* size of the log buffer when the database
  last time wrote to the log */
  byte *last_block;
  /* possible incomplete last recovered log
  block */
  byte *last_block_buf_start;
  /* the nonaligned start address of the
  preceding buffer */
  byte *buf; /* buffer for parsing log records */
  ulint len; /* amount of data in buf */
  dulint parse_start_lsn;
  /* this is the lsn from which we were able to
  start parsing log records and adding them to
  the hash table; ut_dulint_zero if a suitable
  start point not found yet */
  dulint scanned_lsn;
  /* the log data has been scanned up to this
  lsn */
  ulint scanned_checkpoint_no;
  /* the log data has been scanned up to this
  checkpoint number (lowest 4 bytes) */
  ulint recovered_offset;
  /* start offset of non-parsed log records in
  buf */
  dulint recovered_lsn;
  /* the log records have been parsed up to
  this lsn */
  dulint limit_lsn; /* recovery should be made at most up to this
                  lsn */
  ibool found_corrupt_log;
  /* this is set to TRUE if we during log
  scan find a corrupt log block, or a corrupt
  log record, or there is a log parsing
  buffer overflow */
  log_group_t *archive_group;
  /* in archive recovery: the log group whose
  archive is read */
  mem_heap_t *heap;        /* memory heap of log records and file
                           addresses*/
  hash_table_t *addr_hash; /* hash table of file addresses of pages */
  ulint n_addrs;           /* number of not processed hashed file
                           addresses in the hash table */
};
adofsauron commented 1 year ago

void recv_recover_page(
    /*==============*/
    ibool recover_backup, /* in: TRUE if we are recovering a backup
                          page: then we do not acquire any latches
                          since the page was read in outside the
                          buffer pool */
    ibool just_read_in,   /* in: TRUE if the i/o-handler calls this for
                          a freshly read page */
    page_t *page,         /* in: buffer page */
    ulint space,          /* in: space id */
    ulint page_no)        /* in: page number */
{
  buf_block_t *block = NULL;
  recv_addr_t *recv_addr;
  recv_t *recv;
  byte *buf;
  dulint start_lsn;
  dulint end_lsn;
  dulint page_lsn;
  dulint page_newest_lsn;
  ibool modification_to_page;
  ibool success;
  mtr_t mtr;

  mutex_enter(&(recv_sys->mutex));

  if (recv_sys->apply_log_recs == FALSE)
  {
    /* Log records should not be applied now */

    mutex_exit(&(recv_sys->mutex));

    return;
  }

  recv_addr = recv_get_fil_addr_struct(space, page_no);

  if ((recv_addr == NULL) || (recv_addr->state == RECV_BEING_PROCESSED) || (recv_addr->state == RECV_PROCESSED))
  {
    mutex_exit(&(recv_sys->mutex));

    return;
  }

  /* fprintf(stderr, "Recovering space %lu, page %lu\n", space, page_no); */

  recv_addr->state = RECV_BEING_PROCESSED;

  mutex_exit(&(recv_sys->mutex));

  mtr_start(&mtr);
  mtr_set_log_mode(&mtr, MTR_LOG_NONE);

  if (!recover_backup)
  {
    block = buf_block_align(page);

    if (just_read_in)
    {
      /* Move the ownership of the x-latch on the page to this OS
      thread, so that we can acquire a second x-latch on it. This
      is needed for the operations to the page to pass the debug
      checks. */

      rw_lock_x_lock_move_ownership(&(block->lock));
    }

    success = buf_page_get_known_nowait(RW_X_LATCH, page, BUF_KEEP_OLD, __FILE__, __LINE__, &mtr);
    ut_a(success);

#ifdef UNIV_SYNC_DEBUG
    buf_page_dbg_add_level(page, SYNC_NO_ORDER_CHECK);
#endif /* UNIV_SYNC_DEBUG */
  }

  /* Read the newest modification lsn from the page */
  page_lsn = mach_read_from_8(page + FIL_PAGE_LSN);

  if (!recover_backup)
  {
    /* It may be that the page has been modified in the buffer
    pool: read the newest modification lsn there */

    page_newest_lsn = buf_frame_get_newest_modification(page);

    if (!ut_dulint_is_zero(page_newest_lsn))
    {
      page_lsn = page_newest_lsn;
    }
  }
  else
  {
    /* In recovery from a backup we do not really use the buffer
    pool */

    page_newest_lsn = ut_dulint_zero;
  }

  modification_to_page = FALSE;
  start_lsn = end_lsn = ut_dulint_zero;

  recv = UT_LIST_GET_FIRST(recv_addr->rec_list);

  while (recv)
  {
    end_lsn = recv->end_lsn;

    if (recv->len > RECV_DATA_BLOCK_SIZE)
    {
      /* We have to copy the record body to a separate
      buffer */

      buf = mem_alloc(recv->len);

      recv_data_copy_to_buf(buf, recv);
    }
    else
    {
      buf = ((byte *)(recv->data)) + sizeof(recv_data_t);
    }

    if (recv->type == MLOG_INIT_FILE_PAGE)
    {
      page_lsn = page_newest_lsn;

      mach_write_to_8(page + UNIV_PAGE_SIZE - FIL_PAGE_END_LSN_OLD_CHKSUM, ut_dulint_zero);
      mach_write_to_8(page + FIL_PAGE_LSN, ut_dulint_zero);
    }

    if (ut_dulint_cmp(recv->start_lsn, page_lsn) >= 0)
    {
      if (!modification_to_page)
      {
        modification_to_page = TRUE;
        start_lsn = recv->start_lsn;
      }

#ifdef UNIV_DEBUG
      if (log_debug_writes)
      {
        fprintf(stderr, "InnoDB: Applying log rec type %lu len %lu to space %lu page no %lu\n", (ulong)recv->type,
                (ulong)recv->len, (ulong)recv_addr->space, (ulong)recv_addr->page_no);
      }
#endif /* UNIV_DEBUG */

      recv_parse_or_apply_log_rec_body(recv->type, buf, buf + recv->len, page, &mtr);
      mach_write_to_8(page + UNIV_PAGE_SIZE - FIL_PAGE_END_LSN_OLD_CHKSUM, ut_dulint_add(recv->start_lsn, recv->len));
      mach_write_to_8(page + FIL_PAGE_LSN, ut_dulint_add(recv->start_lsn, recv->len));
    }

    if (recv->len > RECV_DATA_BLOCK_SIZE)
    {
      mem_free(buf);
    }

    recv = UT_LIST_GET_NEXT(rec_list, recv);
  }

  mutex_enter(&(recv_sys->mutex));

  if (ut_dulint_cmp(recv_max_page_lsn, page_lsn) < 0)
  {
    recv_max_page_lsn = page_lsn;
  }

  recv_addr->state = RECV_PROCESSED;

  ut_a(recv_sys->n_addrs);
  recv_sys->n_addrs--;

  mutex_exit(&(recv_sys->mutex));

  if (!recover_backup && modification_to_page)
  {
    ut_a(block);

    buf_flush_recv_note_modification(block, start_lsn, end_lsn);
  }

  /* Make sure that committing mtr does not change the modification
  lsn values of page */

  mtr.modifications = FALSE;

  mtr_commit(&mtr);
}
adofsauron commented 1 year ago

static byte *recv_parse_or_apply_log_rec_body(
    /*=============================*/
    /* out: log record end, NULL if not a complete
    record */
    byte type,     /* in: type */
    byte *ptr,     /* in: pointer to a buffer */
    byte *end_ptr, /* in: pointer to the buffer end */
    page_t *page,  /* in: buffer page or NULL; if not NULL, then the log
                   record is applied to the page, and the log record
                   should be complete then */
    mtr_t *mtr)    /* in: mtr or NULL; should be non-NULL if and only if
                   page is non-NULL */
{
  dict_index_t *index = NULL;

  switch (type)
  {
    case MLOG_1BYTE:
    case MLOG_2BYTES:
    case MLOG_4BYTES:
    case MLOG_8BYTES:
      ptr = mlog_parse_nbytes(type, ptr, end_ptr, page);
      break;
    case MLOG_REC_INSERT:
    case MLOG_COMP_REC_INSERT:
      if (NULL != (ptr = mlog_parse_index(ptr, end_ptr, type == MLOG_COMP_REC_INSERT, &index)))
      {
        ut_a(!page || (ibool) !!page_is_comp(page) == index->table->comp);
        ptr = page_cur_parse_insert_rec(FALSE, ptr, end_ptr, index, page, mtr);
      }
      break;
    case MLOG_REC_CLUST_DELETE_MARK:
    case MLOG_COMP_REC_CLUST_DELETE_MARK:
      if (NULL != (ptr = mlog_parse_index(ptr, end_ptr, type == MLOG_COMP_REC_CLUST_DELETE_MARK, &index)))
      {
        ut_a(!page || (ibool) !!page_is_comp(page) == index->table->comp);
        ptr = btr_cur_parse_del_mark_set_clust_rec(ptr, end_ptr, index, page);
      }
      break;
    case MLOG_COMP_REC_SEC_DELETE_MARK:
      /* This log record type is obsolete, but we process it for
      backward compatibility with MySQL 5.0.3 and 5.0.4. */
      ut_a(!page || page_is_comp(page));
      ptr = mlog_parse_index(ptr, end_ptr, TRUE, &index);
      if (!ptr)
      {
        break;
      }
      /* Fall through */
    case MLOG_REC_SEC_DELETE_MARK:
      ptr = btr_cur_parse_del_mark_set_sec_rec(ptr, end_ptr, page);
      break;
    case MLOG_REC_UPDATE_IN_PLACE:
    case MLOG_COMP_REC_UPDATE_IN_PLACE:
      if (NULL != (ptr = mlog_parse_index(ptr, end_ptr, type == MLOG_COMP_REC_UPDATE_IN_PLACE, &index)))
      {
        ut_a(!page || (ibool) !!page_is_comp(page) == index->table->comp);
        ptr = btr_cur_parse_update_in_place(ptr, end_ptr, page, index);
      }
      break;
    case MLOG_LIST_END_DELETE:
    case MLOG_COMP_LIST_END_DELETE:
    case MLOG_LIST_START_DELETE:
    case MLOG_COMP_LIST_START_DELETE:
      if (NULL != (ptr = mlog_parse_index(
                       ptr, end_ptr, type == MLOG_COMP_LIST_END_DELETE || type == MLOG_COMP_LIST_START_DELETE, &index)))
      {
        ut_a(!page || (ibool) !!page_is_comp(page) == index->table->comp);
        ptr = page_parse_delete_rec_list(type, ptr, end_ptr, index, page, mtr);
      }
      break;
    case MLOG_LIST_END_COPY_CREATED:
    case MLOG_COMP_LIST_END_COPY_CREATED:
      if (NULL != (ptr = mlog_parse_index(ptr, end_ptr, type == MLOG_COMP_LIST_END_COPY_CREATED, &index)))
      {
        ut_a(!page || (ibool) !!page_is_comp(page) == index->table->comp);
        ptr = page_parse_copy_rec_list_to_created_page(ptr, end_ptr, index, page, mtr);
      }
      break;
    case MLOG_PAGE_REORGANIZE:
    case MLOG_COMP_PAGE_REORGANIZE:
      if (NULL != (ptr = mlog_parse_index(ptr, end_ptr, type == MLOG_COMP_PAGE_REORGANIZE, &index)))
      {
        ut_a(!page || (ibool) !!page_is_comp(page) == index->table->comp);
        ptr = btr_parse_page_reorganize(ptr, end_ptr, index, page, mtr);
      }
      break;
    case MLOG_PAGE_CREATE:
    case MLOG_COMP_PAGE_CREATE:
      ptr = page_parse_create(ptr, end_ptr, type == MLOG_COMP_PAGE_CREATE, page, mtr);
      break;
    case MLOG_UNDO_INSERT:
      ptr = trx_undo_parse_add_undo_rec(ptr, end_ptr, page);
      break;
    case MLOG_UNDO_ERASE_END:
      ptr = trx_undo_parse_erase_page_end(ptr, end_ptr, page, mtr);
      break;
    case MLOG_UNDO_INIT:
      ptr = trx_undo_parse_page_init(ptr, end_ptr, page, mtr);
      break;
    case MLOG_UNDO_HDR_DISCARD:
      ptr = trx_undo_parse_discard_latest(ptr, end_ptr, page, mtr);
      break;
    case MLOG_UNDO_HDR_CREATE:
    case MLOG_UNDO_HDR_REUSE:
      ptr = trx_undo_parse_page_header(type, ptr, end_ptr, page, mtr);
      break;
    case MLOG_REC_MIN_MARK:
    case MLOG_COMP_REC_MIN_MARK:
      ptr = btr_parse_set_min_rec_mark(ptr, end_ptr, type == MLOG_COMP_REC_MIN_MARK, page, mtr);
      break;
    case MLOG_REC_DELETE:
    case MLOG_COMP_REC_DELETE:
      if (NULL != (ptr = mlog_parse_index(ptr, end_ptr, type == MLOG_COMP_REC_DELETE, &index)))
      {
        ut_a(!page || (ibool) !!page_is_comp(page) == index->table->comp);
        ptr = page_cur_parse_delete_rec(ptr, end_ptr, index, page, mtr);
      }
      break;
    case MLOG_IBUF_BITMAP_INIT:
      ptr = ibuf_parse_bitmap_init(ptr, end_ptr, page, mtr);
      break;
    case MLOG_INIT_FILE_PAGE:
      ptr = fsp_parse_init_file_page(ptr, end_ptr, page);
      break;
    case MLOG_WRITE_STRING:
      ptr = mlog_parse_string(ptr, end_ptr, page);
      break;
    case MLOG_FILE_CREATE:
    case MLOG_FILE_RENAME:
    case MLOG_FILE_DELETE:
      ptr = fil_op_log_parse_or_replay(ptr, end_ptr, type, FALSE, ULINT_UNDEFINED);
      break;
    default:
      ptr = NULL;
      recv_sys->found_corrupt_log = TRUE;
  }

  ut_ad(!page || ptr);
  if (index)
  {
    dict_table_t *table = index->table;
    mem_heap_free(index->heap);
    mutex_free(&(table->autoinc_mutex));
    mem_heap_free(table->heap);
  }

  return (ptr);
}
adofsauron commented 1 year ago

recv_parse_log_recs
    --> recv_parse_log_rec
        --> recv_parse_or_apply_log_rec_body
adofsauron commented 1 year ago

If the primary database crashes before its binlog has been delivered to the secondary database, and the secondary is then promoted directly to primary, the old primary will be inconsistent with the new one. The old primary must be rebuilt from the new primary to restore a consistent state.

adofsauron commented 1 year ago

cnt:1 bzize:16384 totalsize:212992 cursize:16384
cnt:2 bzize:16384 totalsize:212992 cursize:32768
cnt:3 bzize:16384 totalsize:212992 cursize:49152
cnt:4 bzize:16384 totalsize:212992 cursize:65536
hint2
index_id:37 level:1 next_offset:4294967295 offset:3
cnt:5 bzize:16384 totalsize:212992 cursize:81920
hint1
index_id:37 level:0 next_offset:5 offset:4
cnt:6 bzize:16384 totalsize:212992 cursize:98304
hint1
index_id:37 level:0 next_offset:6 offset:5
cnt:7 bzize:16384 totalsize:212992 cursize:114688
hint1
index_id:37 level:0 next_offset:7 offset:6
cnt:8 bzize:16384 totalsize:212992 cursize:131072
hint1
index_id:37 level:0 next_offset:8 offset:7
cnt:9 bzize:16384 totalsize:212992 cursize:147456
hint1
index_id:37 level:0 next_offset:9 offset:8
cnt:10 bzize:16384 totalsize:212992 cursize:163840
hint1
index_id:37 level:0 next_offset:10 offset:9
cnt:11 bzize:16384 totalsize:212992 cursize:180224
hint1
index_id:37 level:0 next_offset:11 offset:10
cnt:12 bzize:16384 totalsize:212992 cursize:196608
hint1
index_id:37 level:0 next_offset:4294967295 offset:11
cnt:13 bzize:16384 totalsize:212992 cursize:212992
===INDEX_ID:37
level1 total block is (1)
block_no:         3,level:   1|*|
level0 total block is (8)
block_no:         4,level:   0|*|block_no:         5,level:   0|*|block_no:         6,level:   0|*|
block_no:         7,level:   0|*|block_no:         8,level:   0|*|block_no:         9,level:   0|*|
block_no:        10,level:   0|*|block_no:        11,level:   0|*|
adofsauron commented 1 year ago

block_no:3          space_id:20           index_id:37          
slot_nums:3         heaps_rows:10         n_rows:8         
heap_top:224        del_bytes:0           last_ins_offset:216        
page_dir:2          page_n_dir:7          
leaf_inode_space:20         leaf_inode_pag_no:2         
leaf_inode_offset:242       
no_leaf_inode_space:20      no_leaf_inode_pag_no:2         
no_leaf_inode_offset:50        
last_modify_lsn:3011837
page_type:B+_TREE level:1     
adofsauron commented 1 year ago

If the operating system, the storage subsystem, or the mysqld process exits unexpectedly in the middle of a page write, a good copy of the page can be found in the doublewrite buffer during crash recovery.
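
A minimal sketch of the restore step, assuming a torn page is detected by a checksum mismatch. `PageImage`, `toy_checksum`, and `restore_if_torn` are hypothetical, and the checksum is deliberately a toy function, not the algorithm InnoDB uses.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical page image: body bytes plus the checksum stored with them.
struct PageImage {
  std::vector<std::uint8_t> body;
  std::uint32_t stored_checksum;
};

// Toy checksum for illustration only.
std::uint32_t toy_checksum(const std::vector<std::uint8_t> &body) {
  std::uint32_t sum = 0;
  for (std::uint8_t b : body) sum = sum * 31u + b;
  return sum;
}

bool is_intact(const PageImage &page) {
  return toy_checksum(page.body) == page.stored_checksum;
}

// Returns true if the data-file page is usable after the call. A torn page is
// overwritten with the doublewrite copy when that copy is itself intact.
bool restore_if_torn(PageImage &data_page, const PageImage &doublewrite_copy) {
  if (is_intact(data_page)) return true;          // nothing to repair
  if (!is_intact(doublewrite_copy)) return false; // no good copy available
  data_page = doublewrite_copy;
  assert(is_intact(data_page));
  return true;
}
```

Redo records are applied to a page only after this step, since they assume an internally consistent page image.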

adofsauron commented 1 year ago

Undo logs exist within undo log segments, which are contained within rollback segments. Rollback segments reside in the system tablespace, in undo tablespaces, and in the temporary tablespace

adofsauron commented 1 year ago

The number of transactions supported by a rollback segment depends on the number of undo slots in the rollback segment and the number of undo logs required per transaction

adofsauron commented 1 year ago

Transactions that perform INSERT, UPDATE, and DELETE operations on regular and temporary tables require a full allocation of four undo logs. Transactions that perform INSERT operations only on regular tables require a single undo log
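
The arithmetic behind these limits can be sketched as follows. The divisor of 16 bytes per undo slot follows the formula given in the MySQL reference manual (innodb_page_size / 16 slots per rollback segment); the function names are hypothetical, and dividing one shared pool by the number of undo logs per transaction is a simplification, since temporary tables draw on separate rollback segments.

```cpp
#include <cassert>
#include <cstdint>

// Undo slots in one rollback segment for a given page size.
constexpr std::uint64_t undo_slots_per_rseg(std::uint64_t page_size) {
  return page_size / 16;
}

// Rough upper bound on concurrent read-write transactions when every
// transaction needs undo_logs_per_trx undo logs (between 1 and 4).
std::uint64_t max_concurrent_trx(std::uint64_t page_size,
                                 std::uint64_t rollback_segments,
                                 std::uint64_t undo_logs_per_trx) {
  assert(undo_logs_per_trx >= 1 && undo_logs_per_trx <= 4);
  return undo_slots_per_rseg(page_size) * rollback_segments / undo_logs_per_trx;
}
```

With a 16 KB page and 96 rollback segments, one undo log per transaction gives 1024 * 96 = 98304, which matches the 96 * 1024 figure quoted earlier in this thread.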

adofsauron commented 1 year ago

With a RocksDB-based storage engine, recovery also has to be coordinated with the source (master) database.

adofsauron commented 1 year ago

class PageHeader
{
  friend NdbOut& operator<<(NdbOut&, const PageHeader&);

public:
  bool check();
  Uint32 getLogRecordSize();
  bool lastPage();
  Uint32 lastWord();

protected:
  Uint32 m_checksum;
  Uint32 m_lap;
  Uint32 m_max_gci_completed;
  Uint32 m_max_gci_started;
  Uint32 m_next_page;
  Uint32 m_previous_page;
  Uint32 m_ndb_version;
  Uint32 m_number_of_logfiles;
  Uint32 m_current_page_index;
  Uint32 m_old_prepare_file_number;
  Uint32 m_old_prepare_page_reference;
  Uint32 m_dirty_flag;
  /* Debug info Start */
  Uint32 m_log_timer;
  Uint32 m_page_i_value;
  Uint32 m_place_written_from;
  Uint32 m_page_no;
  Uint32 m_file_no;
  Uint32 m_word_written;
  Uint32 m_in_writing_flag;
  Uint32 m_prev_page_no;
  Uint32 m_in_free_list;
  /* Debug info End */
};

adofsauron commented 1 year ago

trx_rseg_t *rseg;

#ifdef UNIV_SYNC_DEBUG
ut_ad(mutex_own(&kernel_mutex));
#endif /* UNIV_SYNC_DEBUG */
ut_ad(trx->rseg == NULL);

if (trx->type == TRX_PURGE)
{
  trx->id = ut_dulint_zero;
  trx->conc_state = TRX_ACTIVE;
  trx->start_time = time(NULL);

  return (TRUE);
}

ut_ad(trx->conc_state != TRX_ACTIVE);

if (rseg_id == ULINT_UNDEFINED)
{
  rseg_id = trx_assign_rseg();
}

rseg = trx_sys_get_nth_rseg(trx_sys, rseg_id);

trx->id = trx_sys_get_new_trx_id();

/* The initial value for trx->no: ut_dulint_max is used in
read_view_open_now: */

trx->no = ut_dulint_max;

trx->rseg = rseg;

trx->conc_state = TRX_ACTIVE;
trx->start_time = time(NULL);

UT_LIST_ADD_FIRST(trx_list, trx_sys->trx_list, trx);

adofsauron commented 1 year ago

Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 0
 0: len 6; hex 000000000200; asc ;;
 1: len 6; hex 000000000505; asc ;;
 2: len 7; hex 800000002d0110; asc - ;;
 3: len 4; hex 80000001; asc ;;
 4: len 4; hex 80000003; asc ;;

adofsauron commented 1 year ago
INVALID = 0,
// -----------------------------
// Catalog
// -----------------------------
CREATE_TABLE = 1,
DROP_TABLE = 2,

CREATE_SCHEMA = 3,
DROP_SCHEMA = 4,

CREATE_VIEW = 5,
DROP_VIEW = 6,

CREATE_SEQUENCE = 8,
DROP_SEQUENCE = 9,
SEQUENCE_VALUE = 10,

CREATE_MACRO = 11,
DROP_MACRO = 12,

CREATE_TYPE = 13,
DROP_TYPE = 14,

ALTER_INFO = 20,

CREATE_TABLE_MACRO = 21,
DROP_TABLE_MACRO = 22,

CREATE_INDEX = 23,
DROP_INDEX = 24,

// -----------------------------
// Data
// -----------------------------
USE_TABLE = 25,
INSERT_TUPLE = 26,
DELETE_TUPLE = 27,
UPDATE_TUPLE = 28,
// -----------------------------
// Flush
// -----------------------------
CHECKPOINT = 99,
WAL_FLUSH = 100
adofsauron commented 1 year ago
FieldWriter writer(serializer);
writer.WriteString(GetSchemaName());
writer.WriteString(GetTableName());
writer.WriteString(name);
writer.WriteString(sql);
writer.WriteField(index->type);
writer.WriteField(index->constraint_type);
writer.WriteSerializableList(expressions);
writer.WriteSerializableList(parsed_expressions);
writer.WriteList<idx_t>(index->column_ids);
writer.Finalize();
adofsauron commented 1 year ago
    while (true) {
        // read the current entry
        WALType entry_type = initial_reader->Read<WALType>();
        if (entry_type == WALType::WAL_FLUSH) {
            // check if the file is exhausted
            if (initial_reader->Finished()) {
                // we finished reading the file: break
                break;
            }
        } else {
            // replay the entry
            checkpoint_state.ReplayEntry(entry_type);
        }
    }
adofsauron commented 1 year ago

If there is a checkpoint flag, we might have already flushed the contents of the WAL to disk.
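
A rough sketch of how a checkpoint flag can short-circuit replay. `WalEntry`, the `payload` field, and `on_disk_checkpoint_id` are assumptions made for this example; only the numeric values of the entry types are taken from the enum quoted earlier in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Subset of the WAL entry types, with the values from the enum above.
enum class WalType : std::uint8_t {
  InsertTuple = 26,
  Checkpoint = 99,
  WalFlush = 100
};

// Hypothetical decoded WAL entry; payload holds a checkpoint id or row data.
struct WalEntry {
  WalType type;
  std::uint64_t payload;
};

// Returns the number of entries replayed. If the WAL contains a checkpoint
// entry matching the checkpoint already on disk, nothing needs replaying.
std::size_t replay(const std::vector<WalEntry> &wal,
                   std::uint64_t on_disk_checkpoint_id) {
  // First pass: look for a checkpoint flag that the data file already reflects.
  for (const WalEntry &entry : wal) {
    if (entry.type == WalType::Checkpoint &&
        entry.payload == on_disk_checkpoint_id) {
      return 0;  // WAL contents were flushed before the crash
    }
  }
  // Second pass: replay data entries, skipping flush and checkpoint markers.
  std::size_t replayed = 0;
  for (const WalEntry &entry : wal) {
    if (entry.type == WalType::WalFlush || entry.type == WalType::Checkpoint) {
      continue;
    }
    ++replayed;  // a real implementation would apply the entry here
  }
  assert(replayed <= wal.size());
  return replayed;
}
```

Scanning for the flag first avoids applying changes twice when the crash happened after the checkpoint was written but before the WAL was truncated.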

adofsauron commented 1 year ago

Uint32 m_operationType; // 0 READ, 1 UPDATE, 2 INSERT, 3 DELETE
