reverbrain / eblob

Eblob is an append-only low-level IO library, which saves data in blob files. Created as low-level backend for elliptics
GNU Lesser General Public License v3.0

New headers and chunked checksums #111

Open shaitan opened 9 years ago

shaitan commented 9 years ago

Introduction

Each record has a header that is stored both in the blob and in the index. The header is a binary dump of an eblob_disk_control object; it has a fixed size and, in a blob, is placed at the beginning of the record. The header contains the record's meta information: sizes, position, etc. Unless the footer is disabled, each record also has a footer, which is a binary dump of eblob_disk_footer; it also has a fixed size and, in a blob, is placed at the end of the record. The footer contains the checksum of the record.

Problems

  1. If we decide to extend the header, we will have to convert all blobs to the new header format.
  2. Record checksumming time depends on the record size, so it takes a long time for a huge record.

Solutions

1. Extendable headers

We can use msgpack with fixed fields for header serialization. If the header is extended, blobs with the old header will remain readable, but all new writes will go to new blobs with the new header. Defragmentation can also convert blobs with old headers to the new format.
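To illustrate why a fixed-order field list keeps old blobs readable, here is a minimal sketch. It does not use msgpack: a count byte followed by little-endian 64-bit integers stands in for a msgpack array, so that the example needs only the standard library. The field names, the extra `chunk_size` field, and both helper functions are hypothetical and do not reflect eblob's actual on-disk layout.

```python
import struct

# Hypothetical field order; names are illustrative, not eblob's real layout.
FIELDS_V1 = ("flags", "data_size", "disk_size", "position")
# A hypothetical later revision that appends one field at the end.
FIELDS_V2 = FIELDS_V1 + ("chunk_size",)


def pack_header(values, fields):
    """Serialize a header as: field count (1 byte) + one u64 per field."""
    return struct.pack("<B%dQ" % len(fields), len(fields),
                       *(values[name] for name in fields))


def unpack_header(buf, fields=FIELDS_V2):
    """Read a header written by any revision.

    Fields the writer did not know about default to 0; fields the
    reader does not know about are ignored.
    """
    count = buf[0]
    raw = struct.unpack_from("<%dQ" % count, buf, 1)
    header = dict.fromkeys(fields, 0)
    header.update(zip(fields, raw))
    return header


old_values = {"flags": 1, "data_size": 10, "disk_size": 64, "position": 128}
old_blob_header = pack_header(old_values, FIELDS_V1)

# A reader that knows the newer field list still understands the old header.
decoded = unpack_header(old_blob_header)
```

Because new fields are only ever appended, a reader can tell from the field count which revision wrote the header, which is the property that lets old blobs stay in place until defragmentation rewrites them.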

2. Checksumming of huge files

We can split the file into chunks and checksum each chunk. We can also add a new record flag for records that are checksummed by chunk; this avoids having to convert existing blobs up front, and blobs can be converted during defragmentation.
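The idea can be sketched as follows. This is only an illustration under stated assumptions: the chunk size, the choice of SHA-256, and the function names are hypothetical and are not taken from eblob. It shows one plausible benefit of per-chunk checksums: a partial read only needs to verify the chunks it overlaps, so the cost no longer grows with the full record size.

```python
import hashlib
from typing import List

CHUNK_SIZE = 1 << 20  # hypothetical chunk size (1 MiB), chosen for illustration


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Return one digest per chunk; the last chunk may be shorter."""
    return [hashlib.sha256(data[off:off + chunk_size]).digest()
            for off in range(0, len(data), chunk_size)]


def verify_range(data: bytes, checksums: List[bytes], offset: int, size: int,
                 chunk_size: int = CHUNK_SIZE) -> bool:
    """Verify only the chunks that overlap [offset, offset + size)."""
    if size <= 0:
        return True
    first = offset // chunk_size
    last = (offset + size - 1) // chunk_size
    for i in range(first, last + 1):
        chunk = data[i * chunk_size:(i + 1) * chunk_size]
        if hashlib.sha256(chunk).digest() != checksums[i]:
            return False
    return True


record = bytes(range(256)) * 40           # 10240 bytes of sample data
sums = chunk_checksums(record, 4096)      # three chunks: 4096 + 4096 + 2048
ok = verify_range(record, sums, 5000, 100, 4096)  # touches chunk 1 only
```

In a real blob the per-chunk digests would live in the record footer, and a record flag would tell the reader which footer format to expect.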

shaitan commented 9 years ago

Chunked checksums are implemented in https://github.com/reverbrain/eblob/pull/131 and https://github.com/reverbrain/elliptics/pull/629