
Storage with ability to convert random writes into sequential writes #9028

Closed: andrii0lomakin closed this issue 4 years ago

andrii0lomakin commented 5 years ago

The page size we currently use is 64 KB. This size is not optimal given the ratio between the latency and throughput of an SSD; the optimal page size right now is about 4 KB. Also, suppose we need to read records from a cluster to handle a graph query initiated by the user, and those records are not cached: using 64 KB pages introduces a lot of read amplification compared to 4 KB pages.

There is a problem, though. Our cache data structures live on the Java heap, and migrating to 4 KB pages would put tremendous pressure on the GC. To overcome this, we can decrease the page size to 16 KB and use LZ4 compression to shrink each page down to about 4 KB. In our tests, reaching a compression ratio of 4 or higher is not an exception for our data.
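A minimal sketch of the page-compression idea, using the lz4-java library; the class name `PageCompressor` and the exact buffer handling are illustrative assumptions, not the actual implementation:

```java
import java.util.Arrays;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

// Hypothetical sketch: compresses fixed-size 16 KB logical pages so that,
// with a compression ratio of ~4, the on-disk image is about 4 KB.
public final class PageCompressor {
  private static final int PAGE_SIZE = 16 * 1024; // logical page size

  private final LZ4Compressor compressor =
      LZ4Factory.fastestInstance().fastCompressor();
  private final LZ4FastDecompressor decompressor =
      LZ4Factory.fastestInstance().fastDecompressor();

  /** Compresses a 16 KB page into a variable-size on-disk image. */
  public byte[] compress(byte[] page) {
    assert page.length == PAGE_SIZE;
    final int maxLen = compressor.maxCompressedLength(PAGE_SIZE);
    final byte[] dest = new byte[maxLen];
    final int compressedLen =
        compressor.compress(page, 0, PAGE_SIZE, dest, 0, maxLen);
    return Arrays.copyOf(dest, compressedLen);
  }

  /** Restores the original 16 KB page from its compressed image. */
  public byte[] decompress(byte[] compressed) {
    final byte[] page = new byte[PAGE_SIZE];
    decompressor.decompress(compressed, 0, page, 0, PAGE_SIZE);
    return page;
  }
}
```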
Because we are going to use page compression, we have to handle pages of variable size. For that we need to introduce a map between the logical page index and the physical page position, which in a nutshell is an array list (see the sketch after the list below). Using such a map is not exceptional: Microsoft successfully uses it in its own databases. The page map also gives us the ability to convert random writes into sequential writes and to defragment file segments on the fly. On average, sequential writes on an SSD are about 5 times faster than random writes. So if we split the storage into segments of about 64 MB, we can reuse any segment that is at least 20% empty. In the worst case the new implementation will consume 20% more space, but take into account that:

  1. In many practical cases we observe a skewed write pattern: only a small portion of the data changes often while the rest stays untouched. So we should expect segments with a large amount of free space to appear.
  2. Because of the good compression ratio, we can afford to leave a substantial amount of free space unused inside segments, as long as the wasted space stays below what the current compression ratio saves us. We expect that the portion of data which has to be defragmented when we start reusing an existing segment will be quite small compared to the new data being written, so we should observe a substantial speed improvement.
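As a rough illustration of the logical-to-physical page map and the segment reuse threshold described above (all names here, such as `PageMap`, are hypothetical; this is a sketch of the concept, not the planned implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: every page write is appended at the current tail of the file
// (sequential I/O) and the map is updated; the previous physical location
// becomes garbage that is tracked per 64 MB segment.
public final class PageMap {
  public static final long SEGMENT_SIZE = 64L * 1024 * 1024;

  /** physicalPosition.get(logicalIndex) = file offset, -1 if never written. */
  private final List<Long> physicalPosition = new ArrayList<>();
  private long appendOffset; // next sequential write goes here

  public long physicalPosition(int logicalIndex) {
    return logicalIndex < physicalPosition.size()
        ? physicalPosition.get(logicalIndex) : -1;
  }

  /** Registers a (compressed, variable-size) page appended at the file tail. */
  public long append(int logicalIndex, int compressedSize) {
    while (physicalPosition.size() <= logicalIndex) {
      physicalPosition.add(-1L);
    }
    final long position = appendOffset;
    physicalPosition.set(logicalIndex, position);
    appendOffset += compressedSize;
    return position;
  }

  /** A segment may be reused once at least 20% of it is garbage. */
  public static boolean reusable(long garbageBytes) {
    return garbageBytes * 5 >= SEGMENT_SIZE;
  }
}
```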

Also, a storage engine that incorporates GC to deal with invalidated pages and supports pages of variable size makes it possible to avoid loading pages for writes when they are not present in the cache: we can log the operations on those pages instead, and then merge the logged operations with the actual pages together during the GC phase.
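A sketch of this log-and-merge idea under simplifying assumptions (single-threaded, plain byte-array pages, Java 16+ records); `DeferredPageWrites` and `PageOp` are hypothetical names:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch: instead of reading a page just to apply a write, the operation is
// logged. During the GC phase the logged operations are replayed on top of
// the page image read from disk, and the merged page is appended sequentially.
public final class DeferredPageWrites {
  /** A logged write: replace bytes at `offset` within the page. */
  record PageOp(int offset, byte[] data) {}

  private final Map<Integer, Deque<PageOp>> pendingOps = new HashMap<>();

  /** Write path: the page is absent from the cache, so only log the change. */
  public void logWrite(int logicalPageIndex, int offset, byte[] data) {
    pendingOps.computeIfAbsent(logicalPageIndex, k -> new ArrayDeque<>())
        .add(new PageOp(offset, data.clone()));
  }

  /** GC phase: merge the logged operations into the on-disk page image. */
  public byte[] merge(int logicalPageIndex, byte[] pageFromDisk) {
    final Deque<PageOp> ops = pendingOps.remove(logicalPageIndex);
    if (ops == null) {
      return pageFromDisk;
    }
    final byte[] merged = pageFromDisk.clone();
    for (PageOp op : ops) { // replay in the order the writes were logged
      System.arraycopy(op.data(), 0, merged, op.offset(), op.data().length);
    }
    return merged;
  }
}
```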

As an additional speed-up for graph queries, we are going to issue batches of async IO requests to load the pages that are missing from the cache in parallel.
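A minimal sketch of batched asynchronous page reads using the standard `AsynchronousFileChannel` API; the `BatchPageLoader` class is an illustrative assumption, not actual code from the engine:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Sketch: issue one read per missing page up front, then wait for all of
// them, so the SSD can serve the requests concurrently.
public final class BatchPageLoader {
  private final AsynchronousFileChannel channel;
  private final int pageSize;

  public BatchPageLoader(Path file, int pageSize) throws IOException {
    this.channel = AsynchronousFileChannel.open(file, StandardOpenOption.READ);
    this.pageSize = pageSize;
  }

  /** Loads the pages at the given physical positions in parallel. */
  public List<ByteBuffer> load(long[] physicalPositions)
      throws ExecutionException, InterruptedException {
    final List<ByteBuffer> buffers = new ArrayList<>(physicalPositions.length);
    final List<Future<Integer>> reads = new ArrayList<>(physicalPositions.length);

    // Submit all reads first ...
    for (long position : physicalPositions) {
      final ByteBuffer buffer = ByteBuffer.allocateDirect(pageSize);
      buffers.add(buffer);
      reads.add(channel.read(buffer, position));
    }
    // ... and only then block until every read has completed.
    for (int i = 0; i < reads.size(); i++) {
      reads.get(i).get();
      buffers.get(i).flip();
    }
    return buffers;
  }
}
```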

andrii0lomakin commented 4 years ago

Unfortunately, the memory overhead of the page mapping in Java would be overwhelming, so we are going to use the change buffer concept implemented in MySQL instead.
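For illustration, a toy version of the change-buffer behavior (in MySQL's InnoDB, changes to uncached pages are buffered and merged when the page is next read into the buffer pool); `ChangeBuffer` is a hypothetical name, and a real change buffer is persisted and crash-safe, unlike this in-memory sketch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Sketch: modifications to pages that are not cached are buffered in memory
// and merged the next time the page is read into the cache, so the write
// path never has to fetch the page from disk.
public final class ChangeBuffer {
  private final Map<Integer, List<Consumer<byte[]>>> buffered = new HashMap<>();

  /** Write path: the page is not cached, so just remember the change. */
  public void buffer(int pageIndex, Consumer<byte[]> change) {
    buffered.computeIfAbsent(pageIndex, k -> new ArrayList<>()).add(change);
  }

  /** Read path: the page was just loaded, apply its buffered changes. */
  public void mergeOnRead(int pageIndex, byte[] page) {
    final List<Consumer<byte[]>> changes = buffered.remove(pageIndex);
    if (changes != null) {
      changes.forEach(change -> change.accept(page));
    }
  }
}
```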

andrii0lomakin commented 4 years ago

Closed because all changes will be implemented inside of https://github.com/orientechnologies/orientdb/issues/9029