tapaswenipathak / linux-kernel-stats

linux kernel stats (Publication [Journal, Magazine]). This repository has code files.

summarize: LinnOS: Predictability on Unpredictable Flash Storage with a Light Neural Network #123

Closed reddheeraj closed 1 year ago

reddheeraj commented 1 year ago

https://www.usenix.org/system/files/osdi20-hao.pdf

reddheeraj commented 1 year ago

LinnOS

The paper introduces LinnOS, an operating system that uses a light neural network to infer per-I/O SSD speed, improving the performance of parallel storage applications. LinnOS works with black-box devices and real production workloads without requiring any input from users, and it outperforms prior methods. In the evaluation, LinnOS improved average I/O latencies by 9.6-79.6% with 87-97% inference accuracy and 4-6µs of overhead per I/O, demonstrating that machine learning is feasible for real-time decision-making inside operating systems.

Introduction

Predictable performance and low latency are crucial requirements for data center systems serving web search, email, and other interactive services. Despite the advancements in SSD technology, achieving highly predictable latency on modern flash devices remains a challenging problem due to the increasing complexity and intrinsic idiosyncrasies of NAND flash. Operations like garbage collection, buffer flushing, wear levelling, and read repairs can negatively impact latency predictability.

To address this issue, researchers have taken three approaches: "white-box" methods that re-architect device internals, "gray-box" methods that suggest partial device-level modification combined with OS or application-level changes, and "black-box" techniques that attempt to mask unpredictability without modifying the underlying hardware or abstraction level. However, the most widely adopted solution is speculative execution due to its simplicity.

The authors propose a new approach - LinnOS, an operating system that can learn the behavior of the underlying flash device in a black-box way and use the results of the learning to increase predictability. LinnOS helps storage arrays and clusters achieve extreme latency predictability using a lightweight neural network. The key challenge for LinnOS is to be as effective as speculative execution, which increases predictability by sending a duplicate I/O to another node or device but incurs poor resource utilization.

LinnOS introduces three technical contributions: converting the latency inference problem into a simple binary inference, taking advantage of the typical latency distributions, and implementing a lightweight neural network to make the inference highly usable. To the best of the authors' knowledge, there is no existing learning approach for I/O scheduling that supports fine-grained learning and fast online inference on a per-I/O scale.

Background

The authors discuss the problem of unpredictable read latency in read-write workloads across different SSD models. They find that write latencies are stable, but read latencies are disturbed by internal complexities such as contention with garbage collection, buffer flushing, and other internal operations, which make latency behavior hard to infer. Traditional storage applications apply a "wait-then-speculate" approach, but speculative execution is ineffective for flash storage, where expected response times are below a few milliseconds. The authors also tried simple heuristics and classic machine learning techniques, but these did not yield satisfactory results.

Overview

LinnOS is a system for parallel, redundant storage that improves I/O latency. Storage applications tag latency-critical I/Os with a one-bit flag; before submitting each such I/O to the underlying SSD, LinnOS uses a trained neural network to make a binary inference: fast or slow. If the inference is "slow", the I/O is revoked and a "slow" error code is returned, letting the storage application fail over to another replica (a sketch of this path follows below).

The overall architecture includes a speedy inference model, tracing of the live workload, labelling with inflection point analysis, training, and uploading of the resulting weights. Training is handled by a supporting application (LinnApp), which labels the traced I/Os and runs the training phase in TensorFlow. The model's input features describe outstanding and recently completed I/Os; its output is the binary speed inference for the I/O.

Making online, fine-grained inferences on I/O speed with machine learning raises three challenges:

- High accuracy, which depends on careful output labeling and input feature selection; the simple two-class formulation (fast or slow) helps the model achieve it.
- Fast inference, which requires balancing accuracy against performance with CPU-friendly models.
- Anticipating heterogeneity, which requires collecting per-device traces, training a model for every load-device pair in the array, and occasionally retraining to track workload changes over time.
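
As a concrete illustration of the submit/revoke/failover path above, here is a minimal Python sketch. All names here (`submit_io`, `SLOW_ERRNO`, the stubbed inference) are hypothetical stand-ins, not LinnOS's actual kernel interface.

```python
import random

SLOW_ERRNO = -1  # stand-in for the "slow" error code LinnOS returns on revocation

def submit_io(device, block, critical):
    """Stub for the kernel I/O path: LinnOS would run its neural network here
    and revoke a critical I/O (returning SLOW_ERRNO) if it infers "slow"."""
    inferred_slow = random.random() < 0.05           # placeholder inference
    if critical and inferred_slow:
        return SLOW_ERRNO, None                      # revoked: let the app fail over
    return 0, f"data@{device}:{block}"               # submitted and completed

def read_with_failover(block, replicas):
    """Application-side failover: try each replica until one accepts the I/O."""
    for dev in replicas:
        status, data = submit_io(dev, block, critical=True)  # tagged latency-critical
        if status != SLOW_ERRNO:
            return data
    # Every replica inferred "slow": fall back to a normal, non-revocable read
    _, data = submit_io(replicas[0], block, critical=False)
    return data

print(read_with_failover(4096, ["sda", "sdb", "sdc"]))
```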

LinnOS Design

LinnOS solves the challenge of accurately inferring I/O speed in real time, and the key to its success is a lightweight neural network model. The design is presented chronologically: data collection, labeling via inflection point analysis, model design, accuracy and performance improvements, and a summary of the advantages.

Training data is collected by tracing the real workload running on the drive, recording five raw fields per I/O: submission time, block offset, block size, read/write flag, and completion time. Traces are collected during a busy hour; if workload behavior later shifts dramatically, the device can be retraced and the model retrained. For labeling, the authors observe that latencies form a Pareto distribution with a high alpha value, so most I/Os are fast and a long tail is slow. The fast and slow regions are separated at the best inflection point, the cutoff that maximizes the expected latency reduction (a simplified version is sketched below).
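
A simplified Python sketch of inflection-point selection under an assumed constant failover cost; LinnApp's real analysis is more involved, so treat this as an illustration only.

```python
import numpy as np

def inflection_point(latencies_us, failover_cost_us=150.0):
    """Pick the fast/slow cutoff that maximizes expected latency reduction.

    For each candidate cutoff, assume every I/O slower than the cutoff is
    failed over to a replica at roughly failover_cost_us (an assumed
    constant), and keep the cutoff minimizing the resulting mean latency.
    """
    lat = np.sort(np.asarray(latencies_us, dtype=float))
    best_cut, best_mean = lat[-1], lat.mean()
    for cut in np.percentile(lat, range(50, 100)):  # scan p50..p99 candidates
        adjusted = np.where(lat <= cut, lat, failover_cost_us)
        if adjusted.mean() < best_mean:
            best_mean, best_cut = adjusted.mean(), cut
    return best_cut

# e.g. mostly-fast latencies with a heavy Pareto tail
print(inflection_point(np.random.pareto(3.0, 100_000) * 80 + 60))
```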

Light NN model

The model infers I/O speed from three inputs: (1) the number of pending I/Os in 4KB pages, (2) the latencies of the 4 most recently completed I/Os, and (3) the number of pending I/Os at the moment each of those 4 I/Os arrived. The first feature is straightforward, as I/O latency typically correlates with the number of pending I/Os. The other two are needed because SSDs require historical information to reveal whether the device is busy internally. Features that did not improve accuracy, such as block offsets, read/write flags, and the history of writes, were removed; these findings simplify the model and reduce its overhead.
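
A minimal sketch of how such a feature vector could be assembled, assuming raw numeric features; the paper additionally encodes the values digit by digit before feeding the network, which is omitted here.

```python
from collections import deque

# The 4 most recently completed I/Os, newest last. Each entry is
# (completion latency in µs, queue length when that I/O arrived).
history = deque([(0, 0)] * 4, maxlen=4)

def record_completion(latency_us, queue_len_at_arrival):
    history.append((latency_us, queue_len_at_arrival))

def build_features(pending_4kb_pages):
    """Assemble the 1 + 4 + 4 raw feature values for one inference."""
    latencies = [lat for lat, _ in history]
    queue_lens = [q for _, q in history]
    return [pending_4kb_pages] + latencies + queue_lens
```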

The authors then chose an input format suited to feeding a fully-connected neural network with three layers: an input/preprocess layer, a hidden layer of 256 neurons with ReLU activation, and an output layer of two neurons with linear activation, followed by an argmax operator that converts the outputs into a binary decision. Earlier design iterations produced heavy models with high inference overhead before the authors switched to aggregate features. The resulting network is lightweight and balances accuracy against performance.
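
Continuing the sketch above, the described architecture maps naturally onto a few lines of TensorFlow. The input width of 9 matches the simplified raw features from the previous sketch and is an assumption; the paper's digit-encoded input layer is wider.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(9,)),                     # input/preprocess layer
    tf.keras.layers.Dense(256, activation="relu"),  # hidden layer, 256 neurons
    tf.keras.layers.Dense(2, activation="linear"),  # two output neurons
])

def infer_is_slow(features):
    logits = model(tf.constant([features], dtype=tf.float32))
    return int(tf.argmax(logits, axis=1)[0]) == 1   # argmax -> binary decision
```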

To improve the model's accuracy, the authors apply false-submit reduction through biased training, model recalibration by retracing and retraining, and inaccuracy masking with high-percentile hedging. Reducing false submits (false negatives) matters more, because a false submit leaves a request "stuck" in the device, whereas false revokes (false positives) are more tolerable. To account for potential shifts in workload and latency distributions, the inflection point is recomputed periodically and the model retrained if the shift is significant. To improve inference time, the authors optimize the 3-layer design by quantizing the floating-point weights and using hand-optimized code for the network computation.
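
A hedged sketch of biased training on top of the model above, using Keras class weights to penalize missed slow I/Os; the training data here is dummy, and the 3x weight is an assumption (the paper tunes this bias empirically).

```python
import numpy as np

# Dummy stand-ins for trace-derived training data; real labels come from the
# inflection-point analysis (0 = fast, 1 = slow).
train_features = np.random.rand(10_000, 9).astype("float32")
train_labels = (np.random.rand(10_000) < 0.1).astype("int32")

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# Weight the "slow" class more heavily so the model rarely labels a slow I/O
# as fast (a false submit), at the cost of more false revokes.
model.fit(train_features, train_labels, epochs=5, class_weight={0: 1.0, 1: 3.0})
```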

LinnOS delivers advantages along several dimensions. It brings predictable performance to flash arrays and to the storage applications running on them, and it automates the process by learning from millions of I/Os and producing neuron weights tailored to each workload and device. LinnOS requires no device-level modification and no heavy redesign of file systems or applications, and its fast inference and auto-revocation eliminate duplicate I/Os and simplify the process for the user.

Evaluation

The authors evaluate LinnOS's ability to improve flash array latencies in realistic scenarios. The evaluation uses traces from Microsoft Azure, Bing Index, Bing Select, and Cosmos servers, with a total of 6 devices per server type. For performance evaluation, two flash arrays were prepared, one with a consumer configuration and one with an enterprise configuration. A storage application replays the traces on the flash arrays, and the results are compared across 8 methods: baseline, cloning, constant-percentile hedging, inflection point hedging, a simple heuristic, an advanced heuristic, LinnOS by itself, and LinnOS with high-percentile hedging. Each experiment was repeated 3 times and showed no significant variance.

These results demonstrate LinnOS's success in achieving extreme latency predictability on flash storage arrays. LinnOS combined with high-percentile hedging (LinnOS+HL) consistently outperforms every other method, reducing average latencies by 9.6-79.6% compared with p95 hedging and by 10.7-71.2% compared with the advanced heuristic. LinnOS by itself (LinnOS-Raw) is already effective, reducing latency by 0.3-62.3% compared with p95 hedging. Hedging at p95 is a popular method but is generally slower than LinnOS+HL; hedging at the inflection point shows mixed results but beats p95 hedging for most workloads. The simple heuristic (HeurSim) performs poorly and is outperformed by all the other methods.

LinnOS’s inaccuracies are measured by counting false submits and false revokes. Live experiments can only observe false submits, so the full inaccuracy picture is measured offline using TensorFlow with real data. The results show that biased training lowers false-submit rates to 0.7-5.7% at the cost of raising false-revoke rates to 2.8-9.7%; however, only a small fraction of false revokes actually results in slow I/Os. Combining LinnOS with high-percentile hedging improves the results further.
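
One simple way such rates could be computed offline from predictions and ground-truth labels (an illustration, not the paper's exact methodology):

```python
import numpy as np

def error_rates(pred_slow, true_slow):
    """pred_slow/true_slow: boolean arrays over the traced I/Os.

    A false submit is predicted fast but actually slow (the costly error);
    a false revoke is predicted slow but actually fast (the tolerable one).
    """
    pred_slow = np.asarray(pred_slow, dtype=bool)
    true_slow = np.asarray(true_slow, dtype=bool)
    false_submit_rate = np.mean(~pred_slow & true_slow)
    false_revoke_rate = np.mean(pred_slow & ~true_slow)
    return false_submit_rate, false_revoke_rate
```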

Trade-offs between accuracy and inference overhead were evaluated across five model variants (A-E). Using fewer input features and a smaller model reduces overhead at the cost of accuracy, while more features and larger models improve accuracy at the cost of overhead. LinnOS was also evaluated on public traces and on MongoDB over different file systems, showing improved performance in both cases. The CPU overhead of LinnOS is low, at most 0.7% resource usage, and utilizing an additional CPU core reduces inference overhead by a further 36%.

To conclude, the paper presents LinnOS, a new operating system capable of inferring the speed of I/Os to flash storage. LinnOS is shown to outperform other methods and to bring predictability to unpredictable flash storage. Open questions about performance, machine-learning accuracy, precision, and further integrations and extensions remain for future work. The authors believe LinnOS could have a significant impact on other layers, both higher and lower in the stack, in the future.

KavitaMeena23 commented 1 year ago

+1

duttabhishek0 commented 1 year ago

@tapaswenipathak