sstadick / ripline

Fast by-line reader from ripgrep
The Unlicense

🌊 ripline

This is not the greatest line reader in the world, this is just a tribute.

Fast line-based iteration, almost entirely lifted from ripgrep's grep_searcher.

All credit to Andrew Gallant and the ripgrep contributors.

Why?

Not all of this functionality was exposed in the grep_searcher crate, and rightly so, as a lot of it had grep-specific configuration embedded in the logic (e.g. binary detection).

What have I changed?

Not much. I took out some of the ripgrep-specific logic, such as the binary detection and some search-related configs, and consolidated a few of the helper structs from the other grep_* crates.

Example

See examples for more.

use grep_cli::stdout;
use ripline::{
    line_buffer::{LineBufferBuilder, LineBufferReader},
    lines::LineIter,
    LineTerminator,
};
use std::{env, error::Error, fs::File, io::Write, path::PathBuf};
use termcolor::ColorChoice;

fn main() -> Result<(), Box<dyn Error>> {
    let path = PathBuf::from(env::args().nth(1).expect("Failed to provide input file"));

    let mut out = stdout(ColorChoice::Never);

    let reader = File::open(&path)?;
    let terminator = LineTerminator::byte(b'\n');
    let mut line_buffer = LineBufferBuilder::new().build();
    let mut lb_reader = LineBufferReader::new(reader, &mut line_buffer);

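    // Refill the buffer until fill() reports that there is no more data to read.
    while lb_reader.fill()? {
        // Iterate over the lines currently held in the buffer.
        let lines = LineIter::new(terminator.as_byte(), lb_reader.buffer());
        for line in lines {
            out.write_all(line)?;
        }
        // Mark the whole buffer as consumed so the next fill() brings in new data.
        lb_reader.consume_all();
    }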

    Ok(())
}
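To make the inner loop a little less magical, here is a std-only sketch of what iterating over the lines of a byte buffer looks like. It is not ripline's implementation. The function name lines_with_terminator is made up for illustration, and it assumes (as in grep_searcher) that each line keeps its terminator, which is why the example above can write lines straight to stdout without adding a newline.

```rust
/// Hypothetical helper, not part of ripline: split `buf` into lines,
/// keeping the terminator byte at the end of each line. A trailing
/// fragment with no terminator is yielded as-is.
fn lines_with_terminator<'a>(
    buf: &'a [u8],
    terminator: u8,
) -> impl Iterator<Item = &'a [u8]> + 'a {
    // `split_inclusive` keeps the matched byte as the last byte of each chunk.
    buf.split_inclusive(move |&b| b == terminator)
}
```

Calling lines_with_terminator(b"a\nbc\nd", b'\n') yields the three slices "a\n", "bc\n", and "d". The real LineIter is built on faster byte searching; this sketch only shows the shape of the output.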

Crude and untrustworthy benchmarks

From examples/ripline_benchmarks.rs. The initial benchmark script was taken from rust-linereader, which is also included in the benchmarks as LR::*.

The input used was all_train.csv, unzipped and catted together five times, creating a ~25 GB file.

Method                  Time    Lines/sec     Bandwidth
read()                  2.01s   17439155/s    12303.42 MB/s
LR::next_batch()        2.11s   16576174/s    11694.59 MB/s
LR::next_line()         2.65s   13196734/s     9310.37 MB/s
ripline_line_buffer()   2.64s   13277194/s     9367.14 MB/s
ripline_mmap()          2.16s   16183503/s    11417.55 MB/s
bstr_for_line()         2.47s   14174502/s    10000.19 MB/s
read_until()            2.86s   12230594/s     8628.75 MB/s
read_line()             4.16s    8417415/s     5938.53 MB/s
lines()                 5.05s    6930901/s     4889.79 MB/s
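For reference, the Lines/sec and Bandwidth columns are plain rates over the elapsed time. A rough sketch of that arithmetic is below. The function names are hypothetical, and it assumes 1 MB = 1,000,000 bytes; the benchmark script may well use 1,048,576 instead.

```rust
use std::time::Duration;

/// Hypothetical helper: lines processed per second.
fn lines_per_sec(lines: u64, elapsed: Duration) -> f64 {
    lines as f64 / elapsed.as_secs_f64()
}

/// Hypothetical helper: megabytes processed per second.
/// Assumption: 1 MB = 1_000_000 bytes.
fn megabytes_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    (bytes as f64 / 1_000_000.0) / elapsed.as_secs_f64()
}
```

In a real run the Duration would come from std::time::Instant::now() taken before the loop and .elapsed() taken after it.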

Note that read() and LR::next_batch() are not counting lines. Also, read_until() doesn't seem to perform as well in real-life scenarios as it does in this benchmark, and I'm not sure why.
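The last three rows presumably correspond to the standard library's BufRead methods. A minimal sketch of counting lines with each is below; the function names are made up here and this is not the benchmark code itself. The difference between them is standard library behavior: read_until() fills a reusable byte buffer with no UTF-8 check, read_line() validates UTF-8 into a reusable String, and lines() allocates a fresh String for every line.

```rust
use std::io::{self, BufRead};

/// Count lines with `read_until`, reusing one byte buffer (no UTF-8 check).
fn count_with_read_until<R: BufRead>(mut reader: R) -> io::Result<u64> {
    let mut buf = Vec::new();
    let mut count = 0;
    loop {
        buf.clear();
        // A return of 0 bytes means EOF.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break;
        }
        count += 1;
    }
    Ok(count)
}

/// Count lines with `read_line`, reusing one String (UTF-8 validated).
fn count_with_read_line<R: BufRead>(mut reader: R) -> io::Result<u64> {
    let mut line = String::new();
    let mut count = 0;
    loop {
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            break;
        }
        count += 1;
    }
    Ok(count)
}

/// Count lines with `lines`, which allocates a new String per line.
fn count_with_lines<R: BufRead>(reader: R) -> io::Result<u64> {
    let mut count = 0;
    for line in reader.lines() {
        line?; // propagate I/O or UTF-8 errors
        count += 1;
    }
    Ok(count)
}
```

All three accept anything that implements BufRead, so they can be tried on a byte slice such as &b"a\nbb\nccc"[..] or on a BufReader wrapped around a File.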

Hardware: Ubuntu 20, AMD Ryzen 9 3950X 16-core processor, 64 GB DDR4 memory, and a 1 TB NVMe drive