sstadick / gzp

Multi-threaded Compression
The Unlicense

Help with the bgzf reader? #28

Closed mrvollger closed 2 years ago

mrvollger commented 2 years ago

Hello,

I am newish to Rust and I have what is probably a very simple question: how do I read a bgzipped file line by line?

I have this code which is largely borrowed from crabz (see bottom), but once I have the BgzfSyncReader I am not sure how to iterate over it, or manipulate it in any way.

Thanks in advance! Mitchell

/// Get a buffered input reader from stdin or a file
fn get_input(path: Option<PathBuf>) -> Result<Box<dyn BufRead + Send + 'static>> {
    let reader: Box<dyn BufRead + Send + 'static> = match path {
        Some(path) => {
            if path.as_os_str() == "-" {
                Box::new(BufReader::with_capacity(BUFFER_SIZE, io::stdin()))
            } else {
                Box::new(BufReader::with_capacity(BUFFER_SIZE, File::open(path)?))
            }
        }
        None => Box::new(BufReader::with_capacity(BUFFER_SIZE, io::stdin())),
    };
    Ok(reader)
}

/// Example trying bgzip
/// ```
/// use rustybam::myio;
/// let f = ".test/asm_small.paf.bgz";
/// myio::test_bgz(f);
/// ```
pub fn test_bgz(filename: &str) {
    let ext = Path::new(filename).extension();
    eprintln!("{:?}", ext);
    let pathbuf = PathBuf::from(filename);
    let box_dny_bufread_send = get_input(Some(pathbuf)).unwrap();
    let gbzf_reader = BgzfSyncReader::new(box_dny_bufread_send);
    // How do I now loop over lines?
    for line in gbzf_reader {
        eprintln!(line);
    }
}
sstadick commented 2 years ago

Hi! Thanks for making an issue!

There are two pieces here:

  1. BgzfSyncReader implements the Read trait.
  2. The BufRead trait provides the line-reading methods (lines and read_line), and BufReader implements BufRead for any reader it wraps.

So you just need to wrap the BgzfSyncReader in a BufReader. That is a bit redundant, and I should really implement BufRead for BgzfSyncReader directly. There are two ways to iterate over lines:

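  1. Import std::io::BufRead to bring the line-reading methods into scope, then use one of the two approaches below:
use std::io::{BufRead, BufReader};
let mut reader = BufReader::new(bgzf_reader);

// Approach A: the lines method allocates a new String for each line.
// Note that lines() consumes the reader, so use either A or B, not both.
for line in reader.lines() {
    let line = line.expect("failed to read line");
    // do stuff with line
}

// Approach B: reuse a line buffer. This still has to copy bytes from the
// underlying reader into the buffer, but avoids a fresh allocation per line.
let mut buffer = String::new();
loop {
    // read_line returns io::Result<usize>; Ok(0) means end of input
    let bytes_read = reader.read_line(&mut buffer).expect("failed to read line");
    if bytes_read == 0 {
        break;
    }
    // do stuff with buffer (it still includes the trailing newline)
    buffer.clear();
}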
  2. (Nearly) zero-copy line iteration with ripline. This is much faster, but kind of a pain; see the example in the ripline readme. It only needs something that implements Read and manages its own internal buffer of lines.
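
The buffer-reuse pattern from option 1 can be tried without a bgzf file by writing it against any Read source. The following is only a sketch: `for_each_line` is a hypothetical helper name, not part of gzp, and it uses nothing beyond the standard library.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical helper (not part of gzp): call `f` on every line of any
/// `Read` source while reusing a single `String` buffer.
/// Returns the number of lines visited.
fn for_each_line<R: Read, F: FnMut(&str)>(source: R, mut f: F) -> io::Result<usize> {
    // BufReader supplies the BufRead methods that a plain Read lacks.
    let mut reader = BufReader::new(source);
    let mut buffer = String::new();
    let mut count = 0;
    loop {
        buffer.clear();
        // read_line appends to the buffer and returns Ok(0) at end of input.
        if reader.read_line(&mut buffer)? == 0 {
            break;
        }
        // Strip trailing newline and carriage-return characters before use.
        f(buffer.trim_end_matches(&['\n', '\r'][..]));
        count += 1;
    }
    Ok(count)
}
```

With gzp, the source passed in would be the BgzfSyncReader itself, for example `for_each_line(BgzfSyncReader::new(file), |line| eprintln!("{}", line))`.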

So, to fill out your example using the most basic read line method:

use std::{
    error::Error,
    fs::File,
    io::{self, BufRead, BufReader},
    path::{Path, PathBuf},
};

use gzp::BgzfSyncReader;
const BUFFER_SIZE: usize = 1024 * 64;
type DynResult<T> = Result<T, Box<dyn Error + 'static>>;

/// Get a buffered input reader from stdin or a file
fn get_input(path: Option<PathBuf>) -> DynResult<Box<dyn BufRead + Send + 'static>> {
    let reader: Box<dyn BufRead + Send + 'static> = match path {
        Some(path) => {
            if path.as_os_str() == "-" {
                Box::new(BufReader::with_capacity(BUFFER_SIZE, io::stdin()))
            } else {
                Box::new(BufReader::with_capacity(BUFFER_SIZE, File::open(path)?))
            }
        }
        None => Box::new(BufReader::with_capacity(BUFFER_SIZE, io::stdin())),
    };
    Ok(reader)
}

/// Example trying bgzip
/// ```
/// use rustybam::myio;
/// let f = ".test/asm_small.paf.bgz";
/// myio::test_bgz(f);
/// ```
pub fn test_bgz(filename: &str) {
    let ext = Path::new(filename).extension();
    eprintln!("{:?}", ext);
    let pathbuf = PathBuf::from(filename);
    let input = get_input(Some(pathbuf)).unwrap();
    // BgzfSyncReader only implements Read, so wrap it in a BufReader to get lines()
    let bgzf_reader = BufReader::new(BgzfSyncReader::new(input));

    for line in bgzf_reader.lines() {
        eprintln!("{}", line.unwrap());
    }
}

fn main() {
    println!("Hello, world!");
}
sstadick commented 2 years ago

To clarify, you have identified a real issue with gzp: iterating over lines requires double buffering, because BgzfSyncReader doesn't implement BufRead on its own even though it really could.

mrvollger commented 2 years ago

Thank you so much for this worked-out example, it is very helpful!!! One last question: is there an easy/standard way to test whether an input file is gzipped or bgzipped using gzp?

Thanks again for this awesome tool and the quick response.

sstadick commented 2 years ago

There is currently no way in gzp to inspect the first few bytes of a file to check whether it's compressed. In other applications I just do the simple thing: look at the incoming file extension, or require a CLI arg that indicates the input stream is compressed.

Example: https://github.com/sstadick/perbase/blob/500bd0c83b342d23cfd78dd58cb591d9e12a60fb/src/lib/utils.rs#L61

But this is something I intend to fix in the future!
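
Until then, the check can be done by hand on the first bytes of the input. The sketch below is not part of gzp; `Compression` and `sniff` are hypothetical names. It assumes the standard gzip magic bytes (0x1f 0x8b) and the BGZF convention from the SAM specification of a gzip extra subfield tagged "BC" with a 2-byte payload.

```rust
/// Hypothetical classification of an input stream (not a gzp type).
#[derive(Debug, PartialEq, Eq)]
enum Compression {
    Bgzf,
    Gzip,
    Unknown,
}

/// Hypothetical helper: classify a stream from its leading bytes.
/// Passing at least the first 18 bytes is enough for typical bgzip output.
fn sniff(header: &[u8]) -> Compression {
    // Every gzip member starts with the magic bytes 0x1f 0x8b.
    if header.len() < 2 || header[0] != 0x1f || header[1] != 0x8b {
        return Compression::Unknown;
    }
    // BGZF needs CM == 8 (deflate) and the FEXTRA flag (bit 2 of FLG);
    // the fixed header plus the 2-byte XLEN field occupies 12 bytes.
    if header.len() < 12 || header[2] != 8 || (header[3] & 0x04) == 0 {
        return Compression::Gzip;
    }
    let xlen = u16::from_le_bytes([header[10], header[11]]) as usize;
    let end = (12 + xlen).min(header.len());
    let mut extra = &header[12..end];
    // Walk the extra subfields: SI1, SI2, SLEN (little endian), payload.
    while extra.len() >= 4 {
        let slen = u16::from_le_bytes([extra[2], extra[3]]) as usize;
        if extra[0] == b'B' && extra[1] == b'C' && slen == 2 {
            return Compression::Bgzf;
        }
        if extra.len() < 4 + slen {
            break;
        }
        extra = &extra[4 + slen..];
    }
    Compression::Gzip
}
```

One way to use this without consuming input is to call `fill_buf()` on the BufReader returned by get_input, pass the returned slice to `sniff`, and then pick the matching decoder. A stream that reports `Gzip` here is plain gzip only in the sense that no "BC" subfield was found in the bytes provided.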

mrvollger commented 2 years ago

Got it, thanks so much for the help!