Adding to biofast? - Githubissues

jelber2 commented 2 years ago

Have you thought about adding rust-lapper to https://github.com/lh3/biofast?

sstadick commented 2 years ago

I have! However, rust-lapper (and all lapper implementations) do have a glaring worst case when there is an interval in the set that encloses all or many other intervals in the set. It's not strictly an interval tree. It's just a sorted list with binary search.

So, for the biofast purposes of evaluating programming languages I think rust-lappers algorithm is too different from the rest of the implementations to demonstrate that rust and not the algorithm are fast.

That said it would be interesting just to see the numbers / have a concrete thing to benchmark against... I'd be happy to host the results of rust-lapper vs the biofast repo here in the rust-lapper repo. PRs very welcome :)

sstadick commented 2 years ago

Also, if you are looking to get Rust into the biofast repo for interval trees, I believe rust-bio has implemented the same interval tree as the C impl in biofast: https://docs.rs/bio/0.38.0/bio/data_structures/interval_tree/struct.ArrayBackedIntervalTree.html

jelber2 commented 2 years ago

@sstadick thanks for the interesting and fast response! I might try incorporating a rust-lapper benchmark using the biofast bed file data sets (https://github.com/lh3/biofast/releases/tag/biofast-data-v1, but no promises). I consider myself a tool manipulating primate, rather than a tool making primate, so me actually trying to get rust into the biofast repo for using rust-bio's ArrayBackedIntervalTree might be a challenge, but I will try! It will be my first dive into rust development! I know that I certainly could not write rust-bio's ArrayBackedIntervalTree library.

sstadick commented 2 years ago

Absolutely! Happy to help if you run into any issues.

Here's an example of reading a bedfile into a Vec<Lapper<_>>: https://github.com/sstadick/perbase/blob/69e7b1044f55b451cdbfead20f3a2d728f98c0a0/src/lib/par_granges.rs#L268 where the vec index is tied to the the TID from a BAM header (but could totally be replaced with a hashmap as suggested in the rust-bio issue).

jelber2 commented 2 years ago

Ok, this will certainly take some time on my part, but I guess I keep this open for now.

jelber2 commented 2 years ago

Well, this is what I did so far with rust-bio, it is really more of a bedtools intersect and not bedtools coverage implementation. I am just leaving this here for now. Not sure I will have time to make it into a proper bedtools coverage program.


use std::convert::TryInto;
use std::path::PathBuf;
use bio::data_structures::interval_tree::IntervalTree;
use std::collections::HashMap;
use bio::io::bed;
use std::env;
use std::fs::File;

fn main() {
    // Get the first bed file's path
    // if missing, then provide error
    let path1: PathBuf = match env::args().nth(1) {
        Some(p1) => PathBuf::from(p1),
        _ => {
            eprintln!("Usage: {} <in1.bed> <in2.bed>", file!());
            std::process::exit(1)
        }
    };

    // Get the second bed file's path
    // if missing, then provide error 
    let path2: PathBuf = match env::args().nth(2) {
        Some(p2) => PathBuf::from(p2),
        _ => {
            eprintln!("Usage: {} <in1.bed> <in2.bed>", file!());
            std::process::exit(1)
        }
    };

    // Open the first bed file's path
    let file1 = match File::open(&path1) {
        Ok(fh1) => fh1,
        Err(err) => {
            eprintln!("Failed to open file: {}", err);
            std::process::exit(1)
        }
    };

    // Open the second bed file's path
    let file2 = match File::open(&path2) {
        Ok(fh2) => fh2,
        Err(err) => {
            eprintln!("Failed to open file: {}", err);
            std::process::exit(1)
        }
    };

    let mut reader1 = bed::Reader::new(file1);

    // make the HashMap
    let mut trees = HashMap::new();

    // Make a vector of contigs

    let mut contigs = Vec::new();

    // Get the unique contigs
    for record in reader1.records() {
        let rec = record.expect("Error reading record.");
//        println!("rec.start() = {}, rec.end() = {}, rec.chrom() = {}", rec.start(), rec.end(), rec.chrom().to_string());
        contigs.push(rec.chrom().to_string());
    }

    contigs.sort_by(|a, b| a.partial_cmp(b).expect("NaN in vector"));
    contigs.dedup();

    // Open the first bed file's path again
    // I guess this in needed but not extensively tested
    let file1 = match File::open(&path1) {
        Ok(fh1) => fh1,
        Err(err) => {
            eprintln!("Failed to open file: {}", err);
            std::process::exit(1)
        }
    };

    let mut reader1 = bed::Reader::new(file1);
    //read through each unique contig
    // and store the intervals in a vector
    for contig in contigs {
//        println!("contig = {}", contig);
        let mut currentcontig = Vec::new();
        for record1 in reader1.records() {
            let rec1 = record1.expect("Error reading record.");
//            println!("rec1.start() = {}, rec1.end() = {}, contig = {}", rec1.start(), rec1.end(), contig);
            if rec1.chrom().to_string() == contig {
                currentcontig.push(rec1.start()-1);
                currentcontig.push(rec1.end()+1);
//                println!("rec1.start() = {}, rec1.end() = {}, contig = {}", rec1.start(), rec1.end(), contig);
            }
        }
        // make a new IntervalTree
        let mut tree = IntervalTree::new();
        let mut i=0;
        // add each bed interval to the IntervalTree
        for s in currentcontig.windows(2) {
            i+=1;
            if i % 2 == 0 {
                // n is even
            }
            else {
                let [a, b]: [u64; 2] = s.try_into().unwrap();
//                println!("{}\t{}", a, b);
                tree.insert(a..b, contig.to_string());
            }
        }    
        trees.insert(contig.to_string(),tree);
    }

    let mut reader2 = bed::Reader::new(file2);

    // Read through second bed file entry by entry and then
    // and try to find overlaps

    for record2 in reader2.records() {
        let rec2 = record2.expect("Error reading record.");
        if let Some(tree) = trees.get(&rec2.chrom().to_string()) {
            for r in tree.find(rec2.start()..rec2.end()) {
                println!("{}\t{}\t{}", r.data(), r.interval().start+1, r.interval().end-1);
            }
        }
    }
}

sstadick / rust-lapper

Adding to biofast? #13