donaldcampbelljr opened this issue 3 days ago

jackh726 commented:
So, I just took a quick look over this, and I don't think anything unintentional is happening here. The way zooms are picked, calculated, and written is complicated, and the behavior differs depending on whether you are writing in a "single pass" (which is required with stdin) or over multiple passes (enabled by default when reading from a bed file).
First, let me say that all of the following should be able to be overridden by the recently added `manual_zoom_sizes` (`--zooms`) option. If you find that using that option does not result in those exact resolutions being written to the file, please let me know, because that's a bug.
Ultimately, the `nzooms` option (`-z`) sets the maximum number of zooms that may be written to a file (unless you're using the option above). The actual zoom resolutions that get written to the file may be fewer because (if you want the details):
The zoom resolutions are upper bounded by the actual data size of the zooms: if the size of the zoom section for a given resolution is greater than half the size of the full data section, that resolution is not written. Additionally, a zoom may be skipped if its index size has not decreased from the previous zooms (moving to coarser resolutions). (I'm not sure how often these actually come up, but it matches the behavior of the UCSC tools.)
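In sketch form (an illustration of the two rules just described, not the actual bigtools code — the names and types here are made up):

```rust
/// Illustration only, not bigtools' real implementation. `full_data_size`
/// is the size of the full data section; `zooms` pairs each candidate
/// resolution with its section size, ordered from finest to coarsest.
fn zooms_to_write(full_data_size: u64, zooms: &[(u32, u64)]) -> Vec<u32> {
    let mut kept = Vec::new();
    let mut prev_size = u64::MAX;
    for &(resolution, size) in zooms {
        // Rule 1: drop the zoom if its section is more than half the
        // size of the full data section.
        if size > full_data_size / 2 {
            continue;
        }
        // Rule 2: drop the zoom if its size has not decreased relative
        // to the previously kept (finer) zoom.
        if size >= prev_size {
            continue;
        }
        prev_size = size;
        kept.push(resolution);
    }
    kept
}
```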
Selecting the zoom resolutions works differently in single-pass and multi-pass modes.
In single-pass mode, initial resolutions are selected starting at the `initial_zoom_size` (default 160 bp; currently settable only via the API, not the CLI) and incremented by 4x successively up to the max zooms. As the input file is read, zoom data is calculated on the fly. Subject to the restrictions above, all zooms are written to the file (but, importantly, if any zooms are skipped because of the above constraints, there will be fewer than the max zooms, because no other resolutions were calculated).
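As an illustration of the resolution ladder this produces (a sketch, not the actual code):

```rust
// Sketch: single-pass candidate resolutions start at initial_zoom_size
// (160 bp by default) and multiply by 4 for each successive level, up to
// the requested max zooms.
fn single_pass_candidates(initial_zoom_size: u64, nzooms: usize) -> Vec<u64> {
    let mut candidates = Vec::with_capacity(nzooms);
    let mut resolution = initial_zoom_size;
    for _ in 0..nzooms {
        candidates.push(resolution);
        resolution *= 4;
    }
    candidates
}

fn main() {
    // With the defaults: 160, 640, 2560, 10240, ...
    println!("{:?}", single_pass_candidates(160, 6));
}
```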
In multi-pass mode, the initial resolutions are calculated from the data in the file. On the initial pass over the input file, two things are counted: 1) the average entry size, and 2) the expected zoom entries for resolutions starting at 10 bp and increasing by 4x up to the max integer size (probably a bit excessive). Then, the zoom resolutions to go back and write are calculated by finding the first zoom greater than four times the average entry size and with an estimated compressed data size less than half the full data. From there, up to the max zooms are selected and written.
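Again in sketch form (my reading of the selection step above; the names and the estimation inputs are assumptions, not the real implementation):

```rust
/// Sketch of multi-pass selection: `estimates` pairs each candidate
/// resolution (starting at 10 bp, increasing 4x, finest first) with its
/// estimated compressed zoom size. Skip until the first resolution that is
/// more than four times the average entry size and whose estimate is under
/// half the full data size, then take up to `nzooms` levels from there.
fn multi_pass_zooms(
    avg_entry_size: f64,
    estimates: &[(u64, u64)],
    full_data_size: u64,
    nzooms: usize,
) -> Vec<u64> {
    estimates
        .iter()
        .skip_while(|&&(resolution, estimate)| {
            (resolution as f64) <= 4.0 * avg_entry_size
                || estimate >= full_data_size / 2
        })
        .take(nzooms)
        .map(|&(resolution, _)| resolution)
        .collect()
}
```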
So, ultimately, I think the key bit here in single-pass mode is:

> importantly, if any zooms are skipped because of the above constraints, there will be fewer than the max zooms because no other resolutions were calculated
The way to fix this would be to just calculate more than the max zooms when processing the input file. I'm hesitant to do this, since the approach I've basically followed is that the zoom resolutions are essentially a best attempt at capturing summary stats of the data at a range of different levels, without compromising file size (which is essentially what the checks from the UCSC tools are for) or write speed. In multi-pass mode, we can get a pretty good guess at what resolutions capture the data across different scales, but in single-pass mode it's just a guess as to how "dense" the bed(Graph) file is. For bedGraph files with near-basepair resolution, the 160 bp first zoom resolution is a huge jump. But from looking at different files, as well as on the principle that the zoom levels are designed around providing data for visualization in the genome browser, it seemed better to be liberal with a bigger initial resolution than to do all the extra compute when it's not required.
I'm curious what your use case for the zooms is? I would caution against using them for data analysis directly. Interpreting data near or across zoom boundaries, for example, is a bit difficult and probably only good for estimates. If you do need exact zoom levels, I expect that using the `--zooms` option will work, since that should force the exact zooms you specify to be calculated and written.
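For example, something along these lines (illustrative only; the exact `--zooms` value syntax shown here is an assumption):

```shell
# Sketch: request these exact zoom resolutions (in bp). The value format
# for --zooms is assumed here; check `bedgraphtobigwig --help`.
bedgraphtobigwig input.bedGraph chrom.sizes output.bigWig --zooms 160,640,2560
```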
---

donaldcampbelljr (from the original issue):

Hey @jackh726,
I wanted to report an inconsistency I am encountering. I am currently attempting to stream values to the bigwig writer (where the reader input is bedGraph, via `BedParserStreamingIterator::from_bedgraph_file`) and am noticing that the zoom levels do not match what is given via the `BBIWriteArgs`. I noticed I can reproduce this on the CLI. For example, if I have a bedGraph file, running:
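```shell
# Illustrative command — the exact invocation from the original issue was
# not preserved in this copy; flags and argument order are assumptions.
bedgraphtobigwig input.bedGraph chrom.sizes output.bigWig -z 4
```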
works fine. I can confirm the zoom levels are appropriate by using `bigwiginfo --chroms`.
If I set the input to be "stdin", the zoom levels are not respected, so:
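```shell
# Illustrative, same assumptions as above — the same command, but with the
# input set to "stdin":
cat input.bedGraph | bedgraphtobigwig stdin chrom.sizes output.bigWig -z 4
```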
does not set the zoom level appropriately.
Do you happen to have any thoughts on this?
For clarity, here is my setup for creating the `BigWigWrite` object: