jackh726 / bigtools

A high-performance BigWig and BigBed library in Rust
MIT License
70 stars 7 forks source link

Add `--clip` argument to `bedgraphtobigwig` #58

Open nleroy917 opened 3 days ago

nleroy917 commented 3 days ago

Hey @jackh726!

As mentioned in #56, this PR adds a new option to the bedgraphtobigwig binary that will "skip" any chromosomes in the .bedGraph file that are not represented in the given chrom.sizes file. This functionality is inspired by the kent utils -clip option in wigToBigWig.

I appreciate you letting me take a stab at this. Currently... it's unfinished. It is not working properly, but I thought I'd open it and get your immediate feedback. I think I'm getting stuck in an infinite loop and will continue working.

Anyways, after getting my bearings, I was able to quickly add --clip to BBIWriteOptions. I had some troubles deciding how to implement the skip logic, however.

I believe that I was able to find the source of the error... The do_read that gets passed as the start_processing argument in process_to_bbi will return the ProcessDataError::InvalidChromosome error if the chromosome is not found in chrom.sizes. I had three ideas to go forward:

  1. Add a fourth argument called clip to the process_to_bbi function in the BBIDataSource trait. Then, inside match the error and skip if clip is true. I didn't like this since it felt too specific for the trait.
  2. Add a new error called ProcessDataError::ClippedChromosome. If clip is true, return this instead of the current ProcessDataError::ClippedChromosome. Likewise, this can be matched inside process_to_bbi and enable a "skip"-like feature. This felt weird since its not really an error.
  3. Add a boolean field to ProcessDataError::InvalidChromosome. Again, match on this error and that field being true, skip the chromosome.

I went with option three since it seemed the least hacky and most idiomatic/rusty way. I'm curious to your thoughts on this and it seems the real trick is just getting those BBIWriteOptions into process_to_bbi.

Will continue working on the infinite loop, so stay tuned. Thank you!

jackh726 commented 2 days ago

Hi @nleroy917, nice to see a PR :)

Of your three options, 1 definitely does seem like not the right option. 2 and 3 are similar, in that you're modifying BBIProcessError to capture the "don't process this chromosome" state. I would offer one other option: make StartProcessing be FnMut(String) -> Result<Option<P>, ProcessDataError>, where Ok(None) means "no error, but don't process this chromosome.

One important note: in the BedParserStreamingIterator impl, you will still have to read through the bed file past the skipped chromosome. The BedParserParallelStreamingIterator impl can just be continued since each chrom gets a separate File open+seek.

I have a couple other thoughts:

First, the naive thing to do would be just "pretend" everything is fine to process_to_bbi. This would really only require faking the length by setting to u32::MAX. Then, later when actually writing the data, skip the chromosomes. This has the benefit that proces_to_bbi stays agnostic to clipping. But, has a few major downsides. Particularly that a lot of extra work is done for no reason, and that the chrom will get an id assigned to it. However, it seems appealing because...

Second, it's not a big deal, but so far process_to_bbi hasn't actually cared about the chromosome length at all. With this option, it doesn't need to, since this check can (and should) just be done in process_val (btw, I'm noticing that bigbedwrite::process_val has the wrong check: current_val.start >= chrom_length instead of current_val.end >= chrom_length - can you fix that when you touch that bit of code?). But, it does split the handling of clip into two places. An alternative is to move those checks into process_val, but I don't like that option particularly much, since now implementations of BBIDataProcessor have to duplicate value validation. And, I think just having a way to tell process_to_bbi to skip a chromsome is fine if the actual reason behind that is agnostic.

Thoughts?