Open nleroy917 opened 3 days ago
Hi @nleroy917, nice to see a PR :)
Of your three options, 1 definitely does seem like not the right option. 2 and 3 are similar, in that you're modifying BBIProcessError
to capture the "don't process this chromosome" state. I would offer one other option: make StartProcessing
be FnMut(String) -> Result<Option<P>, ProcessDataError>
, where Ok(None)
means "no error, but don't process this chromosome.
One important note: in the BedParserStreamingIterator
impl, you will still have to read through the bed file past the skipped chromosome. The BedParserParallelStreamingIterator
impl can just be continue
d since each chrom gets a separate File open+seek.
I have a couple other thoughts:
First, the naive thing to do would be just "pretend" everything is fine to process_to_bbi
. This would really only require faking the length by setting to u32::MAX
. Then, later when actually writing the data, skip the chromosomes. This has the benefit that proces_to_bbi
stays agnostic to clipping. But, has a few major downsides. Particularly that a lot of extra work is done for no reason, and that the chrom will get an id assigned to it. However, it seems appealing because...
Second, it's not a big deal, but so far process_to_bbi
hasn't actually cared about the chromosome length at all. With this option, it doesn't need to, since this check can (and should) just be done in process_val
(btw, I'm noticing that bigbedwrite::process_val
has the wrong check: current_val.start >= chrom_length
instead of current_val.end >= chrom_length
- can you fix that when you touch that bit of code?). But, it does split the handling of clip into two places. An alternative is to move those checks into process_val
, but I don't like that option particularly much, since now implementations of BBIDataProcessor
have to duplicate value validation. And, I think just having a way to tell process_to_bbi
to skip a chromsome is fine if the actual reason behind that is agnostic.
Thoughts?
Hey @jackh726!
As mentioned in #56, this PR adds a new option to the
bedgraphtobigwig
binary that will "skip" any chromosomes in the.bedGraph
file that are not represented in the givenchrom.sizes
file. This functionality is inspired by the kent utils-clip
option inwigToBigWig
.I appreciate you letting me take a stab at this. Currently... it's unfinished. It is not working properly, but I thought I'd open it and get your immediate feedback. I think I'm getting stuck in an infinite loop and will continue working.
Anyways, after getting my bearings, I was able to quickly add
--clip
toBBIWriteOptions
. I had some troubles deciding how to implement the skip logic, however.I believe that I was able to find the source of the error... The
do_read
that gets passed as thestart_processing
argument inprocess_to_bbi
will return theProcessDataError::InvalidChromosome
error if the chromosome is not found inchrom.sizes
. I had three ideas to go forward:clip
to theprocess_to_bbi
function in theBBIDataSource
trait. Then, insidematch
the error and skip ifclip
is true. I didn't like this since it felt too specific for the trait.ProcessDataError::ClippedChromosome
. Ifclip
is true, return this instead of the currentProcessDataError::ClippedChromosome
. Likewise, this can be matched insideprocess_to_bbi
and enable a "skip"-like feature. This felt weird since its not really an error.ProcessDataError::InvalidChromosome
. Again,match
on this error and that field being true, skip the chromosome.I went with option three since it seemed the least hacky and most idiomatic/rusty way. I'm curious to your thoughts on this and it seems the real trick is just getting those
BBIWriteOptions
intoprocess_to_bbi
.Will continue working on the infinite loop, so stay tuned. Thank you!