quinlan-lab / bedder-rs

an API for intersections of genomic data
MIT License
74 stars 0 forks source link

how to `Write` or `Display` interval types #9

Open brentp opened 10 months ago

brentp commented 10 months ago

We could use std::fmt::Display but what if we have VCF input and want to write, e.g. BCF output.

Currently, we're using standard outputting, but should likely switch to writers. This will require tracking that, for example, we can have a BCF writer from a VCF query, or a compressed (BGZF) writer from a VCF, but not a BAM writer from a VCF query.

The user should be able to specify, e.g. --output-format BAM | CRAM | SAM or --output-format BCF | VCF along with a --compression-level = 0..9

For BED, perhaps it's --output-format BED12 | BED3 | ??? where perhaps ??? is a template that can be filled.

I need to think about how how to implement this and to look at changes to noodles in 0.61.0 that unify BAM/CRAM/SAM reading.

brentp commented 9 months ago

now, we get a Report struct here in the example.

Instead of handling all of the branching there (in the example code), we probably want to have something like:

writer = Writer::init(&args)
...
for intersection ... {
    let report = intersection.report(...)
    writer.write(report)
}

so what goes to init() instead of args? Could be:

use sniff::FileFormat;
impl Writer {
    pub fn init(in:FileFormat, out:FileFormat) -> Result<Writer, FormatCompatibilityError> { ... }
}

this can be a good start, but will need a way to pass all of the options, for example the option to count overlaps should be known by the writer. But this can be in an additional struct.

brentp commented 8 months ago

The above seems good enough to proceed to specify input format X to output format Y. But we also need to handle adding extra columns. For example, if the user asked for counts (-c in bedtools). In VCF/BCF, this would go into the INFO field with, for example counts=4,5,6 which also requires updating the header of the VCF.

For BAM, it can go to tags and for bed, it is appended as a column.

Who is responsible for keeping that information?

Is it:

let wtr = Writer::init(in_fmt, out_fmt, compression, args.count, args.count_bases, ...)

or

wtr.write(report, count, count_bases)

but then any possible additional count or column must be an argument to these. We can likely special-case those and also accept a function:

impl Writer {
    pub fn init(
        in_fmt: FileFormat,
        out_fmt: Option<FileFormat>,
        compression: Compression,
        column_fn: fn(Option<&report>) -> String
)

with the convention that when the argument is None, the function returns the name (for VCF), e.g. count so that it can be added to the header. But we also need to know the type for the VCF header... So then we can instead accept the trait:

pub trait ColumnReporter {
   /// report the name, e.g. `count` for the INFO field of the VCF 
   fn name(&self) -> String
   /// report the type, for the INFO field of the VCF
   fn type(&self) -> Type // Type is some enum from noodles or here that limits to relevant types
   fn description(&self) -> String
   fn number(&self) -> Number // Number is some enum ...

  fn value(&report) -> Value // Value probably something from noodles that encapsulates Float/Int/Vec<Float>/String/...
}

This makes it more work to implement, but I think it's worth it and actually not too onerous. With this, Writer.write can be:

impl Writer {
    ...
    pub fn write(r: &Report, Vec<dyn ColumnReporter>) {...}
}
brentp commented 7 months ago

This is started in 9e6f440

It's necessary to have a way to link each ColumnReporter which is responsible for the type and name metadata with the files in the Report. Report is Vec<ReportFragment> and that is:

pub struct ReportFragment {
    pub a: Option<Position>,
    pub b: Vec<Position>,
    pub id: usize,
}

so we might want

    pub fn write(&self, report: &Report, crs: Vec<Vec<Box<dyn ColumnReporter>>>) {

where the outer vec is has length of the number of b files and each of those has a length equal to the number of columns from that file. This can get verbose, when we want a simple thing like the 3 bed columns, then we need:

struct BedReporter {
   name_: String,
   idx: usize,
   type: Type,
   desc: String,
  }

impl ColumnReporter for BedReporter {
    fn name(&self) { self.name }
    fn ftype(&self) { self.type }
    ...
   fn value(&self,  r: &ReportFragment) -> Result<Value, ColumnError> {
      r.b.map(|p| {
        match p {
            Position::Bed(br) => match self.idx {
                  0 => p.chrom(),
                   1 => p.start(),
                   2 => p.stop()
            },
            _ => Error
        }
      })
  }
}

This is very flexible, but would get quite complicated just to extract a single column

brentp commented 7 months ago

I am exploring using lua snippets again as this seems flexible and user-friendly enough. The cost is performance, but I think that's not the main concern.