Open brentp opened 10 months ago
now, we get a Report
struct here in the example.
Instead of handling all of the branching there (in the example code), we probably want to have something like:
writer = Writer::init(&args)
...
for intersection ... {
let report = intersection.report(...)
writer.write(report)
}
so what goes to init()
instead of args
? Could be:
use sniff::FileFormat;
impl Writer {
pub fn init(in:FileFormat, out:FileFormat) -> Result<Writer, FormatCompatibilityError> { ... }
}
this can be a good start, but will need a way to pass all of the options, for example the option to count overlaps should be known by the writer. But this can be in an additional struct.
The above seems good enough to proceed to specify input format X to output format Y.
But we also need to handle adding extra columns. For example, if the user asked for counts (-c
in bedtools). In VCF/BCF, this would go into the INFO field with, for example counts=4,5,6 which also requires updating the header of the VCF.
For BAM, it can go to tags and for bed, it is appended as a column.
Who is responsible for keeping that information?
Is it:
let wtr = Writer::init(in_fmt, out_fmt, compression, args.count, args.count_bases, ...)
or
wtr.write(report, count, count_bases)
but then any possible additional count or column must be an argument to these. We can likely special-case those and also accept a function:
impl Writer {
pub fn init(
in_fmt: FileFormat,
out_fmt: Option<FileFormat>,
compression: Compression,
column_fn: fn(Option<&report>) -> String
)
with the convention that when the argument is None
, the function returns the name (for VCF), e.g. count
so that it can be added to the header. But we also need to know the type for the VCF header...
So then we can instead accept the trait:
pub trait ColumnReporter {
/// report the name, e.g. `count` for the INFO field of the VCF
fn name(&self) -> String
/// report the type, for the INFO field of the VCF
fn type(&self) -> Type // Type is some enum from noodles or here that limits to relevant types
fn description(&self) -> String
fn number(&self) -> Number // Number is some enum ...
fn value(&report) -> Value // Value probably something from noodles that encapsulates Float/Int/Vec<Float>/String/...
}
This makes it more work to implement, but I think it's worth it and actually not too onerous.
With this, Writer.write
can be:
impl Writer {
...
pub fn write(r: &Report, Vec<dyn ColumnReporter>) {...}
}
This is started in 9e6f440
It's necessary to have a way to link each ColumnReporter
which is responsible for the type and name metadata with the files in the Report
. Report
is Vec<ReportFragment>
and that is:
pub struct ReportFragment {
pub a: Option<Position>,
pub b: Vec<Position>,
pub id: usize,
}
so we might want
pub fn write(&self, report: &Report, crs: Vec<Vec<Box<dyn ColumnReporter>>>) {
where the outer vec is has length of the number of b
files and each of those has a length equal to the number of columns from that file.
This can get verbose, when we want a simple thing like the 3 bed columns, then we need:
struct BedReporter {
name_: String,
idx: usize,
type: Type,
desc: String,
}
impl ColumnReporter for BedReporter {
fn name(&self) { self.name }
fn ftype(&self) { self.type }
...
fn value(&self, r: &ReportFragment) -> Result<Value, ColumnError> {
r.b.map(|p| {
match p {
Position::Bed(br) => match self.idx {
0 => p.chrom(),
1 => p.start(),
2 => p.stop()
},
_ => Error
}
})
}
}
This is very flexible, but would get quite complicated just to extract a single column
I am exploring using lua snippets again as this seems flexible and user-friendly enough. The cost is performance, but I think that's not the main concern.
We could use
std::fmt::Display
but what if we have VCF input and want to write, e.g. BCF output.Currently, we're using standard outputting, but should likely switch to writers. This will require tracking that, for example, we can have a BCF writer from a VCF query, or a compressed (BGZF) writer from a VCF, but not a BAM writer from a VCF query.
The user should be able to specify, e.g.
--output-format BAM | CRAM | SAM
or--output-format BCF | VCF
along with a--compression-level = 0..9
For BED, perhaps it's
--output-format BED12 | BED3 | ???
where perhaps ??? is a template that can be filled.I need to think about how how to implement this and to look at changes to noodles in 0.61.0 that unify BAM/CRAM/SAM reading.