Open grst opened 11 months ago
Hi @grst @const-ae @emdann, is there a consensus regarding what would be most convenient? I'm assuming we want to use formulaic?
I won't have the bandwidth to implement this feature on my own in the next few weeks, but if anyone wants to give it a try, I'm happy to help them.
I don't even think you'd need to deal with formulaic/patsy in PyDESeq2, at least initially. Either tool generates a design matrix (which advanced users could also create manually) which should be the input for PyDESeq2.
I agree with Gregor that the easiest change might be to simply allow some way to provide a design matrix and then just skip the build_design_matrix step at https://github.com/owkin/PyDESeq2/blob/main/pydeseq2/dds.py#L249. Of course, in the longer term I think it would be great to save the user from converting data + formula into a design matrix and do it internally, but in the end it's just syntactic sugar :)
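To make the "create it manually" option concrete, here is a minimal sketch of treatment-coding a single categorical factor by hand, using only the standard library. The function name `build_manual_design_matrix` and the column names are hypothetical and chosen for illustration; this is not what PyDESeq2, formulaic, or patsy do internally.

```python
def build_manual_design_matrix(samples, factor, reference):
    """Treatment-code one categorical factor: an intercept plus one dummy column per non-reference level.

    Hypothetical helper for illustration only; column names are an assumption.
    """
    levels = sorted({sample[factor] for sample in samples})
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found in factor {factor!r}")
    other_levels = [level for level in levels if level != reference]
    columns = ["intercept"] + [f"{factor}_{level}" for level in other_levels]
    rows = [
        [1] + [1 if sample[factor] == level else 0 for level in other_levels]
        for sample in samples
    ]
    return columns, rows


# Made-up sample metadata: four samples, one factor with three levels.
samples = [
    {"condition": "A"},
    {"condition": "B"},
    {"condition": "C"},
    {"condition": "B"},
]
columns, rows = build_manual_design_matrix(samples, "condition", reference="A")
for row in rows:
    print(dict(zip(columns, row)))
```

Each row has one entry per coefficient to be fitted, which is all a model needs once the formula step is skipped.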
PR #181 implements the ability to pass a design matrix directly; however, for now the matrix needs to follow pydeseq2 naming conventions for further preprocessing, namely the _vs_ syntax.
Don't hesitate to play with the branch and give feedback on its limitations.
for now it needs to follow pydeseq2 naming conventions for further preprocessing namely the _vs_ syntax
does that mean if it doesn't follow the naming conventions it doesn't work at all, or would I just have to specify contrasts manually?
In this PR, the design_factors are extracted from a user-given design_matrix on the assumption that interactions follow pydeseq2 naming conventions. This processing is fairly straightforward and can be inspected in pydeseq2/utils.py, at the end of the process_design_factors function (the branch where design_matrix is not None).
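As a rough illustration of what recovering factors from column names can look like, here is a sketch that parses main-effect columns of the assumed form `factor_level_vs_reference`. It is not the code from the PR: the helper names are hypothetical, interaction columns are not handled, and it assumes that level names contain no underscores.

```python
def parse_vs_column(name):
    """Split 'factor_level_vs_reference' into (factor, level, reference).

    Assumption: level names contain no underscores, so the last '_' before
    '_vs_' separates the factor name from the level.
    """
    head, sep, reference = name.rpartition("_vs_")
    if not sep:
        raise ValueError(f"column {name!r} does not contain '_vs_'")
    factor, sep, level = head.rpartition("_")
    if not sep or not factor:
        raise ValueError(f"cannot split {head!r} into factor and level")
    return factor, level, reference


def design_factors_from_columns(columns):
    """Collect factor names in order of first appearance, skipping the intercept."""
    factors = []
    for name in columns:
        if name.lower() == "intercept":
            continue
        factor, _level, _reference = parse_vs_column(name)
        if factor not in factors:
            factors.append(factor)
    return factors


# Made-up column names for illustration.
columns = ["intercept", "condition_B_vs_A", "condition_C_vs_A", "batch_2_vs_1"]
factors = design_factors_from_columns(columns)
print(factors)
```

A column that does not match the pattern raises a `ValueError` in this sketch, which is one possible answer to the question above; falling back to manually specified contrasts would be another.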
We realize that the current situation is not optimal and are actively trying to find the best trade-off between how much of the deseq2/formulaic functionality we support and merging this PR "quickly" (sorry that it has already taken so long), given the very limited bandwidth we currently have.
Is your feature request related to a problem? Please describe.
Most linear models support passing designs as design matrices and contrasts as contrast vectors. This is the "smallest common denominator" for specifying designs and it's useful
[column, baseline, treatment] triplet

Describe the solution you'd like
DeSeqDataset should take a design matrix
DeseqStats should take a contrast vector with one value per fitted coefficient, such as [0, -1, 1].

Additional context
Discussed at the scverse hackathon in Cambridge
CC @const-ae @emdann
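For readers unfamiliar with contrast vectors, the requested convention can be sketched in a few lines: the contrast holds one weight per fitted coefficient, and the tested quantity is their weighted sum. The coefficient names and values below are made up for illustration, and `apply_contrast` is a hypothetical helper, not part of PyDESeq2.

```python
def apply_contrast(coefficients, contrast):
    """Return the linear combination sum(contrast[i] * coefficients[i])."""
    if len(coefficients) != len(contrast):
        raise ValueError("contrast needs exactly one value per fitted coefficient")
    return sum(weight * beta for weight, beta in zip(contrast, coefficients))


# Hypothetical fitted coefficients (log2 scale) for a single gene.
coef_names = ["intercept", "condition_B", "condition_C"]
coefficients = [5.0, 1.5, 0.5]

# [0, -1, 1] compares condition C against condition B, ignoring the intercept.
contrast = [0, -1, 1]
lfc_c_vs_b = apply_contrast(coefficients, contrast)
print(lfc_c_vs_b)
```

With these made-up numbers the result is the difference between the C and B coefficients, a comparison in which neither level is the reference level of the design.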