I would like to propose support for sample-/gene-dependent normalization factors to enable the use length offset matrices from pytximport, a full-featured Python implementation of tximport, similar to how DESeq2 supports length matrices from tximport. With this issue, I would like to see if there is interest in this addition and discuss how best to implement it given the current design of PyDESeq2.
Context
DESeq2 implements the DESeqDataSetFromTximport function, that takes a tximport object and creates a DESeqDataSet from it that utilises the abundance-weighted average gene length across isoforms for each sample to calculate normalisation factors. The rationale is that differential isoform usage may lead to different average gene lengths between samples and introduce a bias that needs to be corrected for. See also: https://github.com/thelovelab/DESeq2/blob/6e1a3a56f7f6315e4a5ff352a203af26e778f73a/vignettes/DESeq2.Rmd#L2181
DESeq2 uses this sample-specific average gene length when DESeqDataSet.estimateSizeFactors is called. In short, the sample-specific average gene length from tximport is divided by the gene-wise geometric mean to create a normalization matrix which is then passed together with the counts to estimateNormFactors, where the counts divided by the normalization matrix are passed to estimateSizeFactorsForMatrix to calculate size factors. Then the normalization factors are calculated: norm_factors = (normMatrix.T*size_factors).T / geometric_mean((normMatrix.T*size_factors).T). These normalization factors are then used instead of the size factors for various downstream tasks.
With the current design of PyDESeq2, it might not be necessary to implement a standalone function like DESeqDataSetFromTximport, since that function only adds a key to the DDS. When using the PyDESeq2 DDS with an AnnData object created by pytximport, the "length" observation matrix and the unstructured key “counts_from_abundance” will already be present, eliminating the need for a DESeqDataSetFromTximport function. A check whether a length offset matrix can be used is as simple as:
if "length" in self.obsm_keys() and "counts_from_abundance" in self.uns_keys() and self.uns["counts_from_abundance"] is None:
# do something
However, the implementation of the size factor fitting and count normalization currently differs significantly between the R and Python version of DESeq2, making it non-trivial to implement support for normalization factors. As far as I understand the DESeq2 and PyDESeq2 code, the following steps would be necessary to add support for normalization factors based on length offsets from pytximport.
When fit_typeis ratio, DeSeqDataset.fit_size_factors would have to detect the presence of a precomputed normalization matrix or of a length matrix that could be transformed into a normalization matrix.
Then, like with DESeq2, there would be two options:
No normalization/length matrix => call the equivalent to estimateSizeFactorsForMatrix (deseq2_norm_fit & deseq2_norm_transform in the case of PyDESeq2, current implementation)
Normalization matrix present => implement an equivalent of estimateNormFactors, for example as part of DeSeqDataset.fit_size_factors
Pseudocode implementation:
def fit_size_factors(..., norm_matrix=None,) -> None:
# ...
if "length" in self.obsm_keys() and "counts_from_abundance" in self.uns_keys() and self.uns["counts_from_abundance"] is None:
norm_matrix = self.obsm["length"] / row_wise_geometric_mean(self.obsm["length"])
if norm_matrix:
counts = self.X / norm_matrix
log_means, filtered_genes = deseq2_norm_fit(counts)
_, size_factors = deseq2_norm_transform(counts, self.logmeans)
normalization_factors = (norm_matrix.T*size_factors).T
normalization_factors = normalization_factors / row_wise_geometric_mean(normalization_factors)
else:
# no change
Then, support of the normalization factors in the downstream methods would have to be implemented. DESeq2 uses them as part of the negative binominal GLM fitting used with the Wald test, likelihood ratio test and LFC shrinking. See:
Implementing this would therefore affect several parts of the PyDESeq2 codebase, so I am opening this issue to discuss the need and best way to implement these changes first. In my opinion, support for normalization factors would be an important addition to PyDESeq2, not only to support correction for differential isoform usage based on the output of pytximport, but also to allow corrections for differences in gene length for multi-laboratory or multi-timepoint datasets. I am happy to contribute code to implement this feature and look forward to your feedback.
As the maintainer of pytximport, I can also change naming conventions or add further metadata to the output of pytximport AnnData objects to facilitate integration with PyDESeq2, if you think that relying on the presence of the counts_from_abundance key in the unstructured part of the AnnData object is not sufficient.
Request
I would like to propose support for sample-/gene-dependent normalization factors to enable the use length offset matrices from pytximport, a full-featured Python implementation of tximport, similar to how DESeq2 supports length matrices from tximport. With this issue, I would like to see if there is interest in this addition and discuss how best to implement it given the current design of PyDESeq2.
Context
DESeq2 implements the
DESeqDataSetFromTximport
function, that takes a tximport object and creates a DESeqDataSet from it that utilises the abundance-weighted average gene length across isoforms for each sample to calculate normalisation factors. The rationale is that differential isoform usage may lead to different average gene lengths between samples and introduce a bias that needs to be corrected for. See also: https://github.com/thelovelab/DESeq2/blob/6e1a3a56f7f6315e4a5ff352a203af26e778f73a/vignettes/DESeq2.Rmd#L2181DESeq2 uses this sample-specific average gene length when
DESeqDataSet.estimateSizeFactors
is called. In short, the sample-specific average gene length from tximport is divided by the gene-wise geometric mean to create a normalization matrix which is then passed together with the counts toestimateNormFactors
, where the counts divided by the normalization matrix are passed toestimateSizeFactorsForMatrix
to calculate size factors. Then the normalization factors are calculated:norm_factors = (normMatrix.T*size_factors).T / geometric_mean((normMatrix.T*size_factors).T)
. These normalization factors are then used instead of the size factors for various downstream tasks.Relevant parts of the DESeq2 code: https://github.com/thelovelab/DESeq2/blob/6e1a3a56f7f6315e4a5ff352a203af26e778f73a/R/AllClasses.R#L408 https://github.com/thelovelab/DESeq2/blob/6e1a3a56f7f6315e4a5ff352a203af26e778f73a/R/methods.R#L383 https://github.com/thelovelab/DESeq2/blob/6e1a3a56f7f6315e4a5ff352a203af26e778f73a/R/core.R#L2185 https://github.com/thelovelab/DESeq2/blob/6e1a3a56f7f6315e4a5ff352a203af26e778f73a/R/core.R#L542
Challenges for implementation
With the current design of PyDESeq2, it might not be necessary to implement a standalone function like
DESeqDataSetFromTximport
, since that function only adds a key to the DDS. When using the PyDESeq2 DDS with an AnnData object created by pytximport, the "length" observation matrix and the unstructured key “counts_from_abundance” will already be present, eliminating the need for a DESeqDataSetFromTximport function. A check whether a length offset matrix can be used is as simple as:However, the implementation of the size factor fitting and count normalization currently differs significantly between the R and Python version of DESeq2, making it non-trivial to implement support for normalization factors. As far as I understand the DESeq2 and PyDESeq2 code, the following steps would be necessary to add support for normalization factors based on length offsets from pytximport.
When
fit_type
isratio
,DeSeqDataset.fit_size_factors
would have to detect the presence of a precomputed normalization matrix or of a length matrix that could be transformed into a normalization matrix. Then, like with DESeq2, there would be two options:estimateSizeFactorsForMatrix
(deseq2_norm_fit
&deseq2_norm_transform
in the case of PyDESeq2, current implementation)estimateNormFactors
, for example as part ofDeSeqDataset.fit_size_factors
Pseudocode implementation:
Then, support of the normalization factors in the downstream methods would have to be implemented.
DESeq2
uses them as part of the negative binominal GLM fitting used with the Wald test, likelihood ratio test and LFC shrinking. See:Implementing this would therefore affect several parts of the PyDESeq2 codebase, so I am opening this issue to discuss the need and best way to implement these changes first. In my opinion, support for normalization factors would be an important addition to PyDESeq2, not only to support correction for differential isoform usage based on the output of pytximport, but also to allow corrections for differences in gene length for multi-laboratory or multi-timepoint datasets. I am happy to contribute code to implement this feature and look forward to your feedback.
As the maintainer of pytximport, I can also change naming conventions or add further metadata to the output of pytximport AnnData objects to facilitate integration with PyDESeq2, if you think that relying on the presence of the
counts_from_abundance
key in the unstructured part of the AnnData object is not sufficient.Best, Malte