theislab / diffxpy

Differential expression analysis for single-cell RNA-seq data.
https://diffxpy.rtfd.io
BSD 3-Clause "New" or "Revised" License

Unusual Log2fc Values from Tests #167

Closed mattmajo closed 4 years ago

mattmajo commented 4 years ago

Hi there, I have been trying to run diffxpy on my raw dataset:

> AnnData object with n_obs × n_vars = 76994 × 33694 
     obs: 'n_counts', 'n_genes', 'mito', 'doublet_scores', 'predicted_doublets', 'name', 'Sample', 'donor', 'organ', 'sort', 'method', 'file', 'is_TRA_p', 'is_TRB_p', 'is_TRA_np', 'is_TRB_np', 'Age', 'Source', 'cell types', 'source', 'birth', 'batch', 'bbk', 'donor_method'
     var: 'GeneName', 'GeneID'

When I run any differential expression test on a condition with 2 factors, I get a strange result.

> import diffxpy.api as de
> test = de.test.t_test(adata, grouping='cell types')
> print(test.summary().iloc[:,:])

|      | gene    | pval         | qval         | log2fc       | mean     |
|------|---------|--------------|--------------|--------------|----------|
| 381  | SIM1    | 3.173105e-01 | 4.275500e-01 | -1061.023794 | 0.000107 |
| 1704 | POU5F1B | 3.173105e-01 | 4.275500e-01 | -1061.023794 | 0.000107 |
| 869  | DDX4    | 3.173105e-01 | 4.275500e-01 | -1061.023794 | 0.000107 |
| 471  | ESRRB   | 3.173105e-01 | 4.275500e-01 | -1061.023794 | 0.000107 |
| 359  | MYF5    | 3.173105e-01 | 4.275500e-01 | -1061.023794 | 0.000107 |
| 955  | FGF19   | 2.403784e-01 | 3.633407e-01 | 4.998926     | 0.000644 |
| 1478 | LUZP2   | 3.397947e-19 | 7.599045e-18 | -4.806819    | 0.019317 |
| 977  | FEV     | 1.636333e-03 | 5.858140e-03 | 4.764461     | 0.002254 |
| 878  | FEZF2   | 3.369594e-01 | 4.494532e-01 | 4.676998     | 0.000537 |
| 695  | PAX3    | 1.166546e-02 | 3.124094e-02 | 4.484353     | 0.000966 |
| 1013 | SOX17   | 1.262338e-01 | 2.227199e-01 | 4.221319     | 0.005044 |
| 351  | DBX1    | 1.949431e-01 | 3.079743e-01 | 3.676998     | 0.000322 |

I get absurdly large log2fc values for some genes, and then the values drop off to more reasonable numbers. Is there an issue in the way things are running? All of the log2fc values around |1060| don't seem quite right.

Thanks!

davidsebfischer commented 4 years ago

Hi @mattmajo! These genes with an LFC of -1061.023794 have a mean of 0.000107. Could it be that they are entirely zero in one condition and have one or two non-zero values in the other? If they are entirely zero in one condition, the LFC is technically infinite, and LFC as such is no longer a very helpful metric. That was the reasoning for including the gene average in this summary table, so that the LFC can be read in context.
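To illustrate the point about all-zero groups, here is a minimal sketch in plain NumPy. It assumes that a zero group mean is floored to the smallest positive float before taking the log; that flooring is an assumption made for illustration, not a confirmed description of what diffxpy does internally. The helper `log2fc_floored` and the toy arrays are hypothetical.

```python
import numpy as np

def log2fc_floored(reference, other):
    """log2(mean(other) / mean(reference)), with zero means floored.

    Flooring at the smallest positive float is an assumption made for
    illustration, not a confirmed description of diffxpy internals.
    """
    tiny = np.nextafter(0.0, 1.0)  # smallest positive float64 (subnormal)
    mean_ref = max(float(np.mean(reference)), tiny)
    mean_other = max(float(np.mean(other)), tiny)
    return float(np.log2(mean_other) - np.log2(mean_ref))

# Toy gene: a single count among 8192 cells in one group,
# and no counts at all in the other group.
group_a = np.zeros(8192)
group_a[0] = 1.0
group_b = np.zeros(1000)

lfc = log2fc_floored(group_a, group_b)
overall_mean = float(np.mean(np.concatenate([group_a, group_b])))
print(lfc, overall_mean)
```

Under this assumption the fold change lands on the order of -1000 while the overall mean stays tiny, which is the same pattern as the rows in question. The magnitude is set by the floor value rather than by the data, so it says nothing about effect size.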

mattmajo commented 4 years ago

Thank you @davidsebfischer - is the mean in this case the mean over the entire data set and not dependent on conditions?

davidsebfischer commented 4 years ago

> Thank you @davidsebfischer - is the mean in this case the mean over the entire data set and not dependent on conditions?

Exactly, the plain average of the input data!
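Since the reported mean is the average over all cells, one way to check whether a gene is entirely zero in one condition is to compute per-group means and non-zero counts directly. The sketch below uses plain NumPy on a dense toy matrix; `group_expression_summary` is a hypothetical helper, not part of diffxpy. For a real AnnData object, `adata.X` may be sparse and would likely need converting to a dense array first.

```python
import numpy as np

def group_expression_summary(counts, labels):
    """Per-group mean and non-zero cell count per gene, plus the overall mean.

    counts: (n_cells, n_genes) dense array
    labels: (n_cells,) array of group labels
    """
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    # Overall mean ignores the grouping, like the 'mean' column discussed above.
    summary = {"overall_mean": counts.mean(axis=0)}
    for group in np.unique(labels):
        rows = counts[labels == group]
        summary[str(group)] = {
            "mean": rows.mean(axis=0),
            "n_nonzero": (rows > 0).sum(axis=0),
        }
    return summary

# Toy data: 4 cells x 2 genes; gene 0 is expressed only in group "A".
toy_counts = np.array([[0, 3], [1, 2], [0, 4], [0, 5]])
toy_labels = np.array(["A", "A", "B", "B"])
result = group_expression_summary(toy_counts, toy_labels)
print(result["A"]["n_nonzero"], result["B"]["n_nonzero"])
```

A gene with `n_nonzero` equal to 0 in one group is one where the log fold change is not a meaningful number, whatever value is reported.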