theislab / diffxpy

Differential expression analysis for single-cell RNA-seq data.
https://diffxpy.rtfd.io
BSD 3-Clause "New" or "Revised" License
191 stars 23 forks source link

Question regarding log2fc returned by differential expression Wald testing #198

Open YousufYassouf opened 3 years ago

YousufYassouf commented 3 years ago

Hi, I was trying to find DEGs and their fold change between two conditions (Conditions =['1Con', '2Con']) and to regress out two covariates (SampleType =['Frozen', 'Fresh'], n_genes).

I have two questions:

  1. How to interpret the resulting log2fc, is it 1Con/2Con or the 2Con/1Con?
  2. 'caught 141 linalg singular matrix errors' appeared in the output, but I still got the summary table, is it correct or should I do something about it?

Any help would be appreciated!

I used this script to perform the analysis:

test = de.test.wald(data=fadata,
formula_loc="~1+Conditions+SampleType+SampleType:Conditions+n_genes+n_genes:Conditions",
                    factor_loc_totest="Conditions", as_numeric=['n_genes'],
                    sample_description=fadata.obs, noise_model="nb", training_strategy="DEFAULT", dtype="float64")
test.summary().to_csv(sc.settings.figdir /'de.test.wald.summary.tsv', sep="\t", index=False)

I got this output:

Diffxpy de.test.wald testing
training location model: True
training scale model: True
iter   0: ll=8489257375.931549
iter   1: ll=8486685806.088095, converged: 0.00% (loc: 4.67%, scale update: False), in 26.07sec
caught 141 linalg singular matrix errors
iter   2: ll=8486679830.490733, converged: 0.00% (loc: 98.58%, scale update: False), in 24.84sec
iter   3: ll=8486678948.803598, converged: 0.00% (loc: 98.58%, scale update: False), in 4.52sec
iter   4: ll=8486678815.294401, converged: 0.00% (loc: 98.65%, scale update: False), in 4.52sec
iter   5: ll=8486678785.821270, converged: 0.00% (loc: 98.88%, scale update: False), in 4.45sec

Fitting 24506 dispersion models: (progress not available with multiprocessing)
iter   6: ll=70569325.724172, converged: 3.62% (loc: 3.62%, scale update: True), in 1788.10sec
caught 123 linalg singular matrix errors
iter   7: ll=70296497.765427, converged: 3.62% (loc: 35.44%, scale update: False), in 23.67sec
iter   8: ll=70260749.227150, converged: 3.62% (loc: 40.07%, scale update: False), in 18.26sec
iter   9: ll=70251928.577203, converged: 3.62% (loc: 58.57%, scale update: False), in 17.01sec
iter  10: ll=70249711.995841, converged: 3.62% (loc: 81.31%, scale update: False), in 12.83sec
iter  11: ll=70249379.829412, converged: 3.62% (loc: 89.92%, scale update: False), in 7.47sec

Fitting 23619 dispersion models: (progress not available with multiprocessing)
iter  12: ll=54446548.515894, converged: 20.45% (loc: 20.45%, scale update: True), in 1559.48sec
caught 46 linalg singular matrix errors
iter  13: ll=54435626.591855, converged: 20.45% (loc: 87.07%, scale update: False), in 20.72sec
iter  14: ll=54434658.754988, converged: 20.45% (loc: 94.59%, scale update: False), in 6.43sec
iter  15: ll=54434444.213969, converged: 20.45% (loc: 96.54%, scale update: False), in 5.81sec
iter  16: ll=54434378.910931, converged: 20.45% (loc: 97.34%, scale update: False), in 5.26sec
iter  17: ll=54434362.876335, converged: 20.45% (loc: 97.69%, scale update: False), in 5.24sec

Fitting 19494 dispersion models: (progress not available with multiprocessing)
iter  18: ll=48554309.314322, converged: 67.49% (loc: 67.49%, scale update: True), in 1260.92sec
caught 12 linalg singular matrix errors
iter  19: ll=48554130.906475, converged: 67.49% (loc: 97.40%, scale update: False), in 11.38sec
iter  20: ll=48554108.553771, converged: 67.49% (loc: 98.90%, scale update: False), in 5.42sec
caught 11 linalg singular matrix errors
iter  21: ll=48554100.295546, converged: 67.49% (loc: 99.10%, scale update: False), in 3.43sec
caught 44 linalg singular matrix errors
iter  22: ll=48554096.867598, converged: 67.49% (loc: 99.37%, scale update: False), in 3.20sec
caught 27 linalg singular matrix errors
iter  23: ll=48554095.283522, converged: 67.49% (loc: 99.52%, scale update: False), in 2.71sec

Fitting 7967 dispersion models: (progress not available with multiprocessing)
iter  24: ll=46569302.387164, converged: 90.46% (loc: 90.46%, scale update: True), in 527.66sec
caught 116 linalg singular matrix errors
iter  25: ll=46569299.973292, converged: 90.46% (loc: 99.49%, scale update: False), in 6.18sec
caught 11 linalg singular matrix errors
iter  26: ll=46569299.754370, converged: 90.46% (loc: 99.87%, scale update: False), in 2.71sec
caught 6 linalg singular matrix errors
iter  27: ll=46569299.752718, converged: 90.46% (loc: 99.92%, scale update: False), in 1.41sec
iter  28: ll=46569299.752676, converged: 90.46% (loc: 99.93%, scale update: False), in 1.50sec
caught 3 linalg singular matrix errors
iter  29: ll=46569299.752662, converged: 90.46% (loc: 99.96%, scale update: False), in 1.19sec

Fitting 2337 dispersion models: (progress not available with multiprocessing)
iter  30: ll=46113421.509132, converged: 97.82% (loc: 97.82%, scale update: True), in 175.80sec
caught 30 linalg singular matrix errors
iter  31: ll=46113419.735521, converged: 97.82% (loc: 99.93%, scale update: False), in 5.41sec
caught 6 linalg singular matrix errors
iter  32: ll=46113419.712236, converged: 97.82% (loc: 100.00%, scale update: False), in 1.23sec
iter  33: ll=46113419.710071, converged: 97.82% (loc: 100.00%, scale update: False), in 1.25sec
iter  34: ll=46113419.709985, converged: 97.82% (loc: 100.00%, scale update: False), in 1.01sec
iter  35: ll=46113419.709984, converged: 97.82% (loc: 100.00%, scale update: False), in 1.02sec

Fitting 534 dispersion models: (progress not available with multiprocessing)
iter  36: ll=45997409.837720, converged: 99.64% (loc: 99.64%, scale update: True), in 57.52sec
caught 12 linalg singular matrix errors
iter  37: ll=45997409.834046, converged: 99.64% (loc: 99.98%, scale update: False), in 2.16sec
iter  38: ll=45997409.834046, converged: 99.64% (loc: 100.00%, scale update: False), in 1.53sec

Fitting 89 dispersion models: (progress not available with multiprocessing)
iter  39: ll=45989396.952243, converged: 99.92% (loc: 99.92%, scale update: True), in 32.57sec
caught 1 linalg singular matrix errors
iter  40: ll=45989396.952202, converged: 99.92% (loc: 100.00%, scale update: False), in 1.58sec
iter  41: ll=45989396.952202, converged: 99.92% (loc: 100.00%, scale update: False), in 1.13sec

Fitting 20 dispersion models: (progress not available with multiprocessing)
iter  42: ll=45950000.848035, converged: 99.97% (loc: 99.97%, scale update: True), in 28.84sec
iter  43: ll=45950000.848035, converged: 99.97% (loc: 100.00%, scale update: False), in 1.55sec

Fitting 7 dispersion models: (progress not available with multiprocessing)
iter  44: ll=45949070.365447, converged: 100.00% (loc: 100.00%, scale update: True), in 24.72sec
iter  45: ll=45949070.365447, converged: 100.00% (loc: 100.00%, scale update: False), in 1.35sec

Fitting dispersion models: 0.00% in 0.01sec
iter  46: ll=45949070.365447, converged: 100.00% (loc: 100.00%, scale update: True), in 1.25sec

and this is the head of the resulting tsv file

gene    pval    qval    log2fc  mean    zero_mean   grad    coef_mle    coef_sd ll
A1BG    0.312478681 0.787738151 0.161516688 0.092773702 FALSE   4.056426074 0.161516688 0.159912024 -6781.629286
A1BG-AS1    0.48030134  0.929602236 0.331335826 0.009863314 FALSE   0.009596474 0.331335826 0.469435231 -385.7717783
A1CF    0.894546706 1   -0.21362873 0.001293466 FALSE   0.082358417 -0.21362873 1.611644585 -82.14013579
A2M 0.047744881 0.275626869 -0.763245681    0.01917489  FALSE   0.262260202 -0.763245681    0.385549251 -1230.818608
A2M-AS1 0.524215312 0.957188022 -0.469050349    0.005311979 FALSE   0.000410546 -0.469050349    0.736503355 -560.8241553
A2ML1   0.646698854 1   -0.415191149    0.003174966 FALSE   6.79E-06    -0.415191149    0.905832734 -467.2889809
A2ML1-AS2   1   1   0   0   TRUE    5.03E-127   0   3.37E+63    -7.31E-126
A4GALT  0.21076328  0.658880588 -1.394940411    0.001117579 FALSE   1.47615582  -1.394940411    1.114642066 -27.84762266
AAAS    0.901200251 1   0.024925929 0.046980111 FALSE   11.23234409 0.024925929 0.200780325 -1885.289258
YousufYassouf commented 3 years ago

Edit: I have used the following script

de.utils.preview_coef_names(sample_description=fadata.obs,
                                  formula="~1+Conditions+SampleType+SampleType:Conditions+n_genes+n_genes:Conditions",
                                  as_numeric='n_genes')

and got this output:

['Intercept', 'Conditions[T.2Con]', 'SampleType[T.frozen PBMC]', 'SampleType[T.frozen PBMC]:Conditions[T.2Con]', 'n_genes', 'n_genes:Conditions[T.2Con]']

it seems that '1Con' is in the intercept and the fold change is estimated as 2Con/1Con, so is there a simple way to choose which coefficient from the 'Conditions' will be in the intercept (maybe like selecting reference) to interpret log2c more easily?