neurorestore / Augur

Cell type prioritization in single-cell data
MIT License

Effect of a very small number of trees on AUC, and comparing AUC across experiments #14

Open vincentrose88 opened 2 years ago

vincentrose88 commented 2 years ago

Hi

Great work on the R-package Augur!

I'm using it to prioritise cell-type responses to treatment of a disease in two setups, 1x treatment and 5x treatment, and I have a couple of questions on how to interpret and use the AUC results:

AUC comparison across experiments?

My question is: Can I compare the AUC across these experiments directly, or can I only use the rank?

For example: in the results below, does Cell-type_A in G2 have a response comparable to Cell-type_A in G1, while Cell-type_I has a significantly bigger response in G2 than in G1?

Results

The experimental groups and results are (anonymised because this is a client's data):

G1: 1x treatment + disease (case) VS 1x placebo + disease (control)

  cell_type     auc
  <chr>         <dbl>
1 Cell-type_B   0.952
2 Cell-type_A   0.944
3 Cell-type_C   0.838
4 Cell-type_E   0.719
5 Cell-type_D   0.707
6 Cell-type_F   0.668
7 Cell-type_H   0.666
8 Cell-type_G   0.666
9 Cell-type_I   0.640

G2: 5x treatment + disease (case) VS 5x placebo + disease (control)

  cell_type    auc
  <chr>        <dbl>
1 Cell-type_A  0.991
2 Cell-type_B  0.976
3 Cell-type_C  0.974
4 Cell-type_D  0.957
5 Cell-type_E  0.957
6 Cell-type_F  0.953
7 Cell-type_G  0.946
8 Cell-type_H  0.935
9 Cell-type_I  0.931
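
To make the G1 versus G2 question concrete, the two tables above can be lined up per cell type. The following is only a descriptive sketch in base R: the values are copied from the tables above, the names `g1`, `g2` and `cmp` are made up for illustration, and a raw AUC difference is not a significance test.

```r
# AUC tables copied from the G1 and G2 results above
g1 <- data.frame(
  cell_type = paste0("Cell-type_", c("B", "A", "C", "E", "D", "F", "H", "G", "I")),
  auc = c(0.952, 0.944, 0.838, 0.719, 0.707, 0.668, 0.666, 0.666, 0.640)
)
g2 <- data.frame(
  cell_type = paste0("Cell-type_", c("A", "B", "C", "D", "E", "F", "G", "H", "I")),
  auc = c(0.991, 0.976, 0.974, 0.957, 0.957, 0.953, 0.946, 0.935, 0.931)
)

# Join on cell type so each row carries both AUCs
cmp <- merge(g1, g2, by = "cell_type", suffixes = c("_g1", "_g2"))

# Raw AUC difference and rank within each experiment (1 = highest AUC)
cmp$delta_auc <- cmp$auc_g2 - cmp$auc_g1
cmp$rank_g1 <- rank(-cmp$auc_g1, ties.method = "min")
cmp$rank_g2 <- rank(-cmp$auc_g2, ties.method = "min")

# Largest AUC increases first
print(cmp[order(-cmp$delta_auc), ])
```

Keeping both the raw difference and the within-experiment rank side by side shows whether a cell type moved relative to the others or whether all AUCs simply shifted together.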

Effect of the number of trees on AUC

For the experiment group G2 (5x treatment vs 5x placebo), I only get useful results if I use a very low number of trees, as you suggest in your paper (Methods: Hyperparameter analysis)

[…] Empirically, we suggest decreasing the number of trees in the random forest classifier in scenarios where perfect classification can be achieved for many cell types (Supplementary Fig. 10g).

My question is simply: Does it make sense to have so few trees?

Results

(Only number of trees changes, all other options are default)

Num_tree = 50

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_E  1   
2 Cell-type_A  1   
3 Cell-type_I  1
4 Cell-type_D  1
5 Cell-type_H  1
6 Cell-type_F  1
7 Cell-type_C  1
8 Cell-type_B  1
9 Cell-type_G  1

Num_tree = 10

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_E  1   
2 Cell-type_A  1   
3 Cell-type_I  1.00
4 Cell-type_D  1.00
5 Cell-type_H  1.00
6 Cell-type_F  1.00
7 Cell-type_C  1.00
8 Cell-type_B  1.00
9 Cell-type_G  1.00

Num_tree = 5

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_A  1.00 
2 Cell-type_B  0.999
3 Cell-type_E  0.998
4 Cell-type_C  0.996
5 Cell-type_D  0.996
6 Cell-type_F  0.995
7 Cell-type_H  0.993
8 Cell-type_G  0.993
9 Cell-type_I  0.990

Num_tree = 3

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_A  0.996
2 Cell-type_B  0.995
3 Cell-type_C  0.989
4 Cell-type_F  0.984
5 Cell-type_D  0.983
6 Cell-type_E  0.982
7 Cell-type_G  0.979
8 Cell-type_H  0.975
9 Cell-type_I  0.965

Num_tree = 2

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_A  0.991
2 Cell-type_B  0.976
3 Cell-type_C  0.974
4 Cell-type_D  0.957
5 Cell-type_E  0.957
6 Cell-type_F  0.953
7 Cell-type_G  0.946
8 Cell-type_H  0.935
9 Cell-type_I  0.931

Num_tree = 1

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_A  0.942
2 Cell-type_B  0.933
3 Cell-type_C  0.893
4 Cell-type_G  0.873
5 Cell-type_E  0.870
6 Cell-type_D  0.870
7 Cell-type_F  0.857
8 Cell-type_H  0.839
9 Cell-type_I  0.812

Looking forward to your feedback and thanks in advance!

Kind regards

jordansquair commented 2 years ago

Are you using a Seurat object as input, or a count/normalized matrix directly? If a Seurat object, can you check the default assay?

vincentrose88 commented 2 years ago

Yes, I'm using a Seurat object, and the default assay is "integrated":

> DefaultAssay(seurat_obj)
[1] "integrated"

jordansquair commented 2 years ago

You will want to switch that back to "RNA" or directly input the count matrix.

DefaultAssay(obj) = "RNA"

Then run Augur.

To answer your question about the experimental design: yes, you can compare the AUCs themselves.

You may want to consider using differential prioritization for this case also. You can see our protocol: https://www.nature.com/articles/s41596-021-00561-x for more details (specifically Case Study #4).
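
As a rough illustration of the two suggestions above (switching the assay, then running differential prioritization), a sketch might look like the following. Everything here is an assumption rather than a tested recipe: `obj_g1` and `obj_g2` are hypothetical Seurat objects, one per experiment, the metadata columns are assumed to be named "label" and "cell_type", and the Augur calls are written as recalled from the package README, so check them against the installed version.

```r
library(Seurat)
library(Augur)

# Hypothetical inputs: one Seurat object per experiment (G1 and G2), each with
# "label" (case/control) and "cell_type" columns in its metadata.
DefaultAssay(obj_g1) <- "RNA"
DefaultAssay(obj_g2) <- "RNA"

# Observed AUC per cell type in each experiment
augur_g1 <- calculate_auc(obj_g1, label_col = "label", cell_type_col = "cell_type")
augur_g2 <- calculate_auc(obj_g2, label_col = "label", cell_type_col = "cell_type")

# Runs with permuted labels, used as the null distribution
perm_g1 <- calculate_auc(obj_g1, label_col = "label", cell_type_col = "cell_type",
                         augur_mode = "permute")
perm_g2 <- calculate_auc(obj_g2, label_col = "label", cell_type_col = "cell_type",
                         augur_mode = "permute")

# Per-cell-type test of whether the AUC differs between the two experiments
diff_prior <- calculate_differential_prioritization(augur_g1, augur_g2,
                                                    perm_g1, perm_g2)

print(augur_g1$AUC)
print(diff_prior)
```

The permuted runs are considerably slower than the default mode, so it may be worth confirming that the default-mode AUCs look sensible before starting them.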

vincentrose88 commented 2 years ago

Thanks!

I'll give that a try!

vincentrose88 commented 2 years ago

Using RNA as the default assay I get more sensible results (with num tree = 50):

  annotation  auc
1 Cell-type_A 0.6052060
2 Cell-type_B 0.5276417
3 Cell-type_C 0.5242139
4 Cell-type_D 0.5189135
5 Cell-type_E 0.5170862
6 Cell-type_F 0.5112566
7 Cell-type_G 0.5066270
8 Cell-type_H 0.4989002

Thanks for the help! You can consider this issue closed 👍

vincentrose88 commented 2 years ago

Thinking more about these results, I'm surprised that the AUC is so much higher when running on a Seurat integrated space than on RNA:

RNA (num_tree = 50)

  annotation  auc
1 Cell-type_A 0.6052060
2 Cell-type_B 0.5276417
3 Cell-type_C 0.5242139
4 Cell-type_D 0.5189135
5 Cell-type_E 0.5170862
6 Cell-type_F 0.5112566
7 Cell-type_G 0.5066270
8 Cell-type_H 0.4989002

Integrated (num_tree = 2)

  cell_type       auc
  <chr>         <dbl>
1 Cell-type_A  0.991
2 Cell-type_B  0.976
3 Cell-type_C  0.974
4 Cell-type_D  0.957
5 Cell-type_E  0.957
6 Cell-type_F  0.953
7 Cell-type_G  0.946
8 Cell-type_H  0.935
9 Cell-type_I  0.931

Do you have any explanation for this?