neurorestore / Augur

Cell type prioritization in single-cell data
MIT License

Effect of cell type abundance on AUC values #19

Open hayfre opened 2 years ago

hayfre commented 2 years ago

Hi! I have been testing out Augur on my single-cell dataset, which contains 25 different cell types ranging in abundance from 50 cells to over 4000 cells. I’ve noticed that Augur seems to produce higher AUC values for some of the least abundant cell populations, which I do not expect to change much between the conditions. I wonder if this is due to the subsample size – does repeatedly drawing 20 cells at random from 100 train the classifier to cover more of the variation in the cell population than repeatedly drawing 20 from 2000? If this is the case, do you have any recommendations for how to address this potential bias, or which arguments to adjust in the calculate_auc function? Thanks!

skinnider commented 2 years ago

Hi @hayfre - without knowing more about your particular dataset I can only speak in generalities, but my intuition would be that if you can rule out a biological effect, there may be a significant technical effect on this population (for instance, cells of this type from one of your libraries are stressed or dying). If there are only a few cells of this type, these cells would be present in every subsample and would make the two conditions easier for the RF to separate.

This is just one potential explanation, but you could experiment with changing the subsample size and see if your results are stable. We found they generally were (Supp. Figs. 6 and 10 in the Augur paper), but it may be the case that the AUC for your small cell population is more sensitive. The only thing I would suggest is that if you are going to lower the subsample size, you may want to increase the number of subsamples to give the prioritization a better chance to 'converge'.
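
To make the subsampling arithmetic discussed in this thread concrete, here is a minimal Python sketch of how often the same cells are reused when a fixed number of cells is drawn from a small versus a large pool. It is not part of Augur: `subsample_reuse` is a hypothetical helper, it assumes each subsample is a uniform random draw without replacement from a single pool of cells, and the values of 20 cells per subsample and 50 subsamples are assumptions chosen for illustration.

```python
def subsample_reuse(pool_size, subsample_size=20, n_subsamples=50):
    """Summarize how heavily individual cells are reused when `subsample_size`
    cells are drawn without replacement from a pool of `pool_size` cells,
    independently repeated `n_subsamples` times.

    Hypothetical helper for illustration only; the defaults are assumptions.
    """
    if pool_size <= 0 or subsample_size <= 0 or n_subsamples <= 0:
        raise ValueError("all arguments must be positive")
    k = min(subsample_size, pool_size)  # cannot draw more cells than exist
    p = k / pool_size                   # chance a given cell is in one subsample
    return {
        "p_in_one_subsample": p,
        # Expected number of subsamples in which a given cell appears.
        "expected_draws_per_cell": k * n_subsamples / pool_size,
        # Expected number of cells shared by two independent subsamples
        # (mean of a hypergeometric distribution).
        "expected_shared_cells": k * k / pool_size,
        # Chance a given cell is never used in any subsample.
        "p_never_drawn": (1 - p) ** n_subsamples,
    }


for pool in (100, 2000):
    stats = subsample_reuse(pool)
    print(pool, {name: round(value, 4) for name, value in stats.items()})
```

Under these assumptions, each cell in a pool of 100 is expected to appear in 10 of the 50 subsamples and any two subsamples share about 4 cells, whereas in a pool of 2000 each cell is expected to appear in 0.5 subsamples and two subsamples share about 0.2 cells. A handful of atypical cells in a small population can therefore influence most subsamples. The same helper can be called with other settings, for example `subsample_reuse(100, subsample_size=10, n_subsamples=100)`, to see how lowering the subsample size while raising the number of subsamples changes the reuse.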