rnabioco / clustifyr

Infer cell types in scRNA-seq data using bulk RNA-seq or gene sets
https://rnabioco.github.io/clustifyr/
MIT License

Seurat 4 compatibility? #374

Closed: Dazcam closed this issue 3 years ago

Dazcam commented 3 years ago

Hi there,

I would like to ask if clustifyr is compatible with data generated using Seurat 4?

The reason I ask is that I have run clustifyr with one of the suggested reference datasets that you provide (ref_cortex_dev). When I visualise my Seurat 4 generated clusters by grouping on res$type, only 5 or 7 of the 47 cell types in the reference dataset are mapped onto my cells (depending on whether I set query_genes to NULL or VariableFeatures(seurat.batch1)), and the assignments don't really make any sense.

Whilst I understand that the degree to which cell ID correlates between datasets depends on the similarity of their cell types, tissue of origin, data quality etc., I think it's unlikely that only 5-7 of the 47 cell types would map over.

There have been some minor, but perhaps important, changes made to the underlying code of some of the key Seurat 4 functions. In particular, the FindMarkers function now reports log fold differential expression values to base 2 instead of to the natural log. So it would be useful to clarify whether the changes made to Seurat directly impact the clustifyr output.
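For reference, the two scales differ only by a constant factor, since log2(x) = ln(x) / ln(2). Below is a minimal base R sketch of that conversion. The helper name `ln_to_log2` and the 0.25 cutoff are hypothetical and used only for illustration; neither comes from Seurat or clustifyr.

```r
# Hypothetical helper: convert natural-log fold changes to log2 fold changes.
ln_to_log2 <- function(ln_fc) {
  ln_fc / log(2)  # log2(x) = ln(x) / ln(2)
}

# A twofold increase is ln(2) on the natural-log scale and 1 on the log2 scale.
ln_to_log2(log(2))

# The same (illustrative) cutoff expressed on both scales.
ln_cutoff <- 0.25
log2_cutoff <- ln_to_log2(ln_cutoff)
```

A threshold chosen on one scale therefore selects a different set of genes if it is applied unchanged to values reported on the other scale.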

Any advice you could offer on this matter would be greatly appreciated.

Many Thanks,

Darren


UPDATE: When I ran the same analysis as above using Seurat 3 parameters, 14 cell types mapped over in a manner that makes more sense biologically, suggesting that clustifyr is not compatible with Seurat 4 output.

raysinensis commented 3 years ago

Thanks for making us aware of the issue, investigating...

raysinensis commented 3 years ago

Hi @Dazcam, looking at the changes in Seurat, the two things I can think of are: 1) we find that clustifyr performs worse with sctransform than with other normalization methods; 2) by default, Seurat 4 might be returning 3000 variable genes instead of the previous 2000, and 2000 was probably already too large a number (setting query_genes to NULL would use all genes, which would produce even worse results). Can you try setting query_genes to something like query_genes = VariableFeatures(seurat.batch1)[1:1000]?

If none of the above explains the errors, is there any chance you could share the two different versions of the objects with us?
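A rough sketch of that suggestion follows. The helper `top_genes` is hypothetical, the toy gene vector stands in for `VariableFeatures(seurat.batch1)`, and the commented-out `clustify()` call assumes that the reference matrix is already loaded as `ref_cortex_dev`, that the cluster labels live in a metadata column named `seurat_clusters`, and that the variable-feature vector is ordered most variable first.

```r
# Hypothetical helper: keep only the first n entries of a ranked gene vector.
top_genes <- function(genes, n = 1000) {
  head(genes, n)
}

# Toy stand-in for VariableFeatures(seurat.batch1), which needs a Seurat object.
ranked_genes <- paste0("gene", seq_len(3000))
query <- top_genes(ranked_genes, n = 1000)
length(query)

# With Seurat and clustifyr loaded, the call might look like this (not run here):
# res <- clustify(
#   input       = seurat.batch1,
#   ref_mat     = ref_cortex_dev,      # assumed to be loaded as a matrix
#   cluster_col = "seurat_clusters",   # assumed metadata column name
#   query_genes = top_genes(VariableFeatures(seurat.batch1), n = 1000)
# )
```

Using `head()` rather than `[1:1000]` avoids NA entries if fewer than 1000 variable genes are available.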

Many thanks, Rui

Dazcam commented 3 years ago

Hi Rui,

Many thanks for looking into this. Off the top of my head I have a feeling there were 3000 genes when running with Seurat 4 but I will need to check this. It’s interesting that including more genes hampers clustifyr’s performance.

Unfortunately I’m off on holiday until Jan 4th so won’t be able to test your suggestions (or send test data) until then.

I will look into this as soon as I return and report back.

Thanks again,

Darren

Dazcam commented 3 years ago

Hello again Rui,

I have managed to have a look at this again, and the issue was caused by the number of variable genes fed into clustifyr. When I lowered the number of variable genes to 1000, the cluster assignments made much more sense.

I see you have updated the README and mention that scTransform may not be ideal with clustifyr; I will keep this in mind. We haven't settled on the final normalisation method that we're going to use for our analysis yet. I ran this most recent analysis using the Seurat default normalisation method.

Thanks again for your help with this.

Best,

Darren