satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

FindAllMarkers between three identities #497

Closed mvalenzuelav closed 6 years ago

mvalenzuelav commented 6 years ago

Hi all,

I have three genotypes to compare cells between, so I always use FindAllMarkers instead of FindMarkers, because I assume the latter only compares two identities and I have three. So I would like to ask: what is the meaning of pct.1 and pct.2 in this case? I think pct.1 corresponds to the cells in the identity listed in the "cluster" column of the results, and pct.2 to the remaining cells in the other two clusters. Am I correct? Apart from that, I would like to know the meaning and usage of the other columns in the FindAllMarkers results: p_val, avg_logFC and p_val_adj.

  1. I understand that the avg_logFC column gives the difference found for that gene in that cluster/identity compared to the other two, so a positive number means the gene is upregulated and a negative number means it is downregulated, right? I am getting very small numbers in this column, around zero point something. Does this mean that the differences are not really big, or even almost nonexistent?

  2. Is p_val the p-value for this difference? So if it is very small, the difference compared to the other clusters is very significant. Is that correct?

  3. I do not know what p_val_adj means.

By the way, should I compare the clusters one to one (although that means running the function several times), or is it fine the way I am doing it: one against the other two?

Thank you very, very much in advance.

Marina

mukundvarma commented 6 years ago

You can see what the outputs of any function mean by doing

?FunctionName

so,

?FindMarkers in your case

satijalab commented 6 years ago

Yes - your interpretation of pct.1 and pct.2 is correct.

The answers to your other questions are in ?FindMarkers

mvalenzuelav commented 6 years ago

I have already read the help for FindMarkers and FindAllMarkers, but it does not describe the columns of the resulting table. It only says: "Matrix containing a ranked list of putative markers, and associated statistics (p-values, ROC score, etc.)". I only want to know which column I should pay attention to in order to find the genes that are most relevant for differentiating between populations. High values in avg_logFC? Is the table ordered by it? Thanks again.

kathush commented 6 years ago

Hi Marina, This is from Seurat's vignettes. Hope you will find it useful. The values are not ordered by this column, so you should sort the avg_logFC column.

The results data frame has the following columns:

p_val: p-value (unadjusted)
avg_logFC: log fold-change of the average expression between the two groups. Positive values indicate that the gene is more highly expressed in the first group.
pct.1: The percentage of cells where the gene is detected in the first group
pct.2: The percentage of cells where the gene is detected in the second group
p_val_adj: Adjusted p-value, based on Bonferroni correction using all genes in the dataset.
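As a rough illustration of how these columns can be used together, here is a minimal base R sketch on an invented table shaped like FindAllMarkers output. The gene names, values, and `n_genes_tested` are assumptions made up for the example, and the fold-change conversion assumes avg_logFC is on the natural-log scale, which is worth checking for your Seurat version.

```r
# Mock table with the columns described above; every value here is invented.
markers <- data.frame(
  gene      = c("GeneA", "GeneB", "GeneC", "GeneD"),
  cluster   = c("g1", "g1", "g2", "g2"),
  p_val     = c(1e-8, 0.04, 1e-12, 2e-7),
  avg_logFC = c(0.90, 0.30, -0.60, 1.40),
  pct.1     = c(0.80, 0.35, 0.10, 0.95),
  pct.2     = c(0.20, 0.30, 0.55, 0.15),
  stringsAsFactors = FALSE
)

# Bonferroni correction over all genes in the dataset (hypothetical count).
n_genes_tested <- 20000
markers$p_val_adj <- p.adjust(markers$p_val, method = "bonferroni", n = n_genes_tested)

# Keep genes that survive the correction, then sort within each cluster
# from the most upregulated to the most downregulated gene.
significant <- markers[markers$p_val_adj < 0.05, ]
ranked <- significant[order(significant$cluster, -significant$avg_logFC), ]

# Convert to a plain fold change (assumes avg_logFC is a natural log).
ranked$fold_change <- exp(ranked$avg_logFC)

print(ranked)
```

Filtering on p_val_adj before sorting keeps genes with a large but unreliable fold change out of the top of the list.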

mvalenzuelav commented 6 years ago

Thank you very much @kathush !! So I need to sort by the avg_logFC column :) Do you know if I should compare the groups/genotypes one to one (although that means running the function several times), or is it fine the way I am doing it: one against the other two genotypes? Because I have three groups, not only two.

leonfodoulian commented 6 years ago

Hi Marina,

> Do you know if I should compare one to one group/genotype (although it means to run the function several times) or is it fine the way I am doing it: one to the other two genotypes?

It all depends on your question. If you want to have pairwise comparisons, then you will have to compute differential expression between each pair. If, on the other hand, you want to check which genes characterise each condition with respect to all others, then the way you are doing it is fine.
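For the pairwise route, one way to avoid writing each call by hand is to enumerate the pairs first. This is only a sketch: the labels `g1`, `g2`, `g3` and the helper `run_pairwise` are hypothetical, and the differential expression function is passed in as an argument so the snippet runs without Seurat loaded.

```r
# Hypothetical identity labels for the three genotypes.
genotypes <- c("g1", "g2", "g3")

# All unordered pairs: (g1, g2), (g1, g3), (g2, g3).
pairs <- combn(genotypes, 2, simplify = FALSE)
names(pairs) <- vapply(pairs, paste, character(1), collapse = "_vs_")

# Hypothetical helper: apply a two-group test to every pair.
# 'de_fun' is any function taking (object, ident.1, ident.2).
run_pairwise <- function(object, pairs, de_fun) {
  lapply(pairs, function(p) de_fun(object, ident.1 = p[1], ident.2 = p[2]))
}

# With Seurat loaded and identities set to the genotype labels, this would be:
# pairwise.markers <- run_pairwise(object, pairs, FindMarkers)
print(names(pairs))
```

The result is a named list with one table per pair, whereas FindAllMarkers returns a single table in which each identity is compared against all the others.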

Best, Leon

mvalenzuelav commented 6 years ago

Hi Leon, Thank you for your answer. I totally understand what you mean. It is just a question of what I want to see. But, for example, in FindConservedMarkers, in my case with 3 groups, I obtained all the columns (p_val, avg_logFC, pct.1, pct.2, p_val_adj) for each group (these 5 columns for each genotype). So, in this case, I do not know how to sort by avg_logFC, because each gene has a different avg_logFC value depending on the genotype. How can I select the most conserved genes identifying each cluster if I have three values for avg_logFC? Sorry, I am a bit lost with so much data and do not know how to continue in order to identify the genes in each cluster and the genes that are differentially expressed between clusters, because I am doing an integrated analysis. Is there any other way to do it? With separate analyses, I cannot compare between genotypes, right? Only within a genotype. Thanks again a lot!

leonfodoulian commented 6 years ago

Hi Marina,

Sorry for my delayed answer. I was quite busy lately.

One way to proceed is to create a new column in your table summarising conserved markers between the genotypes for each cluster by computing mean values of avg_logFC between the genotypes (say mean_avg_logFC). (You can also do the same for min or max values). You can then order the table by decreasing mean_avg_logFC values and select the top 10 or so marker genes.

# Create a simulated grouping variable
object@meta.data$groups <- sample(
  x = c("g1", "g2", "g3"),
  size = length(x = object@cell.names),
  replace = TRUE
)

# Create a simulated identity variable
object@meta.data$ident <- sample(
  x = c(1, 2),
  size = length(x = object@cell.names),
  replace = TRUE
)

# Set cell identities to simulated identities
object <- SetAllIdent(object = object, id = "ident")

# Find markers that are conserved between the groups for cell identity 1 compared to cell identity 2
conserved.markers <- FindConservedMarkers(object = object, ident.1 = 1, ident.2 = 2, grouping.var = "groups")

# Select 'avg_logFC' columns indices
avg_logFC_columns <- grep(pattern = "avg_logFC", x = colnames(x = conserved.markers))

# Compute mean 'avg_logFC'
conserved.markers$mean_avg_logFC <- rowMeans(x = conserved.markers[avg_logFC_columns])

## For upregulated genes
# Order 'conserved.markers' by 'mean_avg_logFC'
conserved.markers <- conserved.markers[order(conserved.markers$mean_avg_logFC, decreasing = TRUE), ]

# Get top 10 genes
rownames(x = conserved.markers[1:10,])

## For downregulated genes
# Order 'conserved.markers' by 'mean_avg_logFC'
conserved.markers <- conserved.markers[order(conserved.markers$mean_avg_logFC, decreasing = FALSE), ]

# Get top 10 genes
rownames(x = conserved.markers[1:10,])
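# The "min" variant mentioned above, sketched in base R on an invented table.
# The column names follow the per-group "<group>_avg_logFC" pattern; the gene
# names and values are made up for illustration only.

```r
# Mock FindConservedMarkers-style table: one avg_logFC column per genotype.
conserved <- data.frame(
  g1_avg_logFC = c(1.2, 0.9, 0.3),
  g2_avg_logFC = c(1.0, 0.1, 0.4),
  g3_avg_logFC = c(1.1, 1.5, 0.5),
  row.names = c("GeneA", "GeneB", "GeneC")
)

# Indices of the per-group avg_logFC columns.
cols <- grep(pattern = "avg_logFC$", x = colnames(conserved))

# Row-wise minimum and mean across the genotypes.
conserved$min_avg_logFC <- do.call(pmin, conserved[cols])
conserved$mean_avg_logFC <- rowMeans(conserved[cols])

# Rank genes by each summary, largest first.
by_min <- rownames(conserved)[order(conserved$min_avg_logFC, decreasing = TRUE)]
by_mean <- rownames(conserved)[order(conserved$mean_avg_logFC, decreasing = TRUE)]

print(by_min)
print(by_mean)
```

# Ranking by the minimum favours genes that are upregulated in every genotype,
# while the mean can be pulled up by one genotype with a large value (GeneB here).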

Best, Leon

mvalenzuelav commented 6 years ago

This is a good idea! Thanks very much!