hbctraining / scRNA-seq_online

https://hbctraining.github.io/scRNA-seq_online/.
493 stars 175 forks source link

many to many warning when assigning annotations to markers in scRNAseq #112

Open hwick opened 9 months ago

hwick commented 9 months ago

In https://hbctraining.github.io/scRNA-seq_online/lessons/09_merged_SC_marker_identification.html adding annotations in the loop using the following method results in a many-to-many warning (when using methods to download annotations using annotationHub). This is because unique(annotations[, c("gene_name", "description")]) only finds unique gene_name/description combinations, but there are often gene repeats where genes with the same gene_name have slightly different descriptions due to being from different seq_names (such as being from a patch). For me, the many-to-many warning does not pop up when in the loop below but pops up if assigning to a single object, like the results of running FindAllMarkers. Despite the warning not popping up in the loop, I found instances of repeat gene names in my FindConservedMarkers output, with the same p values, fold changes, etc (this is what tipped me off to investigate this problem).

I downloaded annotations from annotationHub using methods outlined in another lesson section, rather than the annotations.csv example file. I'm not sure if this issue exists in the annotations.csv file, but regardless it creates some issue applying the lesson to real world analysis.

get_conserved <- function(cluster){
  FindConservedMarkers(seurat_integrated,
                       ident.1 = cluster,
                       grouping.var = "sample",
                       only.pos = TRUE) %>%
    rownames_to_column(var = "gene") %>%
    left_join(y = unique(annotations[, c("gene_name", "description")]),
               by = c("gene" = "gene_name")) %>%
    cbind(cluster_id = cluster, .)
  }

> dim(annotations) #full size
[1] 71440     5
> dim( unique(annotations[, c("gene_name", "description")])) #just unique gene_name/description combos, as used in the above loops
[1] 47618     2
> length(unique(annotations$gene_name)) #just unique gene_names
[1] 40902