greenelab / miQC

Flexible, probabilistic metrics for quality control of scRNA-seq data
BSD 3-Clause "New" or "Revised" License

plotMetrics and mixtureModel functions do not accept different column names for data #12

Open JTumulty opened 2 hours ago

JTumulty commented 2 hours ago

The documentation indicates that plotMetrics() and mixtureModel() let the user specify which columns of the object hold the number of unique genes and the percentage of mitochondrial reads, through the detected= and subsets_mito_percent= arguments. Neither function behaves as expected when other column names are provided. plotMetrics() only displays data from columns named exactly "detected" and "subsets_mito_percent"; if those columns are not present in the object, the plot shows a single point. Similarly, mixtureModel() only returns a model if those columns are present, and the model only uses the data from the columns named exactly "detected" and "subsets_mito_percent", regardless of the names passed in. If those columns are not present in the object, it returns an error.

Both issues appear to stem from the fact that detected and subsets_mito_percent are function arguments (variables holding the column names) in plotMetrics() and mixtureModel(), but when they are passed on to ggplot() and flexmix() respectively, they are interpreted as literal column names. This is because ggplot aesthetics take column names unquoted, as does the formula notation used by flexmix.
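
The mechanism can be sketched in base R without miQC. This is an illustration only, not the package's code: lm() stands in for flexmix() because both take a formula, and the toy data frame and the two functions fit_literal() and fit_by_name() are hypothetical names made up for this example.

```r
# Toy data standing in for colData(sce); values are rounded from the
# example output, and the column names are the renamed ones.
metrics <- data.frame(
  num_genes    = c(4871, 4712, 6055, 5852, 4724, 5427),
  percent_mito = c(3.46, 4.90, 2.92, 1.82, 0.76, 4.33)
)

# Mimics the reported behaviour: the argument names are written literally
# in the formula, so the model looks for columns called "detected" and
# "subsets_mito_percent". When they are absent from the data, R falls back
# to the function arguments themselves, which are length-1 strings.
fit_literal <- function(df, detected = "detected",
                        subsets_mito_percent = "subsets_mito_percent") {
  lm(subsets_mito_percent ~ detected, data = df)
}

# Builds the formula from the strings the caller passed in, so the
# user-supplied column names are the ones looked up in the data.
fit_by_name <- function(df, detected = "detected",
                        subsets_mito_percent = "subsets_mito_percent") {
  model_formula <- reformulate(detected, response = subsets_mito_percent)
  lm(model_formula, data = df)
}

# The literal version fails; capture the error message instead of stopping.
literal_result <- tryCatch(
  fit_literal(metrics, detected = "num_genes",
              subsets_mito_percent = "percent_mito"),
  error = function(e) conditionMessage(e)
)

# The by-name version fits percent_mito ~ num_genes on the real columns.
named_fit <- fit_by_name(metrics, detected = "num_genes",
                         subsets_mito_percent = "percent_mito")
```

On the ggplot2 side, the analogous way to refer to a column by a string held in a variable is the .data pronoun, e.g. aes(x = .data[[detected]], y = .data[[subsets_mito_percent]]).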

Reproducible example using dataset in vignette

> sce <- ZeiselBrainData()
> mt_genes <- grepl("^mt-",  rownames(sce))
> feature_ctrls <- list(mito = rownames(sce)[mt_genes])
> sce <- addPerCellQC(sce, subsets = feature_ctrls)
> head(colData(sce))
DataFrame with 6 rows and 21 columns
                    tissue   group # total mRNA mol      well       sex       age  diameter
               <character> <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
1772071015_C02    sscortex         1          21580        11         1        21      0.00
1772071017_G12    sscortex         1          21748        95        -1        20      9.56
1772071017_A05    sscortex         1          31642        33        -1        20     11.10
1772071014_B06    sscortex         1          32916        42         1        21     11.70
1772067065_H06    sscortex         1          21531        48         1        25     11.00
1772071017_E02    sscortex         1          24799        13        -1        20     11.90
                level1class level2class       sum  detected subsets_mito_sum subsets_mito_detected
                <character> <character> <numeric> <integer>        <numeric>             <integer>
1772071015_C02 interneurons       Int10     22354      4871              774                    23
1772071017_G12 interneurons       Int10     22869      4712             1121                    27
1772071017_A05 interneurons        Int6     32594      6055              952                    27
1772071014_B06 interneurons       Int10     33525      5852              611                    28
1772067065_H06 interneurons        Int9     21694      4724              164                    23
1772071017_E02 interneurons        Int9     25919      5427             1122                    19
               subsets_mito_percent altexps_repeat_sum altexps_repeat_detected altexps_repeat_percent
                          <numeric>          <numeric>               <numeric>              <numeric>
1772071015_C02             3.462468               8181                     419                21.9677
1772071017_G12             4.901832              11854                     480                28.8012
1772071017_A05             2.920783              18021                     582                31.6435
1772071014_B06             1.822521              13955                     512                25.5999
1772067065_H06             0.755969               6876                     363                19.9299
1772071017_E02             4.328871              17364                     618                34.7600
               altexps_ERCC_sum altexps_ERCC_detected altexps_ERCC_percent     total
                      <numeric>             <numeric>            <numeric> <numeric>
1772071015_C02             6706                    43              18.0070     37241
1772071017_G12             6435                    46              15.6349     41158
1772071017_A05             6335                    47              11.1238     56950
1772071014_B06             7032                    43              12.8999     54512
1772067065_H06             5931                    39              17.1908     34501
1772071017_E02             6671                    43              13.3543     49954
> # Changing column name for this example
> colnames(colData(sce))[colnames(colData(sce))=="detected"] <- "num_genes"
> colnames(colData(sce))[colnames(colData(sce))=="subsets_mito_percent"] <- "percent_mito"
> head(colData(sce))
DataFrame with 6 rows and 21 columns
                    tissue   group # total mRNA mol      well       sex       age  diameter
               <character> <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
1772071015_C02    sscortex         1          21580        11         1        21      0.00
1772071017_G12    sscortex         1          21748        95        -1        20      9.56
1772071017_A05    sscortex         1          31642        33        -1        20     11.10
1772071014_B06    sscortex         1          32916        42         1        21     11.70
1772067065_H06    sscortex         1          21531        48         1        25     11.00
1772071017_E02    sscortex         1          24799        13        -1        20     11.90
                level1class level2class       sum num_genes subsets_mito_sum subsets_mito_detected
                <character> <character> <numeric> <integer>        <numeric>             <integer>
1772071015_C02 interneurons       Int10     22354      4871              774                    23
1772071017_G12 interneurons       Int10     22869      4712             1121                    27
1772071017_A05 interneurons        Int6     32594      6055              952                    27
1772071014_B06 interneurons       Int10     33525      5852              611                    28
1772067065_H06 interneurons        Int9     21694      4724              164                    23
1772071017_E02 interneurons        Int9     25919      5427             1122                    19
               percent_mito altexps_repeat_sum altexps_repeat_detected altexps_repeat_percent
                  <numeric>          <numeric>               <numeric>              <numeric>
1772071015_C02     3.462468               8181                     419                21.9677
1772071017_G12     4.901832              11854                     480                28.8012
1772071017_A05     2.920783              18021                     582                31.6435
1772071014_B06     1.822521              13955                     512                25.5999
1772067065_H06     0.755969               6876                     363                19.9299
1772071017_E02     4.328871              17364                     618                34.7600
               altexps_ERCC_sum altexps_ERCC_detected altexps_ERCC_percent     total
                      <numeric>             <numeric>            <numeric> <numeric>
1772071015_C02             6706                    43              18.0070     37241
1772071017_G12             6435                    46              15.6349     41158
1772071017_A05             6335                    47              11.1238     56950
1772071014_B06             7032                    43              12.8999     54512
1772067065_H06             5931                    39              17.1908     34501
1772071017_E02             6671                    43              13.3543     49954
> plotMetrics(sce, detected="num_genes", subsets_mito_percent = "percent_mito")
Warning message:
In geom_point(colour = palette) :
  All aesthetics have length 1, but the data has 3005 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing a single row.
> mixtureModel(sce, detected="num_genes", subsets_mito_percent = "percent_mito")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

JTumulty commented 2 hours ago

In addition, this issue should also be fixed in the other plotting functions, plotModel() and plotFiltering(), where similar ggplot2 code is used. A simple proposed solution is to automatically rename the columns provided by the user to "detected" and "subsets_mito_percent" before plotting or passing the data to flexmix.
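
A minimal sketch of that renaming step, in base R. The helper use_default_names() is hypothetical and not part of miQC; it assumes the per-cell metrics have already been converted to a plain data frame (for example with as.data.frame(colData(sce))). It copies the two user-named columns into a new data frame under the default names, rather than renaming in place, so an existing column is never overwritten.

```r
# Hypothetical helper, not part of miQC: return a data frame whose columns
# carry the default names that the plotting and model code refers to.
use_default_names <- function(metrics, detected = "detected",
                              subsets_mito_percent = "subsets_mito_percent") {
  wanted <- c(detected, subsets_mito_percent)
  missing_cols <- setdiff(wanted, colnames(metrics))
  if (length(missing_cols) > 0) {
    # Fail early with a readable message instead of a downstream error.
    stop("Column(s) not found: ", paste(missing_cols, collapse = ", "))
  }
  data.frame(
    detected = metrics[[detected]],
    subsets_mito_percent = metrics[[subsets_mito_percent]],
    row.names = rownames(metrics)
  )
}

# Usage with the renamed columns from the example above (toy values).
metrics <- data.frame(
  num_genes    = c(4871, 4712, 6055),
  percent_mito = c(3.46, 4.90, 2.92)
)
renamed <- use_default_names(metrics, detected = "num_genes",
                             subsets_mito_percent = "percent_mito")

# A misspelled column name is reported by name.
bad_call <- tryCatch(
  use_default_names(metrics, detected = "n_genes",
                    subsets_mito_percent = "percent_mito"),
  error = function(e) conditionMessage(e)
)
```

With a step like this at the top of each function, the existing ggplot2 and flexmix calls could stay as they are, since they would always see columns named "detected" and "subsets_mito_percent".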