satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
208 stars 33 forks source link

apparently wrong logic in function vst #58

Closed dariomel closed 4 years ago

dariomel commented 4 years ago

I am guessing that the following if statement should use any(...), instead of all(...), because the current behavior does not make a lot of sense:

https://github.com/ChristophH/sctransform/blob/81614def2851d690875089528fc17bd6be19d030/R/vst.R#L104

Specifically, if the argument latent_var contains 'log_umi' and the data.frame cell_attr does contain a column named 'log_umi', then setdiff(latent_var, colnames(cell_attr)) yields character(0), which causes setdiff(latent_var, colnames(cell_attr)) %in% known_attr to yield logical(0), which causes all(setdiff(latent_var, colnames(cell_attr)) %in% known_attr) to yield TRUE, which causes the if statement to be executed, which wastes time and RAM by recalculating column 'log_umi' and adding a bunch of unneeded columns to data.frame cell_attr.

ChristophH commented 4 years ago

Thanks for pointing that out. Fixed in a1f6e15

Note that this fix only skips the 'Calculating cell attributes' block if all latent_var are present as column in cell_attr. If that's not the case, there will still be time and RAM used for adding a bunch of columns, namely umi, gene, log_umi, log_gene, umi_per_gene, log_umi_per_gene. These might not always be needed, but were helpful during method development. I'll leave that optimization for some other time.

ChristophH commented 4 years ago

As of 812813e fewer unneeded columns in cell_attr should be created - none if all latent_var are given as part of the input cell_attr.