TPM values depend on Combination

keksundso commented 1 year ago

Using sleuth I can calculate the TPM-Value of a gene in a Sample (sleuth_prep followed by sleuth_to_matrix).

Now if I have 6 samples (6 abundance tables from Kallisto) with two belonging to on of the three conditions P, G and H, I would expect to get one TPM-Value per gene per sample.

However, the TPM-Value for Gene i in sample j is not a fixed value, but i varies depended on the combinations of conditions which go into sleuth_prep.

E.g. gene i in P1 has a different TPM-value when condition P and G goes into sleuth_prep compared to gene i in P1 with the condition P and H. See minimal example below:

getSleuthObj <- function(s2c){
    transcript2gene <- read_delim(transcript2genePath, delim = "\t", col_names=c("target_id","ens_gene","ext_gene"),show_col_types = FALSE )
    sleuth.obj <- sleuth_prep(sample_to_covariates = s2c, 
                              target_mapping = transcript2gene, 
                              extra_bootstrap_summary = TRUE,
                              read_bootstrap_tpm = TRUE, 
                              aggregation_column = 'ens_gene',
                              num_cores = numberOfCores,
                              gene_mode = TRUE
    )

    tpms <- sleuth_to_matrix(sleuth.obj, "obs_norm", "tpm")

    tpms <- as.data.frame(tpms)
    tpms$ens_gene <- rownames(tpms)
    tpms$ext_gene <- sleuth.obj$target_mapping$ext_gene[match(tpms$ens_gene, sleuth.obj$target_mapping$ens_gene)]
    rownames(tpms) <- NULL

    return(tpms)
    }

sleuth_PG <- getSleuthObj(rbind( 
    data.frame(sample = c("P1","P2"),
               condition = "P",
               path = c("/home/keks/app/data/P1","/home/keks/app/data/P2"),
               stringsAsFactors = FALSE)
    ,
    data.frame(sample = c("G5","G6"),
               condition = "G",
               path = c("/home/keks/app/data/G5","/home/keks/app/data/G6"),
               stringsAsFactors = FALSE)
))
sleuth_PH <- getSleuthObj(rbind( 
    data.frame(sample = c("P1","P2"),
               condition = "PC",
               path = c("/home/keks/app/data/P1","/home/keks/app/data/P2"),
               stringsAsFactors = FALSE)
    ,
    data.frame(sample = c("H3","H4"),
               condition = "H",
               path = c("/home/keks/app/data/H3","/home/keks/app/data/H4"),
               stringsAsFactors = FALSE)
))

mschilli87 commented 1 year ago

I didn't take a deep look but my first guess would be that depending on your input, a different set of transcripts passes sleuth's internal filters, resulting in a different total number of reads and features being used as the basis for the TPM normalization. If you have less features, all remaining features will end up with higher numbers because TPM always sum up to 1 M. So maybe you can get the behaviour that you want by manually overwriting the filter settings.

keksundso commented 1 year ago

This sounds reasonable, especially since the variation is a constant shift proportional to the tpm values as it would be expected by a different total read number. I replaced the filter function by a custom one which should not filter anything:

myFilter <- function (row, min_reads = 0, min_prop = 0) 
{
    mean(row >= min_reads) >= min_prop
}

sleuth.obj <- sleuth_prep(sample_to_covariates = s2c, 
                           target_mapping = transcript2gene, 
                           extra_bootstrap_summary = TRUE,
                           read_bootstrap_tpm = TRUE, 
                           aggregation_column = 'ens_gene',
                           num_cores = numberOfCores,
                           gene_mode = TRUE,
                           filter_fun = myFilter

As expected, in both combinations the same number of targets and genes now pass the filter. However, the difference in the tpm values between both combinations persist.

pachterlab / sleuth

TPM values depend on Combination #282