pachterlab / sleuth

Differential analysis of RNA-Seq
http://pachterlab.github.io/sleuth
GNU General Public License v3.0

Allow for filter functions that operate on the whole matrix rather than by row #184

Open byoo opened 6 years ago

byoo commented 6 years ago

Hello,

I have RNA sequencing data prepared using rRNA depletion and would like to conduct gene-level differential expression analysis using p-value aggregation. To do so, I attempted to exclude non-coding RNAs such as rRNA by specifying the target_ids of interest as follows: sleuth_prep(..., filter_target_id=protein_coding_target_ids)

However, this seems to lead to an error when calling sleuth_results:

sleuth_results(so_g, 'reduced:full', 'lrt', show_all=T)
Error in sleuth_results(so_g, "reduced:full", "lrt", show_all = T) : 
  The provided weighting function for the mean observations results in negative values, which are not allowed for the lancaster method.

It runs successfully without the filtering. Could you advise how to resolve this? I also wonder whether my transcript filtering is correct.

Thank you!

byoo commented 6 years ago

OK, it looks like specifying filter_target_id made sleuth_prep skip the count-based transcript filtering (basic_filter), which is still necessary. Applying basic_filter myself resolved the error.
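
(For reference, as far as I can tell from the documentation, basic_filter is essentially the following row-wise test, applied to each transcript's estimated counts across samples:)

## my understanding of sleuth::basic_filter (per the documentation):
## keep a transcript if at least ~47% of samples have at least 5 estimated counts
basic_filter <- function(row, min_reads = 5, min_prop = 0.47) {
  mean(row >= min_reads) >= min_prop
}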

BTW, my implementation seems suboptimal. I couldn't find a way to access the transcript_id inside filter_fun in sleuth_prep, so I had to run sleuth_prep twice to apply both filters, as follows:

biotypefilter <- protein_coding_target_ids
## first pass: default count-based filter only, to learn which transcripts pass it
so <- sleuth_prep(full_data, target_mapping=t2g, extra_bootstrap_summary=T,
                  num_cores=1)

## keep only transcripts that pass the count filter AND are protein coding
txfilter <- so$filter_df$target_id[so$filter_df$target_id %in% biotypefilter]

## second pass: apply the combined filter and aggregate p-values to gene level
so <- sleuth_prep(full_data, target_mapping=t2g, extra_bootstrap_summary=T,
                  num_cores=1, filter_target_id=txfilter,
                  pval_aggregate=T, aggregation_column='ens_gene')
...

Have I resolved it correctly? I would appreciate any advice. Thanks!

warrenmcg commented 6 years ago

Hi @byoo,

The new filter_target_id option is working as described: you provide it a list of target_ids, and only those exact IDs are kept, with no other filtering applied. Since you are supplying a list of target_ids, filter_fun is NULL, i.e. the count-based basic_filter is not run. The error occurs because keeping every protein-coding transcript regardless of abundance pulls in low-abundance features: any feature with an average (normalized) estimated count of less than 0.5 yields a negative logarithm, and negative weights are not allowed by the lancaster p-value aggregation method.
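
To illustrate with toy numbers (not sleuth internals), the weighting is based on a logarithm of the mean observations, so a sufficiently low mean flips the sign:

## toy illustration only
log(0.3)  # negative -> rejected as a weight by the lancaster method
log(5)    # positive -> fine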

Your way is correct. A slightly better way to do it is the following:

library(sleuth)
biotypefilter <- protein_coding_target_ids
so <- sleuth_prep(full_data, normalize = FALSE) ## this just reads in the raw data
counts <- sleuth_to_matrix(so, "obs_raw", "est_counts") ## this gets a matrix of counts
filter <- apply(counts, 1, basic_filter) ## this applies the basic filter to the counts
txfilter <- biotypefilter[which(biotypefilter %in% names(filter)[filter])] ## this gives you the filter

so <- sleuth_prep(full_data, target_mapping=t2g, extra_bootstrap_summary=T,
                  num_cores=1, filter_target_id=txfilter,
                  pval_aggregate=T, aggregation_column='ens_gene')

That being said, it may be worth reconfiguring filtering to take the full matrix rather than forcing the filter function to operate row by row. What do you think about that @pimentel?
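
For example, something along these lines would let count-based and annotation-based criteria be combined in a single pass (purely hypothetical signature; sleuth_prep does not currently accept a matrix-level filter function):

## hypothetical matrix-level filter: `counts` would be the full
## transcripts-by-samples matrix of estimated counts
matrix_filter <- function(counts, keep_ids, min_reads = 5, min_prop = 0.47) {
  pass_counts  <- apply(counts, 1, function(row) mean(row >= min_reads) >= min_prop)
  pass_biotype <- rownames(counts) %in% keep_ids
  pass_counts & pass_biotype  # one logical value per transcript
}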

byoo commented 6 years ago

Thanks a lot for your kind help, @warrenmcg!

m3lorra commented 5 years ago

Hi, I don't know if this is exactly the same issue, but I'm getting the same message and I guess I'm missing something.

I'm trying to use filter_fun together with the new settings pval_aggregate = TRUE, aggregation_column = 'ens_gene' in sleuth 0.30.

I want to use filter_fun because I have a study with two genotypes and two treatments, so I have four groups. I didn't have any issues with sleuth 0.29.

The problem appears when I do the LRT testing.

This is my code:

> my_filter = function(row, min_reads = 5, min_prop = 0.20) {
    mean(row >= min_reads) >= min_prop
}

> so = sleuth_prep(s2c,
    target_mapping = t2g,
    read_bootstrap_tpm = TRUE,
    extra_bootstrap_summary = TRUE,
    pval_aggregate = TRUE,
    aggregation_column = 'ens_gene',
    filter_fun = my_filter)

> so= sleuth_fit( so, ~genotype+sex+tto, 'full')
fitting measurement error models
shrinkage estimation
3 NA values were found during variance shrinkage estimation due to mean observation values outside of the range used for the LOESS fit.
The LOESS fit will be repeated using exact computation of the fitted surface to extrapolate the missing values.
These are the target ids with NA values: ENSMUST00000080834.14, ENSMUST00000112010.8, ENSMUST00000127027.1
computing variance of betas
> so= sleuth_fit( so, ~genotype+sex, 'reduced')
fitting measurement error models
shrinkage estimation
3 NA values were found during variance shrinkage estimation due to mean observation values outside of the range used for the LOESS fit.
The LOESS fit will be repeated using exact computation of the fitted surface to extrapolate the missing values.
These are the target ids with NA values: ENSMUST00000080834.14, ENSMUST00000112010.8, ENSMUST00000127027.1
computing variance of betas
> so= sleuth_lrt( so, 'reduced', 'full')
> lrt_res=sleuth_results( so, 'reduced:full', 'lrt', show_all= FALSE)
Error in sleuth_results(so, "reduced:full", "lrt", show_all = FALSE) : 
  The provided weighting function for the mean observations results in negative values, which are not allowed for the lancaster method.

I hope you can help me with this! Thanks

mschilli87 commented 2 years ago

Is there any update on this? I have just finished running a large analysis, only to learn at the last step that I need to start over with the filtering. I defined a transcript as expressed if it has six or more raw counts, and excluded transcripts not expressed in at least 20% of cases and 20% of controls. Additionally, I filtered by biotype, since my kallisto index contains rRNAs etc. that I want to account for during pseudoalignment but do not care about for differential expression. Even just a warning during sleuth_prep about too lowly expressed genes would already have been useful.
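
For concreteness, the kind of filter I would like to express is roughly the following sketch, where condition and coding_ids are stand-ins for my sample annotation and biotype whitelist; neither is visible to the current per-row filter_fun:

## sketch of the desired filter over the full transcripts-by-samples raw count matrix;
## `condition` is a per-sample factor ("case"/"control"), `coding_ids` my biotype whitelist
my_matrix_filter <- function(counts, condition, coding_ids,
                             min_reads = 6, min_prop = 0.20) {
  expressed <- counts >= min_reads
  enough_cases    <- rowMeans(expressed[, condition == "case",    drop = FALSE]) >= min_prop
  enough_controls <- rowMeans(expressed[, condition == "control", drop = FALSE]) >= min_prop
  enough_cases & enough_controls & (rownames(counts) %in% coding_ids)
}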

mschilli87 commented 2 years ago

@warrenmcg:

Can you confirm that the following is equivalent to your suggestion?

#[...]
so <- sleuth_prep(full_data, read_bootstraps = FALSE, normalize = FALSE) ## read/process as little as necessary
filter <- so$filter_bool ## no need to get the count matrix and run the basic filter again
#[...]

Any trick to speed things up by filtering down the existing in-RAM sleuth object instead of re-reading the data from disk?

Or even better: why not issue a warning for negative values during p-value aggregation and return NA results for the corresponding genes, instead of failing the entire analysis? AFAICT, even with the basic filter applied, a mean normalized expression below 0.5 could still arise depending on the dataset. In such a case, the user is left with an error message referring to weights they never explicitly specified, which makes this hard to debug. And even with the information provided here, finding an appropriate filter to remove the (unknown) offending targets could be more complicated than it needs to be.
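
For what it is worth, a rough way to hunt for the offending targets after sleuth_prep, reusing the sleuth_to_matrix call from the example above and the 0.5 threshold mentioned earlier (a sketch, not guaranteed to match sleuth's internal weighting exactly):

## sketch: list transcripts whose mean normalized estimated count is below 0.5,
## i.e. the ones expected to produce negative weights during p-value aggregation
norm_counts <- sleuth_to_matrix(so, "obs_norm", "est_counts")
offenders   <- rownames(norm_counts)[rowMeans(norm_counts) < 0.5]
head(offenders)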