tpq / propr

[OFFICIAL] [CoDA] An R package to calculate proportionality and other measures for compositional data
69 stars 10 forks source link

Problem using select? #66

Open audrey-bollas opened 21 hours ago

audrey-bollas commented 21 hours ago

Hello, thanks for the great tool. I am using it on a set of RNA-Seq data with ~55k genes (features). I would like to use the select argument to dynamically filter the features to reduce time/resources. Here is my command and an example of my data:

# Command
pr <- propr(
  counts,  # rows as samples, like it should be
  metric = "pcor.bshrink",  # partial correlation without shrinkage "pcor" is also available
  ivar = "clr",  # "clr" is recommended
  p = 100,  # used for permutation tests
  select = filtered_indices
) 

# Counts matrix
 > dim(counts)
[1]   294 55006
> class(counts)
[1] "matrix" "array" 

# Feature indices
> length(filtered_indices)
[1] 32304
> filtered_indices[1:5]
[1] 1 2 3 4 5

I am getting this error:

> pr <- propr(
+   counts,  # rows as samples, like it should be
+   metric = "pcor.bshrink",  # partial correlation without shrinkage "pcor" is also available
+   ivar = "clr",  # "clr" is recommended
+   p = 100,  # used for permutation tests
+   select = filtered_indices
+ ) 
Alert: Log-ratio transform will be handled by 'bShrink'.
Alert: replacing zeros with minimun value.
Alert: Skipping built-in log-ratio transformation.
Error in if (!is.na(select)) { : the condition has length > 1

I think it is expecting select to be a single element of length 1. Which is true if it is NA. But the if statement is not vectorized so it won't work with the added argument. The docs specify select should be "A numeric vector representing the indices of features to be used for computing the Propr matrix. This argument is optional. If provided, it reduces the data size by using only the selected features."

Is there a problem with the if statement or am I doing something wrong? Thanks so much!!


> R.version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          3.3                         
year           2024                        
month          02                          
day            29                          
svn rev        86002                       
language       R                           
version.string R version 4.3.3 (2024-02-29)
nickname       Angel Food Cake  

> packageVersion("propr")
[1] ‘5.1.5’
audrey-bollas commented 18 hours ago

I made a pull request to fix the if statement, in case this is the problem. I tested it on several examples. But I am not an expert in R by any means, so I would not consider it bullet proof.

Anyway, after looking at the code it seems like it filters the log counts matrix before doing any computation. Is this doing the same thing as just filtering the counts matrix before running the propr command? I assumed select was doing something described in the propr paper:

We begin by constructing the proportionality matrix using all 57,580 transcript counts, yielding an N2 matrix 24.7 Gb in size. To minimize the number of lowly expressed transcripts included in the final result, we subset the matrix to include only those transcripts with at least 10 counts in at least 10 samples. By removing the features at this stage, we can exploit a computational trick to calculate proportionality and filter simultaneously, reducing the required RAM to only 5 Gb without altering the resultant matrix. Next, in the absence of a hypothesis testing framework, we arbitrarily select those “highly proportional” transcripts with ρp > 0.95. We refer the reader to the supplementary vignette for a justification of this cutoff (S1 Appendix). When plotting the pairwise log-ratio transformed abundances for these “highly proportional” transcript pairs, a smear of straight diagonal lines confirms that the feature pairs indexed as proportional actually show proportional abundance

Particularly the bolded/italicized section. I would guess removing the genes before calculating proportions will change the results compared to if we remove them after. Can you offer any insight to this? And do you have any suggestions for speeding up the computation as mentioned in the paper, here?

Thanks!