Closed · grst closed this issue 3 years ago
I think you can just as easily call tximport across cores, e.g. if you want to distribute over 100 cores, have each one read in ~230 samples. You'll have to think about how you want to store and access the data; a sparse matrix helps, but you will be pushing the limits of in-memory storage with 23k samples.
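A rough sketch of that batched, parallel import using `parallel::mclapply`. The file paths, sample names, and batch/core counts here are hypothetical placeholders, not from the original thread:

```r
library(parallel)
library(tximport)

# Hypothetical sample names and Salmon output layout; adjust to your data
samples <- paste0("sample", 1:23000)
files <- file.path("salmon_out", samples, "quant.sf")
names(files) <- samples

# Split the 23k files into batches of ~230 and import each batch on its own core
batches <- split(files, ceiling(seq_along(files) / 230))
txi_list <- mclapply(batches, function(f) {
  # txOut = TRUE and countsFromAbundance = "no" for fast transcript-level import;
  # summarization/scaling can be done afterwards
  tximport(f, type = "salmon", txOut = TRUE, countsFromAbundance = "no")
}, mc.cores = 100)

# Combine the per-batch count matrices column-wise (samples are columns)
counts <- do.call(cbind, lapply(txi_list, function(x) x$counts))
```

Note that `mclapply` forks the R process, so each worker holds its batch in memory until the results are collected; smaller batches trade collection overhead for a lower peak footprint.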
We previously ran tximport on all samples individually and merged the results afterwards, but that led to issues with countsFromAbundance = "lengthScaledTPM".
If we used batches of, say, 100, would that give reasonable lengthScaledTPM estimates?
Yes, large randomized batches would be fine; I'd go with 100 or 200 if you can. lengthScaledTPM uses the mean effective transcript length across samples as part of its calculation.
Or you can just focus on fast import (without countsFromAbundance, without tx2gene), and then do the scaling yourself. It's a pretty simple calculation. It's the TPM times the mean effective transcript length across samples (a per feature multiplication), then scale each column (sample) to its original sequencing depth (colSums of counts).
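The scaling described above can be written out in a few lines of base R. This is a self-contained toy sketch with made-up matrices standing in for Salmon's TPM, effective-length, and count output; it is not tximport's actual code:

```r
# Toy matrices (transcripts x samples); values are illustrative only
tpm    <- matrix(c(10, 90, 30, 70), nrow = 2,
                 dimnames = list(c("tx1", "tx2"), c("s1", "s2")))
efflen <- matrix(c(500, 1000, 520, 980), nrow = 2)
counts <- matrix(c(20, 180, 50, 110), nrow = 2)

# Mean effective transcript length across samples (per feature)
mean_len <- rowMeans(efflen)

# TPM times mean effective length: a per-feature multiplication,
# recycled down each column
scaled <- tpm * mean_len

# Rescale each column (sample) back to its original sequencing depth
new_counts <- t(t(scaled) * (colSums(counts) / colSums(scaled)))

colSums(new_counts)  # matches colSums(counts): 200, 160
```

Within each sample the counts are now proportional to TPM times the cross-sample mean length, while the library sizes are preserved.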
Great, thanks a lot for your input!
I am working with a large Smart-seq2 dataset with 23k samples (preprocessed with Salmon).
I have seen that tximport supports sparse matrices for single-cell data, which is nice, but even with readr it takes very long to parse the Salmon output files. I checked the code and it seems that importing simply runs in a loop. Do you think this could easily be replaced with e.g. mclapply?
Best, Gregor