thelovelab / tximport

Transcript quantification import for modular pipelines
134 stars 33 forks

Read in files in parallel #46

Closed grst closed 3 years ago

grst commented 3 years ago

I am working with a large smart-seq2 dataset with 23k samples (preprocessed with Salmon).

I have seen that tximport supports sparse matrices for single-cell data, which is nice, but even with readr it takes a very long time to parse the salmon output files.

I checked the code, and it seems that importing simply runs in a loop. Do you think this could easily be replaced with, e.g., mclapply?

Best, Gregor

mikelove commented 3 years ago

I think you can just as easily call tximport across cores, e.g. if you want to distribute to 100 cores, have each one read in ~230 samples. You'll have to think about how you want to store and access the data; a sparse matrix helps, but you will be pushing the limits of in-memory storage with 23k samples.
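A minimal sketch of the batched approach described above, using `parallel::mclapply`. The helper name, directory layout, batch count, and core count are all hypothetical; only the `tximport()` call itself reflects the package's actual API.

```r
library(parallel)

# Hypothetical helper: split a named vector of salmon quant.sf paths into
# randomized batches, run tximport on each batch on its own core, and
# column-bind the resulting count matrices.
import_in_batches <- function(files, n_batches = 100, cores = 10) {
  # randomized assignment of samples to batches
  batches <- split(files, sample(rep(seq_len(n_batches),
                                     length.out = length(files))))
  txi_list <- mclapply(batches, function(f) {
    tximport::tximport(f, type = "salmon", txOut = TRUE)
  }, mc.cores = cores)
  # combine the per-batch count matrices into one samples-wide matrix
  do.call(cbind, lapply(txi_list, `[[`, "counts"))
}

# usage (paths are placeholders):
# files <- list.files("salmon_out", "quant.sf",
#                     recursive = TRUE, full.names = TRUE)
# names(files) <- basename(dirname(files))
# counts <- import_in_batches(files)
```

Randomizing the sample-to-batch assignment matters for `countsFromAbundance = "lengthScaledTPM"` (discussed below), since each batch then sees a representative mix of samples.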

grst commented 3 years ago

We previously ran tximport on each sample individually and merged the results afterwards, but that led to issues with countsFromAbundance = "lengthScaledTPM".

If we used batches of say, 100, would that give reasonable lengthScaledTPM estimates?

mikelove commented 3 years ago

Yes, large randomized batches would be fine, I'd go with 100 or 200 if you can. lengthScaledTPM uses the mean effective transcript length across samples as part of its calculation.

Or you can just focus on fast import (without countsFromAbundance, without tx2gene) and then do the scaling yourself. It's a pretty simple calculation: multiply the TPM by the mean effective transcript length across samples (a per-feature multiplication), then scale each column (sample) to its original sequencing depth (the colSums of counts).
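The manual scaling described above can be sketched as follows. This is an illustration, not tximport's internal code; the function name is hypothetical, and `txi` is assumed to come from a plain `tximport(...)` call with the default `countsFromAbundance = "no"`.

```r
# Hypothetical helper implementing the lengthScaledTPM recipe:
# TPM times the mean effective transcript length across samples,
# then rescale each column to its original sequencing depth.
lengthScaledTPM <- function(abundance, length, counts) {
  # per-feature multiplication by the mean effective length
  new_counts <- abundance * rowMeans(length)
  # original sequencing depth per sample
  depth <- colSums(counts)
  # rescale each column so its sum matches the original depth
  sweep(new_counts, 2, depth / colSums(new_counts), `*`)
}

# usage, given a txi list with $abundance, $length, $counts matrices:
# scaled <- lengthScaledTPM(txi$abundance, txi$length, txi$counts)
```

By construction, the column sums of the result equal the column sums of the original counts, so each sample keeps its sequencing depth.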

grst commented 3 years ago

Great, thanks a lot for your input!