Closed jeremymsimon closed 4 years ago
Hi Jeremy,
Yes, I do have an update coming that will speed up the ODT/random hexamer collapse. I will try to finish validating that update and get it out the door in the next day or two.
I don't have a solution for making the initial demultiplexing faster at the moment...
Did the zUMIs solution work for you? I haven't tried their pipeline for SPLiT-Seq.
Also, the number of cores and the amount of memory you allocate will dramatically impact runtime. If you are working on a cluster with plenty of resources, you will get the fastest runtimes.
Thanks!
On Wed, Jan 15, 2020 at 1:54 PM jeremymsimon notifications@github.com wrote:
Hi- We finally generated some full-scale SPLiT-seq data, and the runtime for this tool is quite long. We have ~850M reads; the initial demultiplexing step and the ODT/random hexamer collapsing step each take days to complete.
Are there any performance improvements you might be able to make so this tool scales better with input size? zUMIs, by comparison, can do most of this in hours or less...
Thanks!
Thanks, Paul. I allocated 100 GB for the initial demultiplexing step, but it didn't seem like that step supported threading, so I only opted for one core. Is there an option I missed for using multiple cores?
Would love to apply the update to your collapse step once that's ready!
Edit: I should note that I am running each step independently, i.e. python demultiplex_using_barcodes.py..., so while I see that -n is an option for the bash wrapper, it wasn't clear that option was used in this first step. But maybe I missed it.
Hey Jeremy,
You are right, step 1 is not set up for multithreading, but many of the other steps are. I will take another look at step 1 to see if I can think of something to speed it up.
I just pushed the update to dramatically increase the speed of the ODT/random hexamer collapse.
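For context, in SPLiT-seq each round-1 well contains both an oligo-dT and a random hexamer primer with distinct barcodes, so the collapse merges reads whose round-1 barcodes belong to the same well. A rough Python sketch of that merge (the well table and per-cell file layout here are made up for illustration, not the script's actual code):

```python
import os

# Made-up pairing of each random hexamer round-1 barcode with the oligo-dT
# barcode from the same physical well; the real pipeline ships its own lists.
HEX_TO_ODT = {
    "AACGTGAT": "TTGCAACT",  # "well 1" (sequences invented for illustration)
    "AAACATCG": "TGGTCCTT",  # "well 2"
}

def collapse_cells(results_dir):
    """Append each random hexamer cell file onto its oligo-dT partner file.

    Assumes one FASTQ per cell barcode named <barcode>.fastq; in the real
    pipeline the round-1 barcode is only one segment of the full cell barcode.
    """
    for hex_bc, odt_bc in HEX_TO_ODT.items():
        hex_path = os.path.join(results_dir, hex_bc + ".fastq")
        odt_path = os.path.join(results_dir, odt_bc + ".fastq")
        if not os.path.exists(hex_path):
            continue
        with open(hex_path) as src, open(odt_path, "a") as dst:
            for line in src:   # stream-append; no need to hold files in memory
                dst.write(line)
        os.remove(hex_path)     # the hexamer copy is now redundant
```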
Hopefully that will be helpful.
Thanks,
Jeremy,
Another thing you can do to dramatically increase the speed of all steps is to raise the threshold for the minimum number of reads per cell from the default of 10 to 500 or 1000.
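The reason this helps so much is that the vast majority of detected barcodes are low-abundance noise, and each one otherwise gets its own per-cell file; counting first and only keeping barcodes above the cutoff avoids almost all of that file churn. A rough sketch of the idea (illustrative Python, not the pipeline's actual code):

```python
from collections import Counter

MIN_READS = 500  # raised from the default of 10

def count_barcodes(barcode_calls):
    """barcode_calls: iterable of combined cell barcodes, one per read."""
    return Counter(barcode_calls)

def keep_barcodes(counts, min_reads=MIN_READS):
    """Return only the barcodes worth writing per-cell files for."""
    return {bc for bc, n in counts.items() if n >= min_reads}

# Toy example: three noisy barcodes and one real cell.
counts = count_barcodes(["AAA", "AAA", "CCC", "GGG"] + ["TTT"] * 600)
print(keep_barcodes(counts))  # {'TTT'}
```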
Hi Jeremy,
Charlie Whitmore found some easy-to-implement speed savings in step 1. They should translate to a ~3x speed increase if you are using an error threshold of 1 and a ~5x increase if you are using an error threshold of 2.
We are starting to think through how to multithread step 1 and how to limit the creation (and removal) of files for barcodes that fall below the minimum-reads threshold, so hopefully there will be more speed increases for step 1 in the future.
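If it is useful as a starting point, one simple way to parallelize step 1 without touching the matching logic would be to split the FASTQ into chunks and hand the chunks to worker processes, merging the per-barcode results at the end. A minimal sketch along those lines (the barcode list, the find_barcode stand-in, and the input file name are placeholders, not the pipeline's actual code):

```python
from collections import Counter
from itertools import islice
from multiprocessing import Pool

ROUND1_BARCODES = {"AACGTGAT", "AAACATCG"}  # toy set; the real lists are much larger
BARCODE_LEN = 8

def find_barcode(seq):
    """Toy stand-in for the real barcode search: exact match at the read start."""
    candidate = seq[:BARCODE_LEN]
    return candidate if candidate in ROUND1_BARCODES else None

def read_chunks(fastq_path, reads_per_chunk=1_000_000):
    """Yield lists of (header, seq, plus, qual) records from a FASTQ file."""
    with open(fastq_path) as fh:
        while True:
            lines = list(islice(fh, reads_per_chunk * 4))
            if not lines:
                break
            yield [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]

def demultiplex_chunk(records):
    """Count reads per barcode in one chunk (stand-in for the per-read work)."""
    counts = Counter()
    for header, seq, plus, qual in records:
        barcode = find_barcode(seq.strip())
        if barcode is not None:
            counts[barcode] += 1
    return counts

def demultiplex_parallel(fastq_path, n_workers=8):
    """Fan chunks out to worker processes and merge their per-barcode counts."""
    totals = Counter()
    with Pool(n_workers) as pool:
        for chunk_counts in pool.imap_unordered(demultiplex_chunk, read_chunks(fastq_path)):
            totals.update(chunk_counts)
    return totals

if __name__ == "__main__":
    print(demultiplex_parallel("R1.fastq"))  # placeholder input file
```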
Awesome! For what it's worth, the new collapse script you committed recently is MUCH faster: the old version was terminated after 3.5 days of runtime, while the new version finished in under an hour. That is a significant improvement!
I will switch to the new step1 version for now and hope for further improvements/threading in the future!
Thanks!
Hi Paul,
Thank you for adding me as a contributor to your SPLiT-seq pipeline on GitHub. I really appreciate it.
I am also working with a new SPLiT-seq dataset from our lab now, and hopefully I will make more improvements to this pipeline. Thank you again, and I also wish you a belated happy new year.
Thank you, with best wishes, Dipankar
Hi Dipankar,
Thanks for your speed improvements to the ODT/RanHex collapse step! Please let me know if you have any ideas for improving step 1.
As you and Jeremy are working with large-scale datasets, I am curious how well the downstream single-cell analysis is working out for you both.
Thanks again for all your contributions! This tool keeps improving because of users like you guys.
Hello,
The 3x to 5x speed increase I had put in was targeted specifically at the time spent searching for the barcodes, which turned out to be only half the picture. I just pushed another small update to speed up the other half a bit (saving the data for future steps).
I also added a few fields for "barcode index constraints" at the top of the demultiplex file, under a section called Advanced Configuration. A good chunk of time is most likely spent unnecessarily searching the spacers between the barcodes, which could be a big deal at scale. If the data is consistent enough, you could use these fields to limit where in each read we actually search for each barcode and speed things up even more. It is less robust and a little annoying to configure, but it could help with speed significantly.
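Concretely, the constraints let the search scan only the window where each barcode is expected to sit instead of the whole read. Roughly like this (the window positions and barcode list below are invented for illustration, not the actual SPLiT-seq read layout or the script's code):

```python
# Hypothetical window: the round-1 barcode should start somewhere between
# positions 78 and 86 of the read if the library structure is consistent.
ROUND1_WINDOW = (78, 86)
ROUND1_BARCODES = {"AACGTGAT", "AAACATCG"}  # toy set
BARCODE_LEN = 8

def find_in_window(seq, window, barcodes, bc_len=BARCODE_LEN):
    """Search for a known barcode only inside the configured window."""
    start, end = window
    for pos in range(start, min(end, len(seq) - bc_len) + 1):
        candidate = seq[pos:pos + bc_len]
        if candidate in barcodes:
            return candidate, pos
    return None, None

# An unconstrained search would instead scan range(0, len(seq) - bc_len + 1),
# which is what makes it so much slower on long reads.
```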
It might also be worth trying this step with a smaller memory threshold; depending on the machine, setting it too high could have an adverse effect.
Hi Paul, I am also thinking about multithreading for step 1. I will develop and test my ideas soon, but right now I am busy analyzing my lab's new SPLiT-seq data. I will let you know the details of the analysis once I finish.
Thank you
Hi @paulranum11, I don't yet have specific answers to your earlier questions, but for this run I am seeing that a lot of cells have extremely low depth. Once I get the data into Seurat, I start with ~315K cells, and any nominal filter (nCount > 50 & nFeature > 50) drops that to just ~25K cells. I need to trace through my steps and make sure I'm not introducing any issues during or after processing with this pipeline, but it does concern me that there may be several "cells" per actual cell, i.e. some barcodes that needed to be collapsed weren't.
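For concreteness, the filter is just per-cell total counts and detected genes against fixed cutoffs, roughly equivalent to the following (a toy genes-by-cells matrix in Python, not my actual Seurat code):

```python
import numpy as np
from scipy import sparse

# Toy genes x cells count matrix; the real one is ~20k genes x ~315K cells.
counts = sparse.random(2000, 5000, density=0.05, format="csc", random_state=0)
counts.data = np.ceil(counts.data * 10)

n_count = np.asarray(counts.sum(axis=0)).ravel()           # total counts per cell (nCount)
n_feature = np.asarray((counts > 0).sum(axis=0)).ravel()   # genes detected per cell (nFeature)

keep = (n_count > 50) & (n_feature > 50)
print(f"{keep.sum()} of {counts.shape[1]} cells pass nCount > 50 and nFeature > 50")
```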
I will keep digging into this more, and will probably run zUMIs on the side to see how it all compares, but if you have any suggestions on metrics to check along the way, I'd be happy to look.
Hi, here are some details from the analysis of our sequences (August 2019), using STAR and a GTF file.
We had a total of 121,609,888 reads (human and mouse).
Total reads after the first filter (removing reads that are neither human nor mouse): 107,144,255
Total alignments to the genomes: 50,829,362
Assigned reads (unique hits): 18,666,875
Unassigned_NoFeatures: 18,237,207
Unassigned_Ambiguity: 1,576,581
Reads counted after UMI deduplication: 13,979,757
We found a total of 46,328 cells (with a threshold of 10 reads per cell).
For different nFeature thresholds, the cell counts are:
nFeature/cell    total cells
200              600
100              1675
50               2913
40               3268
30               3738
25               4096
10               8635
Our biologists said they put around 6000 cells into the sequencer and were expecting 3 populations of cells. We did find roughly 3 populations of cells in the UMAP after the statistical analysis for each of the nFeature thresholds 25, 30, and 40. For a test run this was quite encouraging. Now I am working with our new sequences.
I hope this is helpful for you guys. With good wishes, Dipankar
PS: @jeremymsimon, did you ask the biologists how many cells they put in the sequencer?
@dumaatravaie I have been told it was on the order of 50K cells, so actually not as far off as I thought. I will see what zUMIs converges on and how much it differs.
Hi @jeremymsimon, one more thing you can do before starting the SPLiT-seq pipeline: BLAST (or otherwise align) the sequences in the R1.fastq file against the reference genome the reads should come from, and then filter out the reads with no hit from both the forward and reverse FASTQ files. Running the pipeline on the filtered reads may reduce the number of cells. A rough sketch of that read-ID filtering step follows below.
The other thing is that the biologists cannot expect every cell to be amplified perfectly during PCR, because of sequencing errors or other sources of noise, so the number of cells loaded for sequencing and the number of cells retained after the statistical analysis can be completely different. The important step in the statistical analysis is choosing the parameters for filtering out unwanted cells. In many published single-cell papers the nFeature cutoff varies between 100 and 200; in Seurat the default is 200. From the original SPLiT-seq paper it is not clear to me exactly what filtering parameters they used, and I am trying to understand this better. I hope this is helpful for you.
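Here is that pre-filter sketched in Python, assuming you already have the IDs of the R1 reads that hit the reference (the file names are placeholders):

```python
def load_mapped_ids(path):
    """Read IDs (one per line) of the R1 reads that aligned to the reference."""
    with open(path) as fh:
        return {line.strip() for line in fh}

def filter_fastq(in_path, out_path, keep_ids):
    """Write only the FASTQ records whose read ID is in keep_ids."""
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:
                break
            read_id = record[0].split()[0].lstrip("@")
            if read_id in keep_ids:
                dst.writelines(record)

if __name__ == "__main__":
    keep = load_mapped_ids("mapped_read_ids.txt")         # placeholder list of aligned IDs
    filter_fastq("R1.fastq", "R1.filtered.fastq", keep)   # apply the same ID set to both mates
    filter_fastq("R2.fastq", "R2.filtered.fastq", keep)
```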
with best wishes Dipankar