Open medmaca opened 1 month ago
Hey, thank you so much! We usually only run the pipeline with 20-40 hybridomas, each of which has ~100k reads, so I never noticed how horribly this step scales as the number of reads increases. I really appreciate you taking the time to share your improved code and have now incorporated it into the pipeline.
Also, just on the topic of performance, I think one other place where things can be optimised is the consensus calling step. Back in 2021 when we developed this pipeline, the combination of racon + medaka was required to generate good consensus sequences, especially when there were only like 5-20 reads per transcript. But now that we have kit 14, dorado v5 basecalling models etc I think it's a bit of overkill. I've had good results in some preliminary tests (100% accuracy achieved with 3-5 reads in most cases) just using abPOA which is super fast. When I have some time in the next few months I'm planning on doing some formal testing of this and will probably ultimately replace the racon + medaka consensus with abPOA. Just wanted to let you know in case that might be helpful for you!
@medmaca @kzeglinski Hi guys, I am trying out the nabseq workflow and implementing it in my lab. I am not a programmer or bioinformatician. I have a Dell Precision tower workstation with an RTX 4060. What changes do I need to make for it to work on my computer? Also, do you have any suggestions regarding the cDNA sequencing kit that comes with the MinION starter pack? I will use this kit for the first sequencing experiment.
Glad it was useful, and thank you for the heads up regarding abPOA.
Hi, firstly thank you for creating the nabseq_nf pipeline.
I've been running the nabseq pipeline on Seqera/Tower and noticed that the biggest bottleneck is the subset_aligned_reads step, which for larger libraries was taking many days to complete. I have a potential workaround, attached here for your consideration, which I think significantly speeds up the pipeline and which I believe generates the same output. It involves 3 main changes.
Modifying the reference files to remove spaces in headers (replace with @ symbols):
modified_references.zip
This is required so that the SAM files produced by minimap2 don't have duplicate sequence names.
Modifying minimap2.nf
minimap2.zip
Changing the output of minimap2 from PAF to SAM, retaining only the best matches. This relies on the modified reference files, which have no spaces in their headers, so that the SAM files are in the correct format for samtools in the next step.
Modifying select_ab_reads.nf
select_ab_reads.zip
Modified to use samtools to convert from SAM to fastq format.
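For anyone reading without the attached archives, the first and third changes can be illustrated with a minimal Python sketch. This is not the code in the zip files: the function names `sanitize_fasta_headers` and `sam_to_fastq` are hypothetical, and the pure-Python SAM filter only stands in for what samtools does in the actual modification. It assumes that minimap2 takes the reference name from the header text up to the first whitespace (which is why headers that differ only after a space collide), and that "best matches" means mapped, primary alignments.

```python
# Hypothetical helpers illustrating the header fix and the SAM -> FASTQ step.

def sanitize_fasta_headers(lines, replacement="@"):
    """Replace spaces in FASTA header lines; sequence lines are left as-is."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            line = line.replace(" ", replacement)
        yield line


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# SAM flag bits: 0x4 unmapped, 0x100 secondary, 0x800 supplementary.
SKIP_FLAGS = 0x4 | 0x100 | 0x800


def sam_to_fastq(sam_lines):
    """Yield FASTQ records for mapped, primary alignments only."""
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line (@HD, @SQ, @PG, ...)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & SKIP_FLAGS:
            continue
        if flag & 0x10:
            # Reverse-strand alignments store the reverse complement,
            # so restore the original read orientation.
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        yield f"@{name}\n{seq}\n+\n{qual}\n"
```

Streaming line by line keeps memory use flat regardless of library size, which is the property that matters for the large libraries described above.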
I hope this is helpful.