Closed — gauravgadhvi closed this issue 4 months ago
Memory is not an issue; there's no need to specify anything for -m (in fact, I recommend that you don't; kb count usually takes only single-digit GBs of memory).
The issue is bustools count hanging (it has indeed entered a bad state). This can sometimes happen if your t2g.txt file is not formatted correctly (how did you create it / how was kb ref run?).
If you email me the output.unfiltered.bus from one of the failed runs along with the t2g.txt, matrix.ec, and transcripts.txt files, I can diagnose the issue and tell you exactly what's wrong (maybe the transcripts.txt is in a different order than the transcripts in the t2g.txt file? The fact that even the successful run says there are "duplicate entries" is cause for concern). Otherwise, re-examine how kb ref was run / how the index and t2g files were created.
I created the index by running kallisto index on a fasta file of custom sequences.
Since I didn't have a GTF file or coordinates for these sequences, I couldn't run kb ref. Upon further discussion, Sina suggested creating a dummy t2g.txt file with the name of each contig from the fasta file repeated in two columns. So I have been using that t2g.txt file along with the kallisto-created index to run kb count.
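For reference, a dummy two-column t2g.txt of this kind can be sketched from the FASTA headers; the filename custom_sequences.fa is a placeholder, not the actual file from this thread:

```shell
# Sketch, not the exact command used: map each contig to itself as
# its own "gene" by printing the FASTA header name (the text after
# '>' up to the first whitespace) twice, tab-separated.
awk '/^>/ {name = substr($1, 2); print name "\t" name}' custom_sequences.fa > t2g.txt
```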
I have emailed you the files you asked for along with the t2g.txt and the fasta file for reference. Thank you for helping with the diagnosis of the issue.
OK, your problem is indeed with the t2g.txt file (problems with this file are the only reason I have ever seen bustools count hang).
The following transcripts are duplicated in the first column of the t2g.txt file: U13, tRNA-Arg-CGA, tRNA-His-CAY, tRNA-Thr-ACG, tRNA-Thr-ACY
If you look in your transcripts.txt file, they actually have distinct names (i.e. the first U13 is named "U13", the second "U13" is named "U13_1", etc.).
You must be super careful and make absolutely sure that your t2g.txt and the generated transcripts.txt file are perfectly concordant. This means that your transcripts.txt file must be 100% identical to the first column of the t2g.txt file (exact same order, exact same names).
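A quick sanity check along these lines (hypothetical sketch, run from the directory containing both files) could be:

```shell
# Hedged check: the first column of t2g.txt should match
# transcripts.txt line-for-line (same names, same order).
cut -f1 t2g.txt > t2g_names.txt
if diff -q t2g_names.txt transcripts.txt > /dev/null; then
    echo "concordant"
else
    echo "MISMATCH between t2g.txt and transcripts.txt"
fi
```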
Ideally, I'd have the program print out an error (or at least a warning message) when such a mismatch exists, but I never got around to doing so.
That is really helpful to know about the exact matching required between the t2g.txt and the transcripts.txt. I saw a warning message showing the de-duplication of conflicting transcript IDs and assumed it would be implicitly taken care of in the background. I will create a new index with the updated contig names and retry processing my data.
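A hedged sketch of that renaming step (mirroring the U13 -> U13_1 convention mentioned above; note it drops any description after a renamed contig name and assumes no pre-existing _N suffixes collide; custom_sequences.fa is a placeholder filename):

```shell
# Append _1, _2, ... to the second and later occurrences of a
# repeated FASTA header name; sequence lines pass through untouched.
awk '/^>/ {
    name = substr($1, 2)
    if (seen[name]++) $0 = ">" name "_" (seen[name] - 1)
} {print}' custom_sequences.fa > renamed.fa
```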
Thank you so much for your help and suggestion! :)
That worked like magic, with memory less than 10GB required even for larger samples! Thank you again!
I am trying to quantify RNA reads from SlideSeqV2 data using a custom technology string (-x) and a kallisto index (version 0.50.1) with the help of the kb count command (kb-python 0.28.2). The total size of the raw FASTQ files for each of my samples ranges from 70GB to 150GB.
For a few samples, I was able to get results from kb count, whereas for the majority of samples it either runs out of memory or hangs at the bustools count step and never finishes. For these samples, kb ran for 3-4 days and still didn't move past the bustools count step.
Initially, I reckoned it was a memory issue and reduced the -m flag to half of the requested cluster memory, so as to leave the kallisto and bustools binaries enough memory, but no luck there either. Reducing the memory limit for kb did eliminate the bad_alloc errors, although it still won't move past the bustools count step.
Here is an example of the command run; the verbose output log follows the command:
kb count --h5ad --verbose --strand unstranded -m 180G \
  -i ./MUS_TEFamily_kb28_index.idx \
  -g ./MUS_TEfam_custom_t2g.txt \
  -x 0,0,8,0,26,32:0,32,41:1,0,60 \
  -t 10 \
  -w ./cleanBarcodes_whitelist.txt \
  -o ./ \
  ./210817_Puck_210817_11.L001/210817_Puck_210817_11.L001.R1.fastq.gz ./210817_Puck_210817_11.L001/210817_Puck_210817_11.L001.R2.fastq.gz \
  ./210817_Puck_210817_11.L002/210817_Puck_210817_11.L002.R1.fastq.gz ./210817_Puck_210817_11.L002/210817_Puck_210817_11.L002.R2.fastq.gz \
  ./210817_Puck_210817_11.L003/210817_Puck_210817_11.L003.R1.fastq.gz ./210817_Puck_210817_11.L003/210817_Puck_210817_11.L003.R2.fastq.gz \
  ./210817_Puck_210817_11.L004/210817_Puck_210817_11.L004.R1.fastq.gz ./210817_Puck_210817_11.L004/210817_Puck_210817_11.L004.R2.fastq.gz
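For reference, the custom -x string above follows kallisto's barcode:UMI:sequence format, where each colon-separated part is one or more file,start,end triplets (0-based, end-exclusive). The sketch below is my reading of the string, not official tool output:

```shell
# Split the technology string into its three colon-separated parts.
TECH="0,0,8,0,26,32:0,32,41:1,0,60"
IFS=':' read -r BC UMI SEQ <<EOF
$TECH
EOF
echo "barcode:  $BC"   # R1 bases 0-8 plus R1 bases 26-32 (split bead barcode)
echo "UMI:      $UMI"  # R1 bases 32-41 (9 bp UMI)
echo "sequence: $SEQ"  # R2 bases 0-60 (cDNA read)
```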
Output Log:
Here is an output log for one of the samples for which kb count ran successfully without getting stuck, so we know the command works for this dataset:
Do you have any suggestions for why bustools count takes so long or if it might've entered a bad state?
I am not sure which part of the kb count pipeline demands the most memory, or what the suggested memory requirement is for large FASTQ samples (such as 150GB). Is there any way to speed up the processing by using more threads or processors, or is it dependent on the total RAM allocated to the job?
Thank you in advance for helping me figure this out so we can optimize our kb-python jobs and runtime!
-Gaurav