pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

lr-kallisto quant-tcc seg fault with bulk ONT #463

Open sbresnahan opened 2 months ago

sbresnahan commented 2 months ago

Version: kallisto 0.51.1

I'm following the workflow outlined in issue #456 for using lr-kallisto with bulk ONT reads. The kallisto bus, bustools sort, and bustools count steps complete without errors. However, the kallisto quant-tcc step is killed by LSF with 554689 Segmentation fault shortly after printing processing sample/cell N.

I'm using a kallisto index with kmer-length=63 built from transcripts extracted from GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta and the GENCODE v45 GTF using gffread. An index built from the same transcripts with kmer-length=31 has no issues with kallisto quant on short reads.

bound-to-love commented 2 months ago

Hi Sean, since you are processing bulk data, it should only print processing sample/cell 0; is this the case? Can you please post the full output?
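The sample/cell count that quant-tcc iterates over comes from the rows of the count matrix. A quick way to check it is to read the size line of the .mtx file; this sketch builds a throwaway MatrixMarket header (matching the 72 x 327,292 dimensions reported later in the thread) rather than assuming any real file path:

```shell
# Sketch: the first non-comment line of a MatrixMarket .mtx file is
# "rows cols nnz"; rows is what quant-tcc reports as the sample/cell count.
printf '%%%%MatrixMarket matrix coordinate real general\n72 327292 1000\n' > /tmp/count.mtx
awk '!/^%/ {print $1; exit}' /tmp/count.mtx   # prints 72
```

A bulk run with a single sample should show 1 in the first field; anything larger means quant-tcc will loop past sample/cell 0.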

sbresnahan commented 2 months ago

If I run with --threads=1, it is indeed only processing sample/cell 0 before the seg fault:

[index] k-mer length: 63
[index] number of targets: 252,723
[index] number of k-mers: 157,178,936
[index] number of equivalence classes loaded from file: 327,292
[tcc] Parsing transcript-compatibility counts (TCC) file as a matrix file
[tcc] Matrix dimensions: 72 x 327,292
[quant] Running EM algorithm...
[   em] reading priors from file ONT
[quant] Processing sample/cell 0
/home/stbresnahan/.lsbatch/1727389319.16590285.shell: line 39: 55903 Segmentation fault     (core dumped) kallisto quant-tcc -t 1 --long -p ONT -f ${DIR_OUT}/flens.txt -i kallisto_index/gencode_v45 -e ${DIR_OUT}/count.ec.txt -o ${DIR_OUT}/quant-tcc ${DIR_OUT}/count.mtx

However, if I set --threads to anything other than 1 (in this case, 12), it processes multiple samples/cells, with interleaved output, before the seg fault:

[index] k-mer length: 63
[index] number of targets: 252,723
[index] number of k-mers: 157,178,936
[index] number of equivalence classes loaded from file: 327,292
[tcc] Parsing transcript-compatibility counts (TCC) file as a matrix file
[tcc] Matrix dimensions: 72 x 327,292
[quant] Running EM algorithm...
[   em] reading priors from file ONT
[quant] Processing sample/cell 0quant] Processing sample/cell [quant] Processing sample/cell 2[quant] Processing sample/cell [quant] Processing sample/cell quant] Processing sample/cell 5
[quant] Processing sample/cell 3[quant] Processing sample/cell [quant] Processing sample/cell 6
[quant] Processing sample/cell 4
[quant] Processing sample/cell 77

[quant] Processing sample/cell 88
[[[

quant] Processing sample/cell 11
[quant] Processing sample/cell 9quant] Processing sample/cell [quant] Processing sample/cell 11uant] Processing sample/cell [quant] Processing sample/cell [quant] Processing sample/cell 1
0
0

/home/stbresnahan/.lsbatch/1727384386.16588742.shell: line 38: 3476442 Segmentation fault     (core dumped) kallisto quant-tcc -t 12 --long -p ONT -f ${DIR_OUT}/flens.txt -i kallisto_index/gencode_v45 -e ${DIR_OUT}/count.ec.txt -o ${DIR_OUT}/quant-tcc ${DIR_OUT}/count.mtx

This occurs regardless of whether I start the process with a single .fastq or multiple .fastq files.

Yenaled commented 1 month ago

@sbresnahan can you post the exact commands you’re running?

And can you try the official binaries on the Releases page to make sure it’s not a compilation error?

kallisto_LongKmer_NoOpt-v0.51.1.tar.gz

sbresnahan commented 1 month ago

> @sbresnahan can you post the exact commands you’re running?
>
> And can you try the official binaries on the Releases page to make sure it’s not a compilation error?
>
> kallisto_LongKmer_NoOpt-v0.51.1.tar.gz

Building transcriptome index:

gffread -F -w GCA_000001405.15_GRCh38_no_alt_analysis_set_gencode_v45.fasta \
   -g GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
   gencode.v45.annotation.gtf

kallisto index -k 63 -t 10 -i gencode_v45 GCA_000001405.15_GRCh38_no_alt_analysis_set_gencode_v45.fasta

Running lr-kallisto:

kallisto bus -t 8 --long --threshold 0.8 -x bulk -i gencode_v45 \
  -o kallisto_out fullLength.and.rescued.fastq 

bustools sort -t 8 kallisto_out/output.bus \
 -o kallisto_out/sorted.bus

bustools count kallisto_out/sorted.bus \
 -t kallisto_out/transcripts.txt \
 -e kallisto_out/matrix.ec \
 -g kallisto_out/gencode_v45_tx2g.tsv \
 -o kallisto_out/count --cm -m

kallisto quant-tcc -t 8 \
    --long -p ONT -f kallisto_out/flens.txt \
    -i kallisto_index/gencode_v45 \
    -e kallisto_out/count.ec.txt \
    -o kallisto_out/quant-tcc \
    --matrix-to-files \
    kallisto_out/count.mtx

I will try the linked binary and get back to you.

MustafaElshani commented 3 weeks ago

I get a similar error (line 76: 5394 Segmentation fault). I have tried both compiling it myself and using the binary linked by @Yenaled.

[index] k-mer length: 63
[index] number of targets: 385,659
[index] number of k-mers: 186,649,435
[index] number of equivalence classes loaded from file: 193,836
[tcc] Parsing transcript-compatibility counts (TCC) file as a matrix file
[tcc] Matrix dimensions: 1 x 193,836
[quant] Running EM algorithm...
[   em] reading priors from file ONT
[quant] Processing sample/cell 0
/var/spool/slurm/job23490963/slurm_script: line 76:  5394 Segmentation fault      (core dumped) $SCRATCH/bioinformatic_tools/kallisto/kallisto/kallisto_linux-v0.51.1_kmer64 quant-tcc --long -p ONT -t $SLURM_CPUS_PER_TASK -i "$INDEX_PATH" -o "$OUTPUT_DIR/$SAMPLE_NAME" --matrix-to-files -f "$OUTPUT_DIR/$SAMPLE_NAME/flens.txt" -e "$OUTPUT_DIR/$SAMPLE_NAME/count.ec.txt" "$OUTPUT_DIR/$SAMPLE_NAME/count.mtx"
Is this mainly an issue with `v0.51.1`?

Yenaled commented 3 weeks ago

Very strange — quant-tcc seems to have issues with the input files supplied. If you are able to upload the files somewhere (the files supplied to quant-tcc) and email them to me, I can help debug.

MustafaElshani commented 3 weeks ago

Sorted it! It was the `-p` flag. I was reading https://pachterlab.github.io/kallisto/manual and took `-p` to be the platform option, when actually it's `-P` for platform.
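For anyone hitting the same segfault: per the exchange above, `-p` expects a priors *file* (hence the earlier log line `reading priors from file ONT`), while `-P` selects the platform. A corrected invocation, sketched against the hypothetical paths from the commands posted earlier in this thread:

```shell
# Sketch only: -P ONT sets the platform; `-p ONT` would make quant-tcc
# try to read a priors file literally named "ONT" and crash.
kallisto quant-tcc -t 8 \
    --long -P ONT -f kallisto_out/flens.txt \
    -i kallisto_index/gencode_v45 \
    -e kallisto_out/count.ec.txt \
    -o kallisto_out/quant-tcc \
    --matrix-to-files \
    kallisto_out/count.mtx
```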

Yenaled commented 3 weeks ago

Oh good catch! And yay!