Seg fault on kallisto pseudo only when supplying fastq pairs in batch file

cypranowska commented 5 years ago

I'm getting a segmentation fault when I run pseudo with the -b switch, and I'm not sure why. For example, when I run kallisto pseudo -i ../ref/drosophila_melanogaster/transcriptome.idx -o ../test -b ../ref/test_batch.txt, I get the following output:

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 30,839
[index] number of k-mers: 33,250,495
[index] number of equivalence classes: 58,781
[quant] running in paired-end mode
[quant] will process pair 1: ../../fastq_trim/0212_ok6_pop_2_S99_L001_R1_001.trim.fastq.gz
                             ../../fastq_trim/0212_ok6_pop_2_S99_L001_R2_001.trim.fastq.gz
[quant] will process pair 1: ../../fastq_trim/0302_p2-d_S8_L001_R1_001.trim.fastq.gz
                             ../../fastq_trim/0302_p2-d_S8_L001_R2_001.trim.fastq.gz
[quant] will process pair 1: ../../fastq_trim/0302_p2-g_S9_L001_R1_001.trim.fastq.gz
                             ../../fastq_trim/0302_p2-g_S9_L001_R2_001.trim.fastq.gz
[quant] will process pair 1: ../../fastq_trim/0507_p1-a3_S10_L001_R1_001.trim.fastq.gz
                             ../../fastq_trim/0507_p1-a3_S10_L001_R2_001.trim.fastq.gz
[quant] will process pair 1: ../../fastq_trim/0507_p1-a8_S11_L001_R1_001.trim.fastq.gz
                             ../../fastq_trim/0507_p1-a8_S11_L001_R2_001.trim.fastq.gz
[quant] will process pair 1: ../../fastq_trim/0507_p1-b4_S12_L001_R1_001.trim.fastq.gz
                             ../../fastq_trim/0507_p1-b4_S12_L001_R2_001.trim.fastq.gz
[quant] will process pair 1: ../../fastq_trim/0507_p1-b8_S13_L001_R1_001.trim.fastq.gz
                             ../../fastq_trim/0507_p1-b8_S13_L001_R2_001.trim.fastq.gz
[quant] will process pair 1: ../../fastq_trim/0507_p2-a3_S19_L001_R1_001.trim.fastq.gz
                             ../../fastq_trim/0507_p2-a3_S19_L001_R2_001.trim.fastq.gz
[quant] will process pair 1: ../../fastq_trim/0507_p2-a6_S33_L001_R1_001.trim.fastq.gz
                             ../../fastq_trim/0507_p2-a6_S33_L001_R2_001.trim.fastq.gz
[quant] will process pair 1: ../../fastq_trim/0507_p2-b2_S16_L001_R1_001.trim.fastq.gz
                             ../../fastq_trim/0507_p2-b2_S16_L001_R2_001.trim.fastq.gz
[quant] finding pseudoalignments for all files ...Segmentation fault

But if I provide the first pair of .fastq files on the command line instead of in the batch file, I get the expected output. I'd like to avoid looping through all of my samples in my batch script if I can. I'm using version 0.46.0.

pmelsted commented 5 years ago

Can you show the contents of the batch file? How soon does this happen in the process, right after starting the program or after some time? Does this still happen if you use only a part of the batch file, say the first two lines?
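Before sharing the batch file, it may help to sanity-check it. The sketch below assumes each non-comment line has the form "id read1 read2" (the layout of the batch1.txt shown later in this thread); check_batch is a hypothetical helper, not part of kallisto, and it resolves paths relative to the current directory.

```shell
# Hypothetical helper, not part of kallisto.
# Assumes each non-blank, non-comment line is: <id> <read1> <read2>
check_batch() {
    batch_file="$1"
    status=0
    lineno=0
    # The "|| [ -n "$id" ]" part handles a last line without a newline.
    while read -r id r1 r2 extra || [ -n "$id" ]; do
        lineno=$((lineno + 1))
        case "$id" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        if [ -z "$r2" ] || [ -n "$extra" ]; then
            echo "line $lineno: expected 3 fields (id read1 read2)" >&2
            status=1
            continue
        fi
        for f in "$r1" "$r2"; do
            # -s is true only for files that exist and are non-empty.
            if [ ! -s "$f" ]; then
                echo "line $lineno: missing or empty file: $f" >&2
                status=1
            fi
        done
    done < "$batch_file"
    return "$status"
}
```

To test a two-line subset, something like head -n 2 ../ref/test_batch.txt > first_two.txt followed by check_batch first_two.txt should work. A clean result only rules out malformed lines and missing files; it does not explain the segfault.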

cypranowska commented 5 years ago

I tried again with just the first two lines, and I get the same error. The error usually happens ~5 seconds after starting the program. I've attached my batch file here. I'm doing this on a computing cluster, and not on a Linux box, if that info helps at all.

test_batch.txt

hmassalha commented 5 years ago

Dear @pmelsted,

I have 6 samples, each with a different number of cells. Can I add all R1 and R2 files in a batch for the kallisto bus command? Will they be counted as separate inputs? For the next command, bustools correct, I will prepare a whitelist with all possible combinations of cell barcodes.

Thanks, HM

pmelsted commented 5 years ago

@cypranowska Can you check whether any of the reads have zero length? This can happen with trimming, and it trips kallisto up.

You can use the following awk command to count the number of blank lines in your fastq files:

zcat file.fastq.gz | awk '/^$/ {x+=1} END {print x+0}'
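Counting blank lines anywhere in the file is a quick proxy. A slightly stricter variant looks only at the sequence line of each record. This is a sketch that assumes standard 4-line FASTQ records in a gzip-compressed file; count_empty_reads is a hypothetical helper name.

```shell
# Hypothetical helper: count FASTQ records whose sequence line is empty.
# Assumes 4 lines per record, so the sequence is line 2, 6, 10, ...
count_empty_reads() {
    gzip -dc "$1" |
        awk 'NR % 4 == 2 && length($0) == 0 {n++} END {print n+0}'
}
```

Running count_empty_reads on each R1 and R2 file should print 0 for every file if trimming left no zero-length reads.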

pmelsted commented 5 years ago

@hmassalha pseudo is not compatible with bus. Please use the mailing list for information on how to process the data.

cypranowska commented 5 years ago

@pmelsted I looked at my .fastq.gz files and there are no zero-length reads in any of them.

tagtag commented 5 years ago

I have just run into what seems to be the same problem.

% ~/kallisto-0.46.0/build/src/kallisto pseudo -i ~/kallisto-0.46.0/mm10/transcripts.idx -o output -b batch1.txt

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 41,604
[index] number of k-mers: 66,826,372
[index] number of equivalence classes: 99,573
[quant] running in paired-end mode
[quant] will process pair 1: ./R1/E10_2_1_R1.gz
                             ./R2/E10_2_1_R2.gz
[quant] will process pair 1: ./R1/E10_2_10_R1.gz
                             ./R2/E10_2_10_R2.gz
[quant] will process pair 1: ./R1/E10_2_11_R1.gz
                             ./R2/E10_2_11_R2.gz
[quant] finding pseudoalignments for all files ...Segmentation fault (core dumped)

% ~/kallisto-0.46.0/build/src/kallisto pseudo -i ~/kallisto-0.46.0/mm10/transcripts.idx -o output ./R1/E10_2_1_R1.gz ./R2/E10_2_1_R2.gz

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 41,604
[index] number of k-mers: 66,826,372
[index] number of equivalence classes: 99,573
[quant] running in paired-end mode
[quant] will process pair 1: ./R1/E10_2_1_R1.gz
                             ./R2/E10_2_1_R2.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 1,521,470 reads, 1,221,252 reads pseudoaligned

% more batch1.txt
E10_2_1 ./R1/E10_2_1_R1.gz ./R2/E10_2_1_R2.gz
E10_2_10 ./R1/E10_2_10_R1.gz ./R2/E10_2_10_R2.gz
E10_2_11 ./R1/E10_2_11_R1.gz ./R2/E10_2_11_R2.gz

Segfaults occur only in batch mode. I ran the above on CentOS Linux release 7.2.1511 (Core). I have tried the pre-compiled 0.45.0 and 0.46.0 binaries as well as 0.46.0 compiled from source, but could not avoid the segfault. With single-end fastq files, 0.46.0 is the only version that does not segfault.
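Since the single-pair invocation succeeds while batch mode crashes, one possible interim workaround is to run each batch line as its own kallisto pseudo call. The sketch below only prints the commands so they can be reviewed first. It assumes batch lines of the form "id read1 read2" with no whitespace in paths; print_pseudo_commands is a hypothetical helper, not part of kallisto.

```shell
# Hypothetical helper, not part of kallisto: print one single-sample
# "kallisto pseudo" command per batch line instead of using -b.
# Assumes each non-blank, non-comment line is: <id> <read1> <read2>
print_pseudo_commands() {
    batch_file="$1"
    index="$2"
    out_root="$3"
    while read -r id r1 r2 rest || [ -n "$id" ]; do
        case "$id" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        # Each sample gets its own output directory under out_root.
        printf 'kallisto pseudo -i %s -o %s/%s %s %s\n' \
            "$index" "$out_root" "$id" "$r1" "$r2"
    done < "$batch_file"
}
```

For example, print_pseudo_commands batch1.txt ~/kallisto-0.46.0/mm10/transcripts.idx output lists the commands, and piping that output to sh runs them. Each sample is written to its own directory, so the results are not combined the way a single batch run would combine them.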

pmelsted commented 4 years ago

This issue has been fixed in the development branch.

pachterlab / kallisto

Seg fault on kallisto pseudo only when supplying fastq pairs in batch file #227