refresh-bio / colord

A versatile compressor of third generation sequencing reads.
GNU General Public License v3.0

Segmentation fault (core dumped) #9

Closed raphaelbetschart closed 7 months ago

raphaelbetschart commented 7 months ago

Hi, I am trying to compress a PacBio HiFi GIAB sample (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/HG006_NA24694-huCA017E_father/PacBio_CCS_15kb_20kb_chemistry2/uBAMs/m64017_191213_003759.hifi_reads.bam). With this specific sample I always get a "Segmentation fault (core dumped)" message during or after "Counting k-mers". I use the following command:

colord compress-pbhifi --qual org --threads 8 --reference-genome hs38DH.fa m64017_191213_003759.hifi_reads.fastq.gz m64017_191213_003759.hifi_reads.fastq.colord

The BAM file was converted to fastq.gz with bam2fastq from pbtk (https://github.com/PacificBiosciences/pbtk#bam2fastx).

I am using colord 1.2.0.

Other samples worked fine, but I am having trouble with this specific one. Any ideas?

marekkokot commented 7 months ago

Hi,

I cannot reproduce this :( This is how I ran it:

./pbindex ../m64017_191213_003759.hifi_reads.bam
./bam2fastq -o m64017_191213_003759.hifi_reads ../m64017_191213_003759.hifi_reads.bam

The first run was without a reference sequence:

colord/bin/colord compress-pbhifi --qual org --threads 8 m64017_191213_003759.hifi_reads.fastq.gz m64017_191213_003759.hifi_reads.fastq.colord
Counting k-mers.
Stage 1: 100%
Stage 2: 100%
Filtering k-mers.
100%
Running compression.
100%
DNA size        : 968988426
Quality size    : 7181648469
Header size     : 1203516
Meta size       : 54
Info size       : 203
Total time      : 948.253s

For the second run, I downloaded the reference sequence (is this the same file you used?):

wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
mv GRCh38_full_analysis_set_plus_decoy_hla.fa hs38DH.fa

And then:

/usr/bin/time -v colord/bin/colord compress-pbhifi  --qual org --threads 8 --reference-genome hs38DH.fa m64017_191213_003759.hifi_reads.fastq.gz m64017_191213_003759.hifi_reads.fastq.colord+ref
Counting k-mers.
Stage 1: 100%
Stage 2: 100%
Filtering k-mers.
100%
Running compression.
100%
DNA size        : 279460182
Quality size    : 7180109405
Header size     : 1203515
Meta size       : 83
Info size       : 236
Total time      : 3842.36s
        Command being timed: "colord/bin/colord compress-pbhifi --qual org --threads 8 --reference-genome hs38DH.fa m64017_191213_003759.hifi_reads.fastq.gz m64017_191213_003759.hifi_reads.fastq.colord+ref"
        User time (seconds): 21653.69
        System time (seconds): 81.94
        Percent of CPU this job got: 565%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:04:02
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 12528628
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 48
        Minor (reclaiming a frame) page faults: 18038895
        Voluntary context switches: 3727105
        Involuntary context switches: 22179
        Swaps: 0
        File system inputs: 7128
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

I ran this on WSL. Please let me know your operating system and hardware, and whether you can try it in a different environment. Best, Marek
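For readers comparing the two runs above, the arithmetic can be sketched in a few lines of shell. The numbers are copied from the logs; treating the reported sizes as bytes is an assumption (the log does not state units), and the `ratio` helper is a hypothetical name introduced only for this illustration.

```shell
# Values copied from the two colord logs above; units are assumed to be bytes.
dna_noref=968988426    # "DNA size" without a reference genome
dna_ref=279460182      # "DNA size" with --reference-genome
time_noref=948.253     # "Total time" without a reference genome (seconds)
time_ref=3842.36       # "Total time" with --reference-genome (seconds)
max_rss_kb=12528628    # "Maximum resident set size (kbytes)" from /usr/bin/time -v

# ratio A B: print A/B with two decimals (awk does the floating-point division).
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

dna_gain=$(ratio "$dna_noref" "$dna_ref")
slowdown=$(ratio "$time_ref" "$time_noref")
rss_gib=$(ratio "$max_rss_kb" $((1024 * 1024)))

echo "DNA stream is ${dna_gain}x smaller with the reference"
echo "Run takes ${slowdown}x longer with the reference"
echo "Peak memory with the reference: ${rss_gib} GiB"
```

In both runs the quality stream (about 7.18e9) dominates the total, so the reference mainly shrinks the DNA stream rather than the overall archive.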

raphaelbetschart commented 7 months ago

Hi Marek, thanks for your reply. I can get colord to run without a reference genome, but as soon as I specify one I get the segmentation fault (I've tried the one you mentioned, plus hs38DH.fa and the standard hg38.fa). Interestingly, it works when I specify the reference genome AND use only a single thread. Two and three threads work fine too, but more than 4 threads leads to the segmentation fault.

I'm running it on Rocky Linux 9.2, with AMD Epyc 7742 CPUs.

Best, Raphael

Edit: I have the following md5sum: 3c0a0006322b140e6e39bb02cdf207a2 m64017_191213_003759.hifi_reads.fastq.gz
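A thread-count dependence like the one described above can be narrowed down with a small sweep. The following is only a sketch, not part of colord: `sweep_threads` and `run_colord` are hypothetical helper names, the per-run output file name is made up for illustration, and the check relies on the common Linux shell convention that a process killed by SIGSEGV reports exit status 139 (128 + 11).

```shell
# sweep_threads RUNNER N1 N2 ...
# Calls RUNNER once per thread count and reports how each run ended.
sweep_threads() {
  local runner=$1; shift
  local t status
  for t in "$@"; do
    status=0
    "$runner" "$t" >/dev/null 2>&1 || status=$?
    case $status in
      0)   echo "threads=$t: ok" ;;
      139) echo "threads=$t: segfault" ;;   # 128 + SIGSEGV (11) on Linux
      *)   echo "threads=$t: exit $status" ;;
    esac
  done
}

# Hypothetical runner: the command from this thread with a variable thread
# count. The output name (".t<N>.colord") is an assumption for illustration.
run_colord() {
  colord compress-pbhifi --qual org --threads "$1" \
    --reference-genome hs38DH.fa \
    m64017_191213_003759.hifi_reads.fastq.gz \
    "m64017_191213_003759.hifi_reads.t$1.colord"
}

# Example invocation (commented out because each run takes a long time):
# sweep_threads run_colord 1 2 3 4 8
```

Passing the runner as a function keeps the sweep logic independent of the exact colord command line, so the same loop can be reused with a different reference or input file.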

marekkokot commented 7 months ago

Hi Raphael,

My md5sum is the same. I am able to reproduce this on another machine. I hope to fix this as soon as possible.
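The checksum comparison above, which rules out a corrupted or differently converted input, can be scripted as follows. This is a minimal sketch assuming GNU coreutils `md5sum` is available; `verify_md5` is a hypothetical helper name.

```shell
# verify_md5 EXPECTED FILE
# Succeeds (exit 0) only if FILE's MD5 digest equals EXPECTED.
verify_md5() {
  local expected=$1 file=$2 actual
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual=$(md5sum "$file" 2>/dev/null | cut -d' ' -f1)
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Example with the checksum quoted in this thread (commented out because the
# file is not assumed to be present):
# verify_md5 3c0a0006322b140e6e39bb02cdf207a2 m64017_191213_003759.hifi_reads.fastq.gz \
#   && echo "input matches" || echo "input differs or is missing"
```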

marekkokot commented 7 months ago

Hi Raphael,

It should now be fixed in commit 3e87a2240adc622a8262b50adf9ca1f87ddc6a1f. Please try to verify this. I have also created a new release (1.2.1) containing this fix, in case you cannot compile the code for some reason. Let me know if it works in your environment now.

raphaelbetschart commented 7 months ago

Hi @marekkokot,

I can confirm that the bug is fixed, thanks for the quick fix.

Best, Raphael

marekkokot commented 7 months ago

Great, I am closing this.