samtools / htslib

C library for high-throughput sequencing data formats
Other
784 stars 447 forks source link

Segmentation fault of htslib c program written myself #1708

Closed leon945945 closed 7 months ago

leon945945 commented 7 months ago

Hello, I learned htslib and c program myself due to the slow running speed of PySam, therefore I did't fully master c program and htslib.

I have tried to write some htslib c programs to get reads Tag and alignment position, they ran normally on HiFi bam files aligned with minimap2, but when I applied it to Illumina bam files aligned with bwa, it throwed segmentation fault (core dumped).

Here are my codes:

// getReadsTag.c -- get the alignment tag NM(Total number of mismatches and gaps) and AS(DP alignment score) of reads.
#include <stdio.h>
#include <stdlib.h>
#include "/home/software/htslib/include/htslib/sam.h"
#define bam_is_snd(b) (((b)->core.flag&BAM_FSECONDARY) != 0)
#define bam_is_sup(b) (((b)->core.flag&BAM_FSUPPLEMENTARY) != 0)
int main(int argc, char *argv[])
{
        if (argc != 3) {
                printf("usage:    %s    bamfile    region(chr:start-end)[.]\n", argv[0]);
                //exit(EXIT_FAILURE);
        }

        samFile *fp = sam_open(argv[1], "r");
        hts_idx_t *idx = sam_index_load(fp, argv[1]);
        bam_hdr_t *hdr = sam_hdr_read(fp);
        bam1_t *aln = bam_init1();
        hts_itr_t *itr = sam_itr_querys(idx, hdr, argv[2]);
        //uint32_t * cigar = 0;
        float ratio = 0.0;
        int alnLen = 0;
        int readLen = 0;
        char mark[4];
        //int i = 0;

        printf("readName\tchr\tbegin\tAS\tNM\talignLength\tcigarNum\treadLength\tratio\ttype\n");
        while (sam_itr_next(fp, itr, aln) >= 0) {
                if (aln->core.tid < 0)
                        continue;
                //determine alignment type
                if (bam_is_sup(aln)) {
                        strcpy(mark, "sup");
                }
                else if (bam_is_snd(aln)) {
                        strcpy(mark, "snd");
                }
                else if (! bam_is_sup(aln) && ! bam_is_snd(aln)) {
                        strcpy(mark, "pri");
                }
                //get the alignment length and read length
                alnLen = bam_cigar2rlen(aln->core.n_cigar, bam_get_cigar(aln));
                readLen = aln->core.l_qseq;
                ratio = (float) alnLen/readLen;
                //output
                printf("%s\t%s\t%ld\t%ld\t%ld\t%d\t%d\t%d\t%.2f\t%s",
                        bam_get_qname(aln),
                        hdr->target_name[aln->core.tid],
                        aln->core.pos,
                        bam_aux2i(bam_aux_get(aln,"AS")),
                        bam_aux2i(bam_aux_get(aln,"NM")),
                        alnLen,
                        aln->core.n_cigar,
                        readLen,
                        ratio,
                        mark);
                //cigar = bam_get_cigar(aln);
                //for (i = 0; i < aln->core.n_cigar; i++)
                        //printf("%c%d ", bam_cigar_opchr(cigar[i]), bam_cigar_oplen(cigar[i]));
                printf("\n");
        }

        bam_destroy1(aln);
        hts_itr_destroy(itr);
        bam_hdr_destroy(hdr);
        sam_close(fp);
}

Illumina bam:

getReadsTag BailiuNO.bam . > Illumina.readsTag.txt
Segmentation fault (core dumped)

It ran normally on HiFi bam:

getReadsTag HiFi.reads.bam . > HiFi.readsTag.txt
ls -sh HiFi.readsTag.txt
42M HiFi.readsTag.txt

And I tried to debug it with gdb:

(gdb) n
10                      printf("usage:    %s    bamfile    region(chr:start-end)[.]\n", argv[0]);
(gdb) n
usage:    /home/songlizhi/learning/TEdev/example    bamfile    region(chr:start-end)[.]
14              samFile *fp = sam_open(argv[1], "r");
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaae0f62b in __strstr_sse42 () from /lib64/libc.so.6

Sorry to have no idea on how to fix it, Please give me some suggestions to modify this program, then making it run normally on Illumina bam file. Thanks very much.

daviesrob commented 7 months ago

In gdb, it looks like you didn't give any arguments when running your program, so it crashed when trying to access argv[1]. (You've also commented out the exit(EXIT_FAILURE); after the usage message, which would have prevented this.) Inside gdb, you need to use something like this:

run BailiuNO.bam . > Illumina.readsTag.txt

to see where the program is really going wrong.

From a quick look, I expect the problem may be that one of the reads doesn't have an AS or NM tag, so bax_aux_get() returned NULL. Passing that to bam_aux2i() will have resulted in it trying to dereference a NULL pointer. A safer version would be:

                //output
                uint8_t *as_tag = bam_aux_get(aln,"AS");
                uint8_t *nm_tag = bam_aux_get(aln,"NM");
                printf("%s\t%s\t%ld\t%ld\t%ld\t%d\t%d\t%d\t%.2f\t%s",
                        bam_get_qname(aln),
                        hdr->target_name[aln->core.tid],
                        aln->core.pos,
                        as_tag ? bam_aux2i(as_tag) : -1,
                        nm_tag ? bam_aux2i(nm_tag) : -1,
                        alnLen,
                        aln->core.n_cigar,
                        readLen,
                        ratio,
                        mark);

That will output -1 when the tags are missing, you may want to change that depending on your use case.

whitwham commented 7 months ago

We have a set of example program code samples that you may find helpful. You should look at this one especially.

leon945945 commented 7 months ago

@daviesrob So happy to solve this problem, you are the best! Thank you so much.