iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
92 stars 15 forks source link

Filter reads based on mapped location/position #93

Closed ffranr closed 6 years ago

martinghunt commented 6 years ago

Ideally these options to quasimap:

  1. The reads file is indexed (ie BAM or CRAM)
  2. An interval in the reference genome from which to take the reads, which means 3 things: contig/chromsome name, start position, end position.
  3. Optionally include all unmapped reads (unless we come up with a more clever solution)

Note for 2: include reads that intersect the interval [start,end]. eg if read is mapped from 50-101 and we want to pull out reads from 100-200, then include that read.

iqbal-lab commented 6 years ago

To clarify, for point 1 above, gramtools already allows reading BAM/CRAM as htslib transparently allows it

ffranr commented 6 years ago

Question: Do you know who knows how to use the htslib API to get the reads from a bam which are in a certain range?

Answer (James Bonfield):

The best starting point is probably htslib's test/test_view.c, which has this code in it (paraphrased):

    bam1_t *b = bam_init1();

    hts_idx_t *idx;
    if ((idx = sam_index_load(in, argv[optind])) == 0) {
        fprintf(stderr, "[E::%s] fail to load the BAM index\n", __func__);
        return 1;
    }

    hts_itr_t *iter;
    if ((iter = sam_itr_querys(idx, h, argv[i])) == 0) {
        fprintf(stderr, "[E::%s] fail to parse region '%s'\n", __func__, argv[i]);
        continue;
    }
    while ((r = sam_itr_next(in, iter, b)) >= 0) {
        if (!benchmark && sam_write1(out, h, b) < 0) {
            fprintf(stderr, "Error writing output.\n");
            exit_code = 1;
            break;
        }
        if (nreads && --nreads == 0)
            break;
    }
    hts_itr_destroy(iter);

    if (r < -1) {
        fprintf(stderr, "Error parsing input.\n");
        exit_code = 1;
    }

    bam_destroy1(b);
    bam_hdr_destroy(h);

So basically create an iterator, call it repeatedly until it returns -1 (EOF) or lower (error), and then destroy the iterator at the end.

ffranr commented 6 years ago

The code in it's current state calls this function to input a particular read from a SAM/BAM file: https://github.com/iqbal-lab-org/gramtools/blob/9dd497a01e8efd6422dfc851670f63502817358a/libgramtools/include/sequence_read/seq_file.h#L199

The above function uses functions within HTSlib to extract sequence, quality values, etc. However, the higher level function sam_itr_next is not used. I think the code was written in this way so that the read can be parsed in reverse.

Basically, there's a lot of custom code and the higher level HTSlib functions are not used. This issue doesn't seem to be as trivial as passing in additional variables.

martinghunt commented 6 years ago

Don't worry, I can work around it. Just would have been handy if it was an easy fix.

ffranr commented 6 years ago

@iqbal-lab @martinghunt I don't think that addressing this issue is part of the current plan. Closing the issue.