sstadick / perbase

Per-base per-nucleotide depth analysis
MIT License

Update par_granges_example.rs #58

Closed · y9c closed this 1 year ago

y9c commented 1 year ago

The bed and vcf argument order is not correct in the example.

sstadick commented 1 year ago

Thank you!!

y9c commented 1 year ago

Hi @sstadick,

I have a quick question about this command.

    // Create a par_granges runner
    let par_granges_runner = par_granges::ParGranges::new(
        PathBuf::from("test/sample1.bam"),    // pass in bam
        Some(PathBuf::from("test/reference.fa")), // optional ref fasta
        Some(PathBuf::from("test/test.bed")), // bedfile to narrow regions
        None, // optional bcf/vcf file specifying positions of interest
        true, // merge any overlapping intervals in the BED file
        None, // optional allowed number of threads, defaults to max
        None, // optional chunksize modification
        None, // optional channel size modification
        basic_processor,
    );

I added reference.fa to the ParGranges object, but in the output ref_base is still None.

PileupPosition { ref_seq: "XII", pos: 729187, ref_base: None, depth: 276, a: 268, c: 0, g: 0, t: 0, n: 0, ins: 0, del: 8, ref_skip: 0, fail: 0, near_max_depth: false }

How can I fix this?

Thank you.

sstadick commented 1 year ago

Is your reference file indexed? And do the chromosome names between the fasta file, bed file, and BAM header match? They may all have to be sorted the same as well - it’s been a while since I dug into that code.
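As a quick way to check the naming, the first column of the FASTA's `.fai` index can be diffed against the `SN:` fields of the `@SQ` lines in the BAM header. A minimal Rust sketch of that check, using only the standard library; the sequence names and lengths below are made up for illustration:

```rust
// Compare sequence names between a FASTA .fai index and a BAM header.
// The .fai format is tab-separated with the sequence name in column 1;
// @SQ header lines carry the name in the SN: field.
use std::collections::HashSet;

fn fai_names(fai: &str) -> HashSet<String> {
    fai.lines()
        .filter_map(|l| l.split('\t').next())
        .map(String::from)
        .collect()
}

fn bam_header_names(header: &str) -> HashSet<String> {
    header
        .lines()
        .filter(|l| l.starts_with("@SQ"))
        .filter_map(|l| {
            l.split('\t')
                .find(|f| f.starts_with("SN:"))
                .map(|f| f[3..].to_string())
        })
        .collect()
}

fn main() {
    // Hypothetical inputs for illustration only:
    let fai = "XII\t1078177\t6\t60\t61\nXIII\t924431\t1096156\t60\t61";
    let header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:XII\tLN:1078177\n@SQ\tSN:chrXIII\tLN:924431";

    let missing: HashSet<_> = fai_names(fai)
        .difference(&bam_header_names(header))
        .cloned()
        .collect();
    // Here "XIII" vs "chrXIII" would be flagged as a mismatch.
    println!("in FASTA but not in BAM header: {:?}", missing);
}
```

In practice you would read the `.fai` from disk and dump the header with `samtools view -H`; any name present in one set but not the other would explain a silently missing `ref_base`.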


sstadick commented 1 year ago

Actually, I think the example is just a bit incomplete in covering how to set up a processor to do what you want. Take a look at the BaseProcessor in src/commands/base_depth.rs (apologies for no link, I'm on my phone).

Like I said though it's been a bit since I've really gotten into this code so I might be off base. I'll take a look tomorrow when I have a computer.

y9c commented 1 year ago

@sstadick, thank you for your help.

Yes, both the BAM and FASTA are indexed and sorted in the same order. I tried changing Some(PathBuf::from("test/reference.fa")) to Some(PathBuf::from("something_not_exist")), and the code still executed without error. I think the ParGranges object might not parse the FASTA input when the data is in BAM format.
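Until that's confirmed, one possible workaround is to validate the reference path up front, since a nonexistent path appears to be ignored silently. A small defensive sketch; the helper name and path are hypothetical:

```rust
// Sketch of a defensive check one could run before building the ParGranges
// runner: verify the reference FASTA and its .fai index exist on disk,
// since a missing path appears to be silently ignored downstream.
use std::path::PathBuf;

fn check_ref(path: &PathBuf) -> Result<(), String> {
    if !path.exists() {
        return Err(format!("reference not found: {}", path.display()));
    }
    // htslib-based readers expect a sibling .fai index next to the FASTA.
    let mut fai = path.clone().into_os_string();
    fai.push(".fai");
    if !PathBuf::from(fai).exists() {
        return Err(format!("missing .fai index for {}", path.display()));
    }
    Ok(())
}

fn main() {
    // Hypothetical path; with no such file this reports an error instead
    // of silently producing ref_base: None later on.
    match check_ref(&PathBuf::from("something_not_exist")) {
        Ok(()) => println!("reference looks usable"),
        Err(e) => eprintln!("{}", e),
    }
}
```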

y9c commented 1 year ago

Hi @sstadick, I made the reference base output work by passing the fasta input in the reads buffer.

However, the speed of this pileup function is quite slow. I saw you pushed a pileup function to the noodles library (https://github.com/zaeleus/noodles/pull/14). Will it be faster? Or have you found a better way to improve this? Thanks.

sstadick commented 1 year ago

@y9c, sorry for the slow reply, I'm glad you got something working! If you are able to share your implementation I would be happy to take a look for low-hanging efficiency fixes.

The short answer is that the PR I have open isn't finished, and may not be faster because it would necessitate a complete migration to Noodles instead of htslib. I haven't revisited the idea in a while but in theory having a custom pileup engine should remove a few copies / redundancies compared to using htslib itself. I don't think it will be a massive speedup though.

All the other methods for depth / coverage calculation in perbase are fast specifically because they use tricks to avoid looking at the actual bases. As soon as you have to look at each base it's going to be orders of magnitude slower. If perbase, or the lib, is slower than samtools or any other library I would consider that a bug.
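The kind of trick referred to can be sketched as a start/end event sweep: add +1 where a read begins, -1 where it ends, then take a prefix sum, so depth falls out without inspecting a single base. This is an illustrative sketch with made-up intervals, not perbase's actual implementation:

```rust
// Event-based depth: O(reads + positions), never touching any bases.
fn depth_from_intervals(reads: &[(usize, usize)], len: usize) -> Vec<i64> {
    let mut events = vec![0i64; len + 1];
    for &(start, end) in reads {
        events[start] += 1; // read enters coverage here
        events[end] -= 1;   // read leaves coverage here (end is exclusive)
    }
    // Prefix sum turns the events into per-position depth.
    let mut depth = Vec::with_capacity(len);
    let mut running = 0i64;
    for e in &events[..len] {
        running += e;
        depth.push(running);
    }
    depth
}

fn main() {
    // Toy reads as (start, exclusive end) over a 10 bp region.
    let reads = [(0, 4), (2, 7), (5, 10)];
    // prints [1, 1, 2, 2, 1, 2, 2, 1, 1, 1]
    println!("{:?}", depth_from_intervals(&reads, 10));
}
```

This is why the depth-only paths are so much faster: the per-base pileup path has to walk every aligned base of every read, while the sweep only looks at two endpoints per read.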