walaj / svaba

Structural variation and indel detection by local assembly
GNU General Public License v3.0
235 stars 45 forks source link

Number of sequences in BAM header mismatches reference #126

Open tnnandi opened 1 year ago

tnnandi commented 1 year ago

Hi,

When I am using SvABA with the hg38 (no alt) reference file and a BAM file obtained by aligning a whole human genome to the same reference, I get the following error (only the first few lines are shown):

!!!!!!!!!!! WARNING !!!!!!!!!!!
!!!!!! Number of sequences in BAM header mismatches reference
!!!!!! BAM: 3366 -- Ref: 194
!!!!!! BAM sequence id 1: "chr2" -- Ref sequence id 1: "chr10"
!!!!!! BAM sequence id 2: "chr3" -- Ref sequence id 2: "chr11"
!!!!!! BAM sequence id 3: "chr4" -- Ref sequence id 3: "chr11_KI270721v1_random"
!!!!!! BAM sequence id 4: "chr5" -- Ref sequence id 4: "chr12"
!!!!!! BAM sequence id 5: "chr6" -- Ref sequence id 5: "chr13"
!!!!!! BAM sequence id 6: "chr7" -- Ref sequence id 6: "chr14"
!!!!!! BAM sequence id 7: "chr8" -- Ref sequence id 7: "chr14_GL000009v2_random"
!!!!!! BAM sequence id 8: "chr9" -- Ref sequence id 8: "chr14_GL000225v1_random"
!!!!!! BAM sequence id 9: "chr10" -- Ref sequence id 9: "chr14_KI270722v1_random"
!!!!!! BAM sequence id 10: "chr11" -- Ref sequence id 10: "chr14_GL000194v1_random"
!!!!!! BAM sequence id 11: "chr12" -- Ref sequence id 11: "chr14_KI270723v1_random"
.....

This error arises from the following condition in src/savaba/run_savaba.cpp: if (b_header.NumSequences() != bwa_header.NumSequences()), and when I do samtools view -H <bamfile> | wc -l I do indeed get a number close to 3366 but that includes lot of information in addition to that of just the sequences (e.g., steps for alignment).

So, I don't see any reason why both headers (for the bam file and the reference file) should be of the same length if we are not limiting our comparison to only those lines that contain the sequence information.

I have ensured that the BAM file has indeed been obtained by mapping to the same reference genome (I re-did it myself), so I'm not sure how to get around this issue.

Thanks, Tarak

walaj commented 1 year ago

It seems that perhaps the BamHeader parser is treating the extra sequences differently than the reference parser. What are the tags in the BamHeader or are you able to post an example?

The other issue is that the reference is sorted with chr10 first, and the header has chr2 first. Are you sure the order in the reference is the same as in the BAM header? You’re right though, this warning is meant to get users to align to the same reference, but you already did this.

On Tue, Jul 25, 2023 at 7:11 PM tnnandi @.***> wrote:

Hi,

When I am using SvABA with the hg38 (no alt) reference file and a BAM file obtained by aligning a whole human genome to the same reference, I get the following error (only the first few lines are shown):

!!!!!!!!!!! WARNING !!!!!!!!!!! !!!!!! Number of sequences in BAM header mismatches reference !!!!!! BAM: 3366 -- Ref: 194 !!!!!! BAM sequence id 1: "chr2" -- Ref sequence id 1: "chr10" !!!!!! BAM sequence id 2: "chr3" -- Ref sequence id 2: "chr11" !!!!!! BAM sequence id 3: "chr4" -- Ref sequence id 3: "chr11_KI270721v1_random" !!!!!! BAM sequence id 4: "chr5" -- Ref sequence id 4: "chr12" !!!!!! BAM sequence id 5: "chr6" -- Ref sequence id 5: "chr13" !!!!!! BAM sequence id 6: "chr7" -- Ref sequence id 6: "chr14" !!!!!! BAM sequence id 7: "chr8" -- Ref sequence id 7: "chr14_GL000009v2_random" !!!!!! BAM sequence id 8: "chr9" -- Ref sequence id 8: "chr14_GL000225v1_random" !!!!!! BAM sequence id 9: "chr10" -- Ref sequence id 9: "chr14_KI270722v1_random" !!!!!! BAM sequence id 10: "chr11" -- Ref sequence id 10: "chr14_GL000194v1_random" !!!!!! BAM sequence id 11: "chr12" -- Ref sequence id 11: "chr14_KI270723v1_random" .....

This error arises from the following condition in src/savaba/run_savaba.cpp: if (b_header.NumSequences() != bwa_header.NumSequences()), and when I do samtool view -H | wc -l I do indeed get a number close to 3366 but that include lot of information in addition to that of just the sequences (e.g., steps for alignment).

I have ensured that the BAM file has indeed been obtained by mapping to the same reference genome (I re-did it myself), so I'm not sure how to get around this issue.

Thanks, Tarak

— Reply to this email directly, view it on GitHub https://github.com/walaj/svaba/issues/126, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUZ7CETQJEOUTEGC3IHYMDXSBHAZANCNFSM6AAAAAA2XYA5J4 . You are receiving this because you are subscribed to this thread.Message ID: @.***>

tnnandi commented 1 year ago

Hi @walaj , thank you very much for the prompt response!

Here's the link to a file that contains the header of the bam file + first few entries: https://drive.google.com/file/d/1i66328Q4bmjMAGhD_KF6Z4mQcWBVpjua/view?usp=drive_link

The first line of the reference file looks like this (it starts with chr1):

chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Also, this seems to be more than just a warning as the code fails if trigger_explain is True that is caused by a mismatch in the headers.

Thanks, Tarak

walaj commented 1 year ago

This BAM header has 3366 sequence tags as svaba reports, so I don't see an issue here. You'll have to confirm how many sequences the reference fasta has.

On Tue, Jul 25, 2023 at 9:00 PM tnnandi @.***> wrote:

Hi @walaj https://github.com/walaj , thank you very much for the prompt response!

Here's the link to a file that contains the header of the bam file + first few entries: https://drive.google.com/file/d/1i66328Q4bmjMAGhD_KF6Z4mQcWBVpjua/view?usp=drive_link

The first line of the reference file looks like this (it starts with chr1):

chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Also, this seems to be more than just a warning (it is in fact an error notification) as the code fails if trigger_explain is True that is caused by

for (int i = 0; i < std::min(b_header.NumSequences(), bwa_header.NumSequences()); ++i) {
  if (b_header.IDtoName(i) != bwa_header.IDtoName(i)) {

trigger_explain = true;

Thanks, Tarak

— Reply to this email directly, view it on GitHub https://github.com/walaj/svaba/issues/126#issuecomment-1650808384, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUZ7CEZ4ZPMI53GXBZ4TYLXSBT33ANCNFSM6AAAAAA2XYA5J4 . You are receiving this because you were mentioned.Message ID: @.***>

zhaowsong commented 1 year ago

Hi @walaj , thank you very much for the prompt response!

Here's the link to a file that contains the header of the bam file + first few entries: https://drive.google.com/file/d/1i66328Q4bmjMAGhD_KF6Z4mQcWBVpjua/view?usp=drive_link

The first line of the reference file looks like this (it starts with chr1):

chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Also, this seems to be more than just a warning as the code fails if trigger_explain is True that is caused by a mismatch in the headers.

Thanks, Tarak

Hi,

When I am using SvABA with the hg38 (no alt) reference file and a BAM file obtained by aligning a whole human genome to the same reference, I get the following error (only the first few lines are shown):

!!!!!!!!!!! WARNING !!!!!!!!!!!
!!!!!! Number of sequences in BAM header mismatches reference
!!!!!! BAM: 3366 -- Ref: 194
!!!!!! BAM sequence id 1: "chr2" -- Ref sequence id 1: "chr10"
!!!!!! BAM sequence id 2: "chr3" -- Ref sequence id 2: "chr11"
!!!!!! BAM sequence id 3: "chr4" -- Ref sequence id 3: "chr11_KI270721v1_random"
!!!!!! BAM sequence id 4: "chr5" -- Ref sequence id 4: "chr12"
!!!!!! BAM sequence id 5: "chr6" -- Ref sequence id 5: "chr13"
!!!!!! BAM sequence id 6: "chr7" -- Ref sequence id 6: "chr14"
!!!!!! BAM sequence id 7: "chr8" -- Ref sequence id 7: "chr14_GL000009v2_random"
!!!!!! BAM sequence id 8: "chr9" -- Ref sequence id 8: "chr14_GL000225v1_random"
!!!!!! BAM sequence id 9: "chr10" -- Ref sequence id 9: "chr14_KI270722v1_random"
!!!!!! BAM sequence id 10: "chr11" -- Ref sequence id 10: "chr14_GL000194v1_random"
!!!!!! BAM sequence id 11: "chr12" -- Ref sequence id 11: "chr14_KI270723v1_random"
.....

This error arises from the following condition in src/savaba/run_savaba.cpp: if (b_header.NumSequences() != bwa_header.NumSequences()), and when I do samtools view -H <bamfile> | wc -l I do indeed get a number close to 3366 but that includes lot of information in addition to that of just the sequences (e.g., steps for alignment).

So, I don't see any reason why both headers (for the bam file and the reference file) should be of the same length if we are not limiting our comparison to only those lines that contain the sequence information.

I have ensured that the BAM file has indeed been obtained by mapping to the same reference genome (I re-did it myself), so I'm not sure how to get around this issue.

Thanks, Tarak

Hi tnnandi, have you solved your issue? I'm also facing the same problem

walaj commented 1 year ago

I still think the reference and the header are different. @zhaowsong can you confirm with grep etc the number of unique reference contigs in both the reference and the bam header you are using?