bcftools mpileup does not use the correct chromosome when using multiple BAM files which have a different chromosome

ghuls commented 3 years ago

bcftools mpileup does not use the correct chromosome when using multiple BAM files which have a different chromosome order.

# List of BAM files with BAM files with 2 differn chromosome orders:
#   - chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM
#   - chrM chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
bam_filenames=*.bam

bcftools mpileup \
    --threads 30 \
    --output-type u \
    --fasta-ref "${fasta_filename}" \
    --max-depth 8000 \
    --skip-indels \
    ${bam_filenames} \
  | bcftools call \
        --threads 4 \
        --multiallelic-caller \
        --variants-only \
        --skip-variants indels \
        --output-type b \
        --output test.bcf

stderr

[mpileup] 30 samples in 30 input files
[mpileup] maximum number of reads per input file set to -d 8000
[mplp_func] Skipping because 17426 is outside of 16571 [ref:0]
[mplp_func] Skipping because 17442 is outside of 16571 [ref:0]
[mplp_func] Skipping because 17449 is outside of 16571 [ref:0]
[mplp_func] Skipping because 17464 is outside of 16571 [ref:0]
[mplp_func] Skipping because 17464 is outside of 16571 [ref:0]
[mplp_func] Skipping because 17464 is outside of 16571 [ref:0]
[mplp_func] Skipping because 17469 is outside of 16571 [ref:0]
[mplp_func] Skipping because 17473 is outside of 16571 [ref:0]
[mplp_func] Skipping because 17486 is outside of 16571 [ref:0]
[mplp_func] Skipping because 17493 is outside of 16571 [ref:0]
[mplp_func] Skipping because 17503 is outside of 16571 [ref:0]
[mplp_func] Skipping because 17532 is outside of 16571 [ref:0]
[mplp_func] Skipping because 19206 is outside of 16571 [ref:0]
...

I also tried with a specific region like: chrM

bcftools mpileup \
    -r chrM \
    --output-type v \
    --fasta-ref "${fasta_filename}" \
    --max-depth 8000 \
    --skip-indels \
    ${bam_filenames} 
[mpileup] maximum number of reads per input file set to -d 8000
[mplp_func] Skipping because 2756366 is outside of 16571 [ref:24]
[mplp_func] Skipping because 2781409 is outside of 16571 [ref:24]
[mplp_func] Skipping because 2804105 is outside of 16571 [ref:24]
[mplp_func] Skipping because 2806176 is outside of 16571 [ref:24]
[mplp_func] Skipping because 2819751 is outside of 16571 [ref:24]
[mplp_func] Skipping because 3490449 is outside of 16571 [ref:24]
[mplp_func] Skipping because 3550603 is outside of 16571 [ref:24]
[mplp_func] Skipping because 3863042 is outside of 16571 [ref:24]
[mplp_func] Skipping because 3903940 is outside of 16571 [ref:24]
[mplp_func] Skipping because 3903941 is outside of 16571 [ref:24]
[mplp_func] Skipping because 3903945 is outside of 16571 [ref:24]
[mplp_func] Skipping because 3903951 is outside of 16571 [ref:24]
[mplp_func] Skipping because 3931577 is outside of 16571 [ref:24]

So I think bcftools mpileup needs to have a per BAM file chromosome name to tid mapping, so it get the data from the correct chromosome for each BAM file.

jmarshall commented 3 years ago

It's a known long-standing problem. I think it's never been a priority because most people map everything in a project to the same reference, so it usually doesn't arise. See also e.g. samtools/htslib#306 and samtools/htslib#464.

ghuls commented 3 years ago

In our case the different BAM files are mapped by different people (over the years). Would it be that hard to fix? Would a hasmap with chromosome name to tid for each BAM file, not fix it? BCFtools at least should give a warning that the chromosome orders does not match, if there are no plans to fix it.

jkbonfield commented 3 years ago

Agreed this is something we need to fix (even if just to error). Some subcommands can handle this case fine though, so it's not a universal issue.

jmarshall commented 3 years ago

Notice that in the OP's first case there are no -r options involved, so the BAM files are streamed sequentially and may not even be indexed. So now that there is a header API it would be easy enough to verify all the BAM files' headers are equivalently numbered and produce an error if not, but rather non-trivial in general to jump around independently in each BAM file to stream through the right chromosome at the right time.

https://github.com/samtools/htslib/issues/306#issuecomment-168360924 has a useful toy test case. As I recall, at that time I looked at the pileup machinery in HTSlib with a view to adding the capability of using indexes if needed, and then backed away slowly… :smile:

jkbonfield commented 3 years ago

Lol I hear you. I feel that pain!

ghuls commented 3 years ago

Notice that in the OP's case there are no -r options involved, so the BAM files are streamed sequentially and may not even be indexed. So now that there is a header API it would be easy enough to verify all the BAM files' headers are equivalently numbered and produce an error if not, but rather non-trivial in general to jump around independently in each BAM file to stream through the right chromosome at the right time.

In my second example I used -r chrM (and all my BAM files are indexed), as I was trying if I could trick bcftools mpileup to call mutations per chromosome.

colindaven commented 3 years ago

@ghuls Wouldn't you want to spare yourself a world of pain and just remap all data to exactly the same reference ? It's a fundamental weakness in the approach - how would you trust the outcome ? Even small differences might/should break all downstream apps, as you're seeing here. I would never trust analyses someone else has done in the past and seek to add to that, since even the slightest different assumption can render all your further work invalid.

ghuls commented 3 years ago

@colindaven We used the BAM files without problems for other parts of our analysis (as they only use one BAM file at the time and they can use the index file fine). When I got the data, they told me they were all mapped used the same index. bcftools mpileup or other tools that consume multiple BAM files, should at least check if the chromosome header order matches (if it can't handle a different order) and print an error or warning. Just producing garbage is the worst possible case. If it was not for chr1 and chrM that was switched in the beginning, I might not even noticed the problem as I would have submitted the samples to the cluster as bcftools mpileup takes ages to run, and the mplp_func error would show up only after a few days.

We are planning to remap the BAM files now, but it shouldn't be needed when they are indexed.

ghuls commented 3 years ago

I wrote some ugly code that rewrote the BAM header so the chromosome order was in the order I need and fixed all tid and mtid indexes so the match the reordering of the chromsomes. bcftools mpileup ran fine after that.

It also looks like `bcftools mpileup does not use multiple threads (or at least does not use more than 100% CPU) when reading multiple BAM files.

jkbonfield commented 3 years ago

That does sound pretty hacky! But if it works... It's still something we need to fix though.

You're correct that mpileup doesn't use multiple threads. The only bit it can thread is decompression of input data and compression of output data, but that's not the bottleneck with mpileup. I've been speeding it up a bit, but threading will come at a later date as it's a more challenging prospect.

What depth is your data and are you doing any post-call filtering (even if it's just QUAL>=something)? The BAQ algorithm is the biggest CPU hit for bcftools. It works well at removing false positives in shallower data, but this becomes less beneficial at higher depth and in some cases the reduction in FP comes at a cost of an increase in FN (false negatives) so it's detrimental. I have a work-in-progress branch which reduces BAQ usage to regions it thinks are most beneficial, but this needs more validation on a wider set of data before recommending it. If you have files which you're confident in knowing the results for, then you can experiment with/without BAQ and do your own validation. If you can get away without it, you should have a very significant speed increase.

ghuls commented 3 years ago

I ran it per chromosome and ran bcftools concat afterwards (but the first chromosomes are much bigger, so they take much more time than the latter ones). At the moment I am not doing much post filtering as the calling is done on bulk ATAC samples and the final VCF is meant to be used with demuxlet (https://github.com/statgen/popscle/), to separated mixed samples from single cell data. So high confidence in mutations is not required.

I am slightly confused by this option:

           -G, --group-samples FILE|-
               by default, all samples are assumed to come from a single population. This option allows to group samples into populations and apply the HWE assumption within but not across the populations.  FILE is a tab-delimited text file with sample
               names in the first column and group names in the second column. If - is given instead, no HWE assumption is made at all and single-sample calling is performed. (Note that in low coverage data this inflates the rate of false positives.) The
               -G option requires the presence of FORMAT/AD generated at the bcftools mpileup step by providing the -a FMT/AD option.

If I just have samples from different individuals, should I use this option or not (does it find all unique mutations per sample or does normal bcftools mpileup assume all BAM files are from the same individual)?

samtools / bcftools

bcftools mpileup does not use the correct chromosome when using multiple BAM files which have a different chromosome #1412