Open ghuls opened 3 years ago
It's a known long-standing problem. I think it's never been a priority because most people map everything in a project to the same reference, so it usually doesn't arise. See also e.g. samtools/htslib#306 and samtools/htslib#464.
In our case the different BAM files are mapped by different people (over the years). Would it be that hard to fix? Would a hasmap with chromosome name to tid for each BAM file, not fix it? BCFtools at least should give a warning that the chromosome orders does not match, if there are no plans to fix it.
Agreed this is something we need to fix (even if just to error). Some subcommands can handle this case fine though, so it's not a universal issue.
Notice that in the OP's first case there are no -r
options involved, so the BAM files are streamed sequentially and may not even be indexed. So now that there is a header API it would be easy enough to verify all the BAM files' headers are equivalently numbered and produce an error if not, but rather non-trivial in general to jump around independently in each BAM file to stream through the right chromosome at the right time.
https://github.com/samtools/htslib/issues/306#issuecomment-168360924 has a useful toy test case. As I recall, at that time I looked at the pileup machinery in HTSlib with a view to adding the capability of using indexes if needed, and then backed away slowly… :smile:
Lol I hear you. I feel that pain!
Notice that in the OP's case there are no
-r
options involved, so the BAM files are streamed sequentially and may not even be indexed. So now that there is a header API it would be easy enough to verify all the BAM files' headers are equivalently numbered and produce an error if not, but rather non-trivial in general to jump around independently in each BAM file to stream through the right chromosome at the right time.
In my second example I used -r chrM
(and all my BAM files are indexed), as I was trying if I could trick bcftools mpileup
to call mutations per chromosome.
@ghuls Wouldn't you want to spare yourself a world of pain and just remap all data to exactly the same reference ? It's a fundamental weakness in the approach - how would you trust the outcome ? Even small differences might/should break all downstream apps, as you're seeing here. I would never trust analyses someone else has done in the past and seek to add to that, since even the slightest different assumption can render all your further work invalid.
@colindaven We used the BAM files without problems for other parts of our analysis (as they only use one BAM file at the time and they can use the index file fine). When I got the data, they told me they were all mapped used the same index. bcftools mpileup
or other tools that consume multiple BAM files, should at least check if the chromosome header order matches (if it can't handle a different order) and print an error or warning. Just producing garbage is the worst possible case. If it was not for chr1 and chrM that was switched in the beginning, I might not even noticed the problem as I would have submitted the samples to the cluster as bcftools mpileup
takes ages to run, and the mplp_func
error would show up only after a few days.
We are planning to remap the BAM files now, but it shouldn't be needed when they are indexed.
I wrote some ugly code that rewrote the BAM header so the chromosome order was in the order I need and fixed all tid
and mtid
indexes so the match the reordering of the chromsomes. bcftools mpileup
ran fine after that.
It also looks like `bcftools mpileup does not use multiple threads (or at least does not use more than 100% CPU) when reading multiple BAM files.
That does sound pretty hacky! But if it works... It's still something we need to fix though.
You're correct that mpileup doesn't use multiple threads. The only bit it can thread is decompression of input data and compression of output data, but that's not the bottleneck with mpileup. I've been speeding it up a bit, but threading will come at a later date as it's a more challenging prospect.
What depth is your data and are you doing any post-call filtering (even if it's just QUAL>=
something)? The BAQ algorithm is the biggest CPU hit for bcftools. It works well at removing false positives in shallower data, but this becomes less beneficial at higher depth and in some cases the reduction in FP comes at a cost of an increase in FN (false negatives) so it's detrimental. I have a work-in-progress branch which reduces BAQ usage to regions it thinks are most beneficial, but this needs more validation on a wider set of data before recommending it. If you have files which you're confident in knowing the results for, then you can experiment with/without BAQ and do your own validation. If you can get away without it, you should have a very significant speed increase.
I ran it per chromosome and ran bcftools concat
afterwards (but the first chromosomes are much bigger, so they take much more time than the latter ones). At the moment I am not doing much post filtering as the calling is done on bulk ATAC samples and the final VCF is meant to be used with demuxlet (https://github.com/statgen/popscle/), to separated mixed samples from single cell data. So high confidence in mutations is not required.
I am slightly confused by this option:
-G, --group-samples FILE|-
by default, all samples are assumed to come from a single population. This option allows to group samples into populations and apply the HWE assumption within but not across the populations. FILE is a tab-delimited text file with sample
names in the first column and group names in the second column. If - is given instead, no HWE assumption is made at all and single-sample calling is performed. (Note that in low coverage data this inflates the rate of false positives.) The
-G option requires the presence of FORMAT/AD generated at the bcftools mpileup step by providing the -a FMT/AD option.
If I just have samples from different individuals, should I use this option or not (does it find all unique mutations per sample or does normal bcftools mpileup assume all BAM files are from the same individual)?
bcftools mpileup
does not use the correct chromosome when using multiple BAM files which have a different chromosome order.stderr
I also tried with a specific region like:
chrM
So I think
bcftools mpileup
needs to have a per BAM file chromosome name totid
mapping, so it get the data from the correct chromosome for each BAM file.