Requesting feature: give option to exclude long indels from assess assembly bam_stats.

kishwarshafin commented 4 years ago

Hello,

As we have been using Pomoxis to quantify the assemblies that we generate, we faced some problems. One of that is when assessing a human genome, assess_assembly would take long inserts/deletes (>1kb) into account. These are tandem repeats or ALUs, and some are structural variants. The polishers (racon/medaka) are not designed to insert these missing indels. Hence, a fairer comparison would be to give an option where the user can define the length of the longest allowed insert/delete.

I have already implemented this method in my fork of Pomoxis (https://github.com/kishwarshafin/pomoxis), and I've only edited the [stats_to_bam.py] (https://github.com/kishwarshafin/pomoxis/blob/master/pomoxis/common/stats_from_bam.py). I have added masked_stats_from_aligned_read_excluding_longindels for bed related usage and edited stats_from_aligned_read.

Usage: I have introduced a parameter -l that controls the longest allowed indel in the alignment. If the value of -l is 0, then it'll include everything, or it'll behave the same way as current assess_assembly does.

Tests: I made sure that all the methods I have edited/written are compatible with the previous results. I have used the assertion of three methods and ran on HG002 and HG00733, including and excluding bed to ensure that all three methods generate the same result summary. See Commit

I also did some other internal tests. Please let me know if this can be included; then, I'll make a pull request.

cjw85 commented 4 years ago

Hi,

This looks like an interesting idea, and it complements the error catalogues which can be output by assess_assembly. Please go ahead and make a pull request.

kishwarshafin commented 4 years ago

Done, thanks :-)

mwykes commented 4 years ago

Having thought more about this, I wonder if what you want to do could be achieved using the existing functionality to use a .bed file mask? One could put in a utility into pomoxis to find all indels larger than a given length and create a bed-file to mask them out. Deletions would be masked out by their start and end reference coordinate, insertions by masking the adjacent reference positions.

The outcome would be similar, namely to avoid counting long indels, but this approach would have a few advantages:

There would be a record of what was excluded in the form of the bed file (one can easily figure out what fraction of genome is affected, visualise the regions).
Masking of long indels will be treated in the same way for both stats_from_bam and catalogue_errors so that their output is based on the same aligned segments. The current implementation in this pull request breaks that, so for example the overall qscore reported by each will differ, which is not ideal. One could of course add a -l option to catalogue_errors as well, but using a bed mask just seems a more natural way to achieve this.

kishwarshafin commented 4 years ago

Yes, that is right, there can be many ways we can do this analysis but this seemed to be the most straightforward that requires not to run the analysis pipeline more than once. I see how it can be inconvenient to break the catalog errors but running Pomoxis on HG002 takes about 45 minutes to finish as most of the methods are not parallel. And we had to run it many many times while evaluating models. Doubling that would make it really difficult to use. Also, HG002 comes with a bed file from GIAB, so there would be another step to join those bed files too.

I think if you extend the -l parameter to perform an auxiliary calculation to the basic calculation Pomoxis currently has and gives another summary output where you discard all the long indels, that'd be the best middle ground for this option. But yes, clearer documentation about generating a masked bed file will also achieve the same goal.

cjw85 commented 4 years ago

@kishwarshafin

I have just pushed @mwykes updates to github and pypi (v0.3.0-alpha.1).

mwykes commented 4 years ago

@kishwarshafin, assess_assembly now has a -l option which creates a text file documenting all indels with length >= the provided length (the default it 100). These are also written to a bed file. The -l option alone just documents indels, but does nothing to exclude them from the error calculations. The new -e option to assess_assembly will exclude the indels (found with the -l option) from the calculation, both within stats_from_bam and error_catalogue using the existing bedfile functionality. It does this by subtracting the indel intervals from an interval spanning the whole reference. You can also use the -b option and -e option together to provide a .bed file to limit the analysis to certain regions in addition to removing long indels from those regions; in this case the code will use bedtools to subtract the indels from the intervals in the bedfile and use the resulting bed file with stats_from_bam and catalogue_errors.

kishwarshafin commented 4 years ago

Sounds great! Thanks for the help. I'll close this issue for now and will check out all the changes this week. Thanks again.

nanoporetech / pomoxis

Requesting feature: give option to exclude long indels from assess assembly bam_stats. #36