Closed kishwarshafin closed 4 years ago
Hi,
This looks like an interesting idea, and it complements the error catalogues which can be output by assess_assembly
. Please go ahead and make a pull request.
Done, thanks :-)
Having thought more about this, I wonder if what you want to do could be achieved using the existing functionality to use a .bed file mask? One could put in a utility into pomoxis to find all indels larger than a given length and create a bed-file to mask them out. Deletions would be masked out by their start and end reference coordinate, insertions by masking the adjacent reference positions.
The outcome would be similar, namely to avoid counting long indels, but this approach would have a few advantages:
Yes, that is right, there can be many ways we can do this analysis but this seemed to be the most straightforward that requires not to run the analysis pipeline more than once. I see how it can be inconvenient to break the catalog errors but running Pomoxis on HG002 takes about 45 minutes to finish as most of the methods are not parallel. And we had to run it many many times while evaluating models. Doubling that would make it really difficult to use. Also, HG002 comes with a bed file from GIAB, so there would be another step to join those bed files too.
I think if you extend the -l parameter to perform an auxiliary calculation to the basic calculation Pomoxis currently has and gives another summary output where you discard all the long indels, that'd be the best middle ground for this option. But yes, clearer documentation about generating a masked bed file will also achieve the same goal.
@kishwarshafin
I have just pushed @mwykes updates to github and pypi (v0.3.0-alpha.1).
@kishwarshafin, assess_assembly now has a -l option which creates a text file documenting all indels with length >= the provided length (the default it 100). These are also written to a bed file. The -l option alone just documents indels, but does nothing to exclude them from the error calculations. The new -e option to assess_assembly will exclude the indels (found with the -l option) from the calculation, both within stats_from_bam and error_catalogue using the existing bedfile functionality. It does this by subtracting the indel intervals from an interval spanning the whole reference. You can also use the -b option and -e option together to provide a .bed file to limit the analysis to certain regions in addition to removing long indels from those regions; in this case the code will use bedtools to subtract the indels from the intervals in the bedfile and use the resulting bed file with stats_from_bam and catalogue_errors.
Sounds great! Thanks for the help. I'll close this issue for now and will check out all the changes this week. Thanks again.
Hello,
As we have been using
Pomoxis
to quantify the assemblies that we generate, we faced some problems. One of that is when assessing a human genome,assess_assembly
would take long inserts/deletes (>1kb) into account. These are tandem repeats or ALUs, and some are structural variants. The polishers (racon/medaka) are not designed to insert these missing indels. Hence, a fairer comparison would be to give an option where the user can define the length of the longest allowed insert/delete.I have already implemented this method in my fork of Pomoxis (https://github.com/kishwarshafin/pomoxis), and I've only edited the [
stats_to_bam.py
] (https://github.com/kishwarshafin/pomoxis/blob/master/pomoxis/common/stats_from_bam.py). I have addedmasked_stats_from_aligned_read_excluding_longindels
for bed related usage and editedstats_from_aligned_read
.Usage: I have introduced a parameter
-l
that controls the longest allowed indel in the alignment. If the value of-l
is 0, then it'll include everything, or it'll behave the same way as currentassess_assembly
does.Tests: I made sure that all the methods I have edited/written are compatible with the previous results. I have used the assertion of three methods and ran on HG002 and HG00733, including and excluding bed to ensure that all three methods generate the same result summary. See Commit
I also did some other internal tests. Please let me know if this can be included; then, I'll make a pull request.