Closed Tobi1kenobi closed 4 years ago
PyRanges might be a relevant and more lightweight way to replace Pybedtools : PyRanges: efficient comparison of genomic intervals in Python https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz615/5543103 https://github.com/biocore-NTNU/pyranges
Direct link to BEDOPS suggested by @Tobi1kenobi : https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html
I think I've resolved this issue now using BEDOPS' partition functionality as mentioned above. The changes are integrated into the pipeline on the temporary branch feature/bedops-restructure
. I will merge into develop then master once I've tested it a bit more to be sure it is stable.
@Tobi1kenobi fixed this in https://github.com/perslab/CELLECT/commit/8ef90d1d6e9e15d93f08b1e5e15029c4150f57bf This improvement is scheduled for release v. 1.1.0.
What is the problem?
The way Finucane et. al and Skene, Bryois et. al make their annotation files with Pybedtools is only appropriate for binary annotations. Because our annotations are continuous, merging genes that overlap by applying the max of the genes' ES values and propagating that across all parts of the involved genes will assign inappropriate ES values to some parts and skew our final LD score regression probabilities.
How can we fix it?
Ensure that when we merge overlapping genes - the parts of the genes that do not overlap are treated differently (i.e. left alone) and the parts of the genes that do overlap are given the max (or mean or min) ES value of the overlapping genes. This might not be possible with bedtools and we may have to use BEDOPS, see here: https://www.biostars.org/p/267974/