loosolab / UROPA

Universal RObust Peak Annotator
https://uropa-manual.readthedocs.io/
MIT License

Using CSI instead of TBI indexing to handle large chromosomes in GTF file #24

Closed: samuelruizperez closed this issue 10 months ago

samuelruizperez commented 10 months ago

Hi!

Thank you for developing such a great tool!

I tried to run UROPA 4.0.2 and I got these error messages:

...
...
[DEBUG]   Tabix compress
[E::hts_idx_check_range] Region 536192814..537448215 cannot be stored in a tbi index. Try using a csi index
[WARNING] Indexing failed - the GTF is probably unsorted
[WARNING] Attempting to sort with call: grep -v "^#" /path/uropa/test_feature_subset.gtf | sort -k1,1 -k4,4n > /path/uropa/test_sorted.gtf
[E::hts_idx_check_range] Region 536192814..537448215 cannot be stored in a tbi index. Try using a csi index
[ERROR]   Could not index .gtf-file - please check whether the file has the correct 9-column format.
[ERROR]   Logger lost connection to queue - probably due to an error raised from a child process.

I think the issue is that my GTF contains some very large chromosomes whose coordinates cannot be stored in a TBI index. Is there an easy way to index as CSI instead of TBI, or some other way to handle large chromosomes in the GTF file for UROPA?

Thank you!

msbentsen commented 10 months ago

Hi @samuelruizperez, thank you so much for this issue and the pull request! I would be fine with forcing the use of CSI, but for backwards compatibility I made a fix in UROPA 4.0.3 that automatically checks the largest coordinates and uses CSI only if the chromosomes are too large for TBI. It's available on PyPI now - I hope this solves your issue! I will close your pull request, but feel free to contribute again in the future, thanks! 🥳
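For readers who want to see the idea, the check described above could be sketched roughly as follows. This is not UROPA's actual code: the function names and the `TBI_MAX_COORD` constant are hypothetical. It assumes a tab-separated 9-column GTF and the usual TBI limit of 2^29 (512 Mbp), which is consistent with the region in the error message above.

```python
import io

# Assumption: TBI's binning scheme covers coordinates up to 2^29 (512 Mbp).
TBI_MAX_COORD = 2 ** 29


def max_end_coordinate(gtf_handle):
    """Return the largest end coordinate (column 5) among GTF feature lines."""
    largest = 0
    for line in gtf_handle:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        largest = max(largest, int(fields[4]))
    return largest


def choose_index_format(gtf_handle):
    """Pick 'csi' only when a feature extends past the TBI limit."""
    if max_end_coordinate(gtf_handle) > TBI_MAX_COORD:
        return "csi"
    return "tbi"


# Hypothetical one-line GTF using the coordinates from the error message.
example = io.StringIO(
    'chr1\tsrc\tgene\t536192814\t537448215\t.\t+\t.\tgene_id "g1";\n'
)
print(choose_index_format(example))  # prints "csi"
```

The result could then be passed on to whichever indexer is in use. As far as I know, `pysam.tabix_index` accepts a `csi=True` keyword and the `tabix` command line has a `-C`/`--csi` option, but check the documentation for your installed versions.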

samuelruizperez commented 10 months ago

Hi, @msbentsen:

Thank you for the quick reply and fix! Sounds good to me. 🥳