gpertea / gclib

GCLib - Genomic C++ library of reusable code for bioinformatics projects

use 64bit int for genomic coordinates everywhere (GSeg & up) #13

Open gpertea opened 1 month ago

gpertea commented 1 month ago

Genomes with contigs larger than 2 Gb (2^31 bases) cause unpredictable errors or crashes because the core GSeg data structure stores coordinates as 32-bit unsigned int. The unsigned type does not reliably raise the limit to 4 Gb, because coordinates inevitably end up in signed int arithmetic somewhere.
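To illustrate the failure mode, here is a minimal sketch. `Seg32`, `Seg64`, `gap32`, `gap64` and `asInt32` are hypothetical stand-ins written for this example, not the actual gclib definitions, and the behavior shown assumes a platform where `unsigned int` is 32 bits wide.

```cpp
#include <cstdint>

// Hypothetical stand-ins for a segment type; NOT the actual gclib GSeg.
struct Seg32 { uint32_t start; uint32_t end; };
struct Seg64 { int64_t start; int64_t end; };

// Signed distance from the end of a to the start of b.
// With 32-bit unsigned fields the subtraction is done in unsigned
// arithmetic, so a "negative" distance wraps modulo 2^32 and comes
// back as a large positive number.
int64_t gap32(const Seg32& a, const Seg32& b) {
    return b.start - a.end;
}

// The same computation with 64-bit signed fields gives the expected sign.
int64_t gap64(const Seg64& a, const Seg64& b) {
    return b.start - a.end;
}

// Narrowing a coordinate above 2^31 - 1 to a 32-bit signed int, as happens
// when a coordinate is passed to code that takes a plain int.
// On mainstream compilers the value wraps to a negative number.
int32_t asInt32(uint32_t coord) {
    return static_cast<int32_t>(coord);
}
```

With a feature at 3,000,000,000..3,000,000,100 and another at 100..200, `gap32` returns a positive value while `gap64` returns -3,000,000,000, and `asInt32(3000000000u)` comes out negative. Either result is enough to break sorting, overlap checks or buffer sizing downstream.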

GFaSeqGet is likely the first to fail in gffread, so it needs to be migrated to 64-bit coordinates at the same time. This requires gffread/gffcompare updates as well.
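As a rough sketch of why the sequence fetcher is sensitive to this: locating a base in a line-wrapped FASTA file means turning a coordinate into a byte offset, and that offset is larger than the coordinate itself. The `FastaLayout` struct and `baseOffset` function below are hypothetical and follow the usual faidx-style arithmetic; they are not GFaSeqGet's actual interface.

```cpp
#include <cstdint>

// Hypothetical description of one FASTA record's layout (faidx-style);
// NOT the GFaSeqGet API.
struct FastaLayout {
    int64_t seqOffset;  // byte offset of the first base of the record
    int64_t lineBases;  // bases on each full line
    int64_t lineBytes;  // bytes per full line, end-of-line characters included
};

// Byte offset in the file of the 1-based coordinate pos.
// Every intermediate value is 64-bit, so neither the coordinate nor the
// resulting offset is truncated for contigs beyond 2^31 bases.
int64_t baseOffset(const FastaLayout& f, int64_t pos) {
    const int64_t zeroBased = pos - 1;
    const int64_t fullLines = zeroBased / f.lineBases;
    const int64_t column = zeroBased % f.lineBases;
    return f.seqOffset + fullLines * f.lineBytes + column;
}
```

For a record starting at byte 10 with 60 bases per 61-byte line, coordinate 5,000,000,000 maps to byte 5,083,333,342, which fits in neither a signed nor an unsigned 32-bit integer.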

Proper tests should be written for GFF parsing and GFaSeqGet on both small and large genomes.