ucscGenomeBrowser / kent

UCSC Genome Browser source tree. Stable branch: "beta".
http://genome.ucsc.edu/

Support for chromosomes longer than 2^31 #44

Closed by SergejN 4 years ago

SergejN commented 4 years ago

Dear UCSC genome browser team,

I ran into a problem trying to create a 2bit file for a large genome we are currently trying to publish. There are four scaffolds that are longer than 2^31bp.

scaf01 4922309470 9 50 51
scaf02 4899412387 5020755678 50 51
scaf03 4208277625 10018156322 50 51
scaf04 2887904217 14310599509 50 51

Observed behavior:

faToTwoBit -long -noMask scaffolds_v2.fa scaf.2bit
expandFaFastBuf: integer overflow when trying to increase buffer size from 2147483648 to a min of 51.

Would it be easy to declare the variable newBufSize as unsigned long instead of unsigned int in order to deal with such long sequences? https://github.com/ucscGenomeBrowser/kent/blob/022eb4f62a0af16526ca1bebcd9e68bd456265dc/src/lib/fa.c#L299 Or would that break the front-end, too? It seems that needHugeMem(size_t size) at https://github.com/ucscGenomeBrowser/kent/blob/cb63a575751c4a60e48340bd320934cccb053f75/src/lib/memalloc.c#L144 can already deal with large amounts of memory, so my guess is that it should be a relatively easy fix, unless the front-end cannot deal with such large values.

Thank you very much! Sergej
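To illustrate the overflow raised in the question above, here is a minimal sketch of a buffer-doubling step done once with a 32-bit size and once with a 64-bit size. It is not the code from fa.c: the helper names doubleSize32 and doubleSize64 are hypothetical, and the only value taken from the report is the buffer size 2147483648 quoted in the error message.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from fa.c: double a buffer size held in
 * 32 bits. Unsigned arithmetic wraps modulo 2^32, so a wrapped result is
 * smaller than the input; in that case 0 is returned to signal overflow. */
static uint32_t doubleSize32(uint32_t size)
{
    uint32_t doubled = size + size;
    return (doubled < size) ? 0 : doubled;
}

/* The same step with a 64-bit size. It only wraps for inputs of 2^63 or
 * more, far beyond any sequence length discussed in this thread. */
static uint64_t doubleSize64(uint64_t size)
{
    uint64_t doubled = size + size;
    return (doubled < size) ? 0 : doubled;
}
```

With these definitions, doubleSize32(2147483648u) reports overflow because 2 * 2^31 = 2^32 does not fit in 32 bits, while doubleSize64(2147483648) yields 4294967296. The sketch uses uint64_t rather than unsigned long because unsigned long is 64 bits wide on typical 64-bit Linux and macOS builds but only 32 bits on 64-bit Windows.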

braneyboo commented 4 years ago

Hey @SergejN, it's inevitable that we'll have to do this work eventually, but there are 32-bit dependencies throughout the browser code, and relatively little need for such huge chromosomes, so it's never reached the top of our priority list. Personally, I think it won't be such a huge problem to fix, but I've been known to underestimate how much trouble can be caused by trying to achieve what seem to be easy goals ;-)

Is there a way to split your large chromosomes into separate arms? It's kind of lame, but there usually aren't any features that span centromeres.

Thanks for looking into this and for your observations. We take the suggestions of our user base very seriously.

SergejN commented 4 years ago

Hi @braneyboo, thanks for the quick response. Splitting the chromosomes might be a solution. However, as you can see in the list above, at least two scaffolds are longer than 4.8 Gb, and a third one is right at the border, which means that even the chromosome arms would have to be split. Moreover, the current assembly doesn't have any gaps between the contigs, which means that the final scaffolds may be even longer than what we have now.
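The arithmetic behind this reply can be checked with a short sketch. It assumes, based only on the issue title, that each sequence must be shorter than 2^31 bases; the exact limit enforced by the tools is not stated in the thread. The macro ASSUMED_MAX_SEQ_LEN and the helper piecesNeeded are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Assumption taken from the issue title: a sequence must be shorter than
 * 2^31 bases. The real limit in the tools may differ. */
#define ASSUMED_MAX_SEQ_LEN ((uint64_t)1 << 31)

/* Hypothetical helper: smallest number of pieces a sequence of seqLen
 * bases must be cut into so that every piece is shorter than maxLen.
 * Expects maxLen >= 2. */
static uint64_t piecesNeeded(uint64_t seqLen, uint64_t maxLen)
{
    uint64_t cap = maxLen - 1;          /* largest allowed piece length */
    return (seqLen + cap - 1) / cap;    /* ceiling division */
}
```

Under that assumption, the four scaffold lengths listed in the original report give 3, 3, 2 and 2 pieces for scaf01 to scaf04. A single split is therefore not enough for the two longest scaffolds, and scaf03 fits into two pieces only if it is cut close to the middle, since half of its length is about 2.10 Gb against a limit of about 2.15 Gb.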