lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

`seqtk seq` segfaults on 10G scaffolds #191

Closed mcshane closed 1 year ago

mcshane commented 2 years ago

We have a scaffolded assembly of the 90G plant genome. Each chromosome looks to be around 10-11G in length, and `seqtk seq` segfaults on these. The last part of the strace output is below. Playing around with the various scaffold lengths we have in the current assembly, it looks like it starts to fall over once the scaffold length exceeds the int32 range.

```
read(3, "ACCCATAATATTTTTTTTCAAACAATTATTAT"..., 16384) = 16384
read(3, "AATATTC\nCCTTTCTTCGTGGTATAGGATATG"..., 16384) = 16384
read(3, "CGCTCTGCGCCCACTATGCTCCCTGCGGGCGC"..., 16384) = 16384
mremap(0x7f34461f7000, 4294971392, 8589938688, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 8589938688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x5618e365b000)                     = 0x5616e364f000
mmap(NULL, 8590069760, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xfffffff1} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)
```

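
For readers following the numbers in the trace, here is a small arithmetic sketch in Python (chosen for brevity; seqtk itself is written in C). It only decodes values that appear above and checks a scaffold length against the signed 32-bit range. The helper names `fits_int32`, `wrap_int32` and `describe_size` are made up for this illustration and are not part of seqtk. The sketch does not establish the root cause; it just shows that the failed resize was from roughly 4 GiB to roughly 8 GiB and that the faulting address lies just below 2^32.

```python
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold
GIB = 2**30             # bytes in one GiB


def fits_int32(length):
    """Return True if a non-negative length fits in a signed 32-bit integer."""
    return 0 <= length <= INT32_MAX


def wrap_int32(value):
    """Reinterpret the low 32 bits of value as a two's-complement int32."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def describe_size(n_bytes):
    """Split a byte count into (whole GiB, remaining bytes)."""
    return divmod(n_bytes, GIB)


if __name__ == "__main__":
    # A scaffold of about 10G bases (hypothetical round figure) is far
    # beyond the int32 range; truncated to 32 bits it becomes a wrong,
    # much smaller number.
    scaffold_len = 10_000_000_000
    print(fits_int32(scaffold_len))        # False
    print(wrap_int32(scaffold_len))        # 1410065408

    # Sizes taken from the mremap() call in the trace.
    print(describe_size(4294971392))       # (4, 4096)
    print(describe_size(8589938688))       # (8, 4096)

    # Faulting address from the SIGSEGV line.
    fault_addr = 0xFFFFFFF1
    print(2**32 - fault_addr)              # 15
    print(wrap_int32(fault_addr))          # -15
```

Both sizes passed to `mremap` are an exact power-of-two number of GiB plus one 4096-byte page, so the trace shows a buffer doubling from 4 GiB to 8 GiB and failing with ENOMEM before the segfault.
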
mcshane commented 2 years ago

@c-zhou has added a 64-bit version of seqtk here: c-zhou/seqtk64@25c656cc6a9e48ab45cd39a9c6c74c48b4cf694b

@lh3, would you be open to pulling this in, or better to leave as a seqtk64 fork?

lh3 commented 2 years ago

I will merge if @c-zhou sends a pull request. Thanks!

c-zhou commented 2 years ago

Hi @lh3, I made a pull request https://github.com/lh3/seqtk/pull/192#issue-1260492841. Best.

lh3 commented 1 year ago

Just merged the pull request. However, this didn't make it into the new v1.4 release; it will need more testing. I am closing this issue now.