lh3 / miniprot

Align proteins to genomes with splicing and frameshift
https://lh3.github.io/miniprot/
MIT License

Seg-fault during index formation #3

Closed rwhetten closed 2 years ago

rwhetten commented 2 years ago

I cloned the repo and compiled the code, but I get a segmentation fault when trying to index a fragmented genome with 1.75 million scaffolds. The executable works fine to make an index of GRCh38 (including all alternate scaffolds, so 63 Gb total), so it doesn't appear to be the software itself. Is there a limit on the number of scaffolds in an assembly for indexing? Alternatively, are there characters that might cause problems if present in scaffold names?
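One way to see which characters the scaffold names actually contain is to tally everything in the FASTA header lines that is not a letter, digit, or underscore. The sketch below is illustrative only: the function name `unusual_header_chars` is hypothetical, and it says nothing about whether miniprot is sensitive to any of these characters.

```python
import re
from collections import Counter

# Characters treated as "plain" here: letters, digits, underscore.
# This set is an assumption for illustration, not a miniprot rule.
PLAIN = re.compile(r"[A-Za-z0-9_]")

def unusual_header_chars(lines):
    """Count characters other than letters, digits and underscore in FASTA headers."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\r\n")
            counts.update(ch for ch in header if not PLAIN.fullmatch(ch))
    return counts

# Small in-memory demo; for a real assembly, pass an open file handle instead.
demo = [">scaf 1.2\n", "ACGT\n", ">scaf_2\n", "GGCC\n"]
for ch, n in unusual_header_chars(demo).most_common():
    print(repr(ch), n)
```

Because the function accepts any iterable of lines, it can read a large assembly one line at a time without loading it into memory.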

lh3 commented 2 years ago

Please try the latest version from GitHub HEAD. There was a bug, though I am not sure whether it would lead to a segfault.

rwhetten commented 2 years ago

I ran git pull, make clean, and make, then tried the index-building job again. It ran longer this time and wrote the following to stderr:

[M::mp_ntseq_read@64.109*0.99] read 22104357184 bases in 1755249 contigs
[M::mp_idx_build@64.134*0.99] 174414660 blocks
[M::mp_idx_build@732.542*14.65] collected syncmers
/var/spool/slurm/slurmd/job5292490/slurm_script: line 22: 777006 Segmentation fault

The command used was ~/miniprot -t16 -d $INDEX $GENOME; RAM use reached 100 GB and the runtime was 21 minutes.

lh3 commented 2 years ago

One potential cause is memory. The Ensembl version of GRCh38 has many ambiguous bases. Although the total contig length is 63 Gb, there is only ~3.2 Gb of actual sequence. Your assembly is about 7 times larger. I guess it will take 120-150 GB of memory for indexing.
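The size comparison above can be checked with a short sketch. It counts total and unambiguous (A/C/G/T) bases in FASTA lines, then compares the 22,104,357,184 bases reported in the stderr log with the ~3.2 Gb figure quoted for GRCh38. The function name `count_bases` is hypothetical. The sketch does not estimate memory, since the thread gives no per-base figure.

```python
def count_bases(lines):
    """Return (total_bases, acgt_bases) for an iterable of FASTA lines."""
    total = acgt = 0
    for line in lines:
        if line.startswith(">"):
            continue  # skip header lines
        seq = line.strip().upper()
        total += len(seq)
        acgt += sum(seq.count(base) for base in "ACGT")
    return total, acgt

# Tiny in-memory demo: N, R and Y are ambiguity codes and are not counted as A/C/G/T.
total, acgt = count_bases([">chr_demo\n", "ACGTNNNN\n", "nnacgtRY\n"])
print(total, acgt)

# Size ratio using the numbers quoted in the thread.
assembly_bases = 22_104_357_184  # from the mp_ntseq_read log line
human_bases = 3.2e9              # approximate figure from the comment above
ratio = assembly_bases / human_bases
print(f"{ratio:.1f}x")
```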

rwhetten commented 2 years ago

The node that was running the job had 370 GB of RAM allocated, and the output doesn't indicate an out-of-memory error in any way I recognize. The exit code was 139, and RAM use peaked at 100.5 GB. Would non-alphanumeric, non-underscore characters (such as space or dot) in scaffold names be a problem? Thinking of workarounds: is there any way to merge indexes of genome subsets into a single index after they are created? I could split the genome into 8 subsets and index them separately. If indexes can't be joined, I could align against them separately, with the loss of some information.
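The splitting workaround could be sketched as follows. Each scaffold goes to whichever subset currently holds the fewest bases, so the subsets end up roughly equal in size. The function names `read_fasta` and `assign_to_subsets` are hypothetical, and the greedy balancing rule is just one reasonable choice. Nothing in the thread says whether miniprot can merge separately built indexes.

```python
import heapq

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def assign_to_subsets(records, n_subsets):
    """Greedily place each record in the subset that currently holds the fewest bases."""
    heap = [(0, i) for i in range(n_subsets)]  # (bases so far, subset index)
    heapq.heapify(heap)
    subsets = [[] for _ in range(n_subsets)]
    for header, seq in records:
        size, i = heapq.heappop(heap)
        subsets[i].append((header, seq))
        heapq.heappush(heap, (size + len(seq), i))
    return subsets

# Small in-memory demo with two subsets; the thread proposes eight.
demo = [">a\n", "ACGTACGT\n", ">b\n", "ACG\n", ">c\n", "ACGTA\n", ">d\n", "AC\n"]
subsets = assign_to_subsets(read_fasta(demo), 2)
for i, subset in enumerate(subsets):
    print(i, [h for h, _ in subset], sum(len(s) for _, s in subset))
```

For a 22 Gb assembly, collecting the records in lists would be wasteful. A practical version would keep the same heap logic but write each record to its subset's output file as soon as it is read. Each subset file could then be indexed with the same command used in the thread.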

lh3 commented 2 years ago

The segmentation fault was most likely caused by #4, which has been fixed. Let me know if you still have the problem. I am closing this issue for now.