kescull / immunopeptidogenomics

Tools for harnessing RNA-seq data to discover cryptic peptides in the immunopeptidome by mass spectrometry
MIT License

squish.o crashes after hashing #3

Open shubham1637 opened 1 month ago

shubham1637 commented 1 month ago

Dear Kate, I am trying to run squish.o but it crashes after hashing. I tried it on both a computing cluster and a Mac, and the error is the same.

/immunopeptidogenomics/squish.o -d file1_transcriptome_3translate.fasta -o file1_full_cryptic.fasta -t 1               
Number of threads specified: 1
input file 1: file1_transcriptome_3translate.fasta
11545597 entries in file1_transcriptome_3translate.fasta
7654321 entries left in file1_transcriptome_3translate.fasta after remove_duplicates()
hash_len = 279740718
creating hash table...
hashing sequence 1 of 7654321
hashing sequence 100001 of 7654321
hashing sequence 200001 of 7654321
...
hashing sequence 7500001 of 7654321
hashing sequence 7600001 of 7654321
Thread 0 up and running
Thread 0 searching for seq 1
zsh: bus error  ./immunopeptidogenomics/squish.o -d file1_transcriptome_3translate.fasta -

I wonder why this is the case?

On the cluster, I get a core dumped message:

/var/spool/slurm/slurmd/job43927444/slurm_script: line 38: 47401 Segmentation fault (core dumped)

The problem seems to be at this line: https://github.com/kescull/immunopeptidogenomics/blob/28f1674310b05815036a2e0075412abd269f0eb3/squish.c#L433

kescull commented 1 month ago

Hi Shubham,

This looks related to Tyler's trouble in Issue #2. I couldn't reproduce the error with his files, and in the end his run worked on a Linux VM on his Mac - it didn't run on Mac OS directly. He did, however, discover a couple of bad bits of code, and I will update squish accordingly asap. (Namely, deleting line 4, "#include ", and also changing the declaration of 'c' from a char to an int - this was a silly error on my part which stops it interpreting the command line correctly on some systems.) I don't expect these changes to fix your problem, though.

I'm now running the program through valgrind to try to identify dodgy memory issues. Just to check - how much memory are you allocating to it on the computing cluster? And what OS is the cluster running?

Thanks, Kate
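For anyone hitting the same thing, here is a minimal sketch of the char-vs-int problem (illustrative only - the option letters match squish's -d/-o/-t usage above, but this is not the actual squish.c code):

/* getopt() returns an int and signals "no more options" with -1.
 * If the result is stored in a char, then on platforms where plain
 * char is unsigned, (char)-1 becomes 255 and the != -1 test never
 * succeeds, so option parsing misbehaves on those systems.
 * Declaring c as an int avoids this. */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int c;                       /* must be int, not char */
    while ((c = getopt(argc, argv, "d:o:t:")) != -1) {
        switch (c) {
        case 'd': printf("database file: %s\n", optarg); break;
        case 'o': printf("output file: %s\n", optarg); break;
        case 't': printf("threads: %s\n", optarg); break;
        default:  return 1;
        }
    }
    return 0;
}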

shubham1637 commented 1 month ago

Thanks Kate, The find_last function fails on my MacBook (Apple M2, Sonoma 14.5). The computing cluster is running CentOS 7; I requested 128G and 8 threads. find_last works on the computing cluster, but then it seems to get stuck in the substring_search function.

kescull commented 1 month ago

Hi Shubham,

Hmmm... 128G should be enough, although the file is quite large. Maybe try testing on a smaller version first (just by copying a few hundred thousand lines of it to a new file). CentOS should work - I have run it on CentOS on our cluster - although maybe it's a version thing... An aside - if you're asking for 8 threads, are you using the -t 8 option?

I have updated the copy in the repo with the fixes I mentioned above, so if you haven't edited your copy of the code yet it will be easy to grab and recompile. I also finished testing with valgrind, which found no errors or memory leaks - which doesn't mean I didn't do something dumb, just that it doesn't give me any clues about how to fix it!

In any case, please feel free to contact me via my Monash email (you can search for me easily, I just don't want to type it here in plain text) if you're ok with sharing the file. Maybe I'll have better luck this time and be able to reproduce the error so I can fix it :) I might also try to find a Mac to borrow to see if I get the error there.

When you say 'stuck', do you mean crashing, or does it appear to hang? Final thought - to run it on CentOS, did you compile on CentOS, or did you copy over the version you compiled on Mac? If the latter, please try compiling on Linux first.

Thanks, Kate

shubham1637 commented 1 month ago

I compiled and ran everything on CentOS. The problem appears when squish is given too many sequences. I am not sure what the issue is, but it could be a numerical overflow.
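For illustration, here is a minimal sketch of how an overflow like that could corrupt a hash index (hypothetical - I don't know how squish.c actually computes its hash; HASH_LEN here is just the hash_len from my log below):

/* Hypothetical sketch, not the actual squish.c code. If intermediate
 * hash arithmetic is done in a signed 32-bit int, a long sequence can
 * overflow it (undefined behaviour), yielding a negative index and an
 * out-of-bounds access into the hash table. */
#include <stdio.h>
#include <stdint.h>

#define HASH_LEN 210506593   /* hash_len from the log */

int bad_index(const char *seq) {
    int h = 0;
    for (; *seq; seq++)
        h = h * 31 + *seq;   /* signed overflow: undefined behaviour */
    return h % HASH_LEN;     /* can be negative -> table[h] is OOB */
}

size_t good_index(const char *seq) {
    uint64_t h = 0;          /* unsigned and wide: wraps safely */
    for (; *seq; seq++)
        h = h * 31u + (unsigned char)*seq;
    return (size_t)(h % HASH_LEN);  /* always in [0, HASH_LEN) */
}

int main(void) {
    const char *s = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ";
    printf("bad:  %d\n", bad_index(s));
    printf("good: %zu\n", good_index(s));
    return 0;
}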

With fewer sequences (3.2M)

Number of threads specified: 8
input file 1: file_transcriptome_3translate_10.fasta
input file 2: file_unmasked_transcriptome_3translate_10.fasta
6846172 entries in file_transcriptome_3translate_10.fasta
3159183 entries left in file_transcriptome_3translate_10.fasta after remove_duplicates()
5158403 entries in file_unmasked_transcriptome_3translate_10.fasta
2571805 entries left in file_unmasked_transcriptome_3translate_10.fasta after remove_duplicates()
After merge, 3249913 entries stored from 2 files
hash_len = 135627990
creating hash table...
hashing sequence 1 of 3249913
hashing sequence 100001 of 3249913
hashing sequence 200001 of 3249913
hashing sequence 300001 of 3249913
...
Successful!

This run only took 3G of RAM with eight threads (-t 8), so the error is not due to lack of memory.

With more sequences (5.1M)

Number of threads specified: 8
input file 1: file_transcriptome_3translate_10.fasta
input file 2: file_unmasked_transcriptome_3translate_10.fasta
10612330 entries in file_transcriptome_3translate_10.fasta
4853774 entries left in file_transcriptome_3translate_10.fasta after remove_duplicates()
9409621 entries in file_unmasked_transcriptome_3translate_10.fasta
4588135 entries left in file_unmasked_transcriptome_3translate_10.fasta after remove_duplicates()
After merge, 5153203 entries stored from 2 files
hash_len = 210506593
creating hash table...
hashing sequence 1 of 5153203
hashing sequence 100001 of 5153203
...
hashing sequence 5100001 of 5153203
Thread 0 up and running
Thread 0 searching for seq 1
Thread 3 up and running
Thread 5 up and running
Thread 1 up and running
Thread 6 up and running
Thread 4 up and running
Thread 2 up and running
Thread 7 up and running
Thread 0 searching for seq 100001

It stops here and I get a segmentation fault. So it is not stuck per se, but exits - possibly due to memory corruption.

kescull commented 1 month ago

Hi Shubham,

Thanks for that. Given that it runs with a short version of the file, that you say it didn't need much memory anyway, and that it got through at least 100000 sequences before crashing, perhaps sheer numbers aren't the issue. It seems more likely that a particular sequence (or something masquerading as one) is causing it. The first thing squish does with the sequences is sort them in alphabetical order, so we can't be sure where the guilty sequence originates in the input file. However, if you made the files on a Mac, I do wonder whether there's some Mac-related character that squish and I don't know about - hopefully an EOF marker! If you made the short input that worked by taking the head of the file, that would have got rid of such a marker.

So firstly, can you open one of the problematic files in vi and check whether there's anything obvious at the end of the file? It should just stop after an aa sequence. Secondly, could you please try running with a similar-sized chunk of sequences taken from the tail of the file? If I'm correct, that should fail even though it's small. Fingers crossed ;)
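If vi struggles with a file that size, a quick check like this could flag odd bytes near the end (a sketch, not part of squish, assuming a well-formed FASTA file contains only printable ASCII and newlines):

/* Sketch: report any byte in the last TAIL_BYTES of a file that is
 * not printable ASCII or a newline (so stray \r or control bytes,
 * e.g. from a Mac-made file, show up with their offsets). */
#include <stdio.h>
#include <ctype.h>

#define TAIL_BYTES 4096L

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s file.fasta\n", argv[0]); return 1; }
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }
    fseek(fp, 0L, SEEK_END);
    long size = ftell(fp);
    long pos = size > TAIL_BYTES ? size - TAIL_BYTES : 0L;
    fseek(fp, pos, SEEK_SET);
    int ch;
    while ((ch = fgetc(fp)) != EOF) {
        if (!isprint(ch) && ch != '\n')
            printf("suspect byte 0x%02x at offset %ld\n", ch, pos);
        pos++;
    }
    fclose(fp);
    return 0;
}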