rvaser / sift4g

Sorting Intolerant From Tolerant For Genomes
GNU General Public License v3.0

Creating local database is always interrupted at aligning step #37

Open clavedec opened 4 months ago

clavedec commented 4 months ago

Hello,

I have been trying to use make-SIFT-db-all.pl to create a database for chiLan. It was going well at first: files were being created in the directories singleRecords, fasta and subst (the other directories are empty). However, I keep getting an email saying the Slurm job has failed with 'Exit code 255', usually after 11 to 12 hours of running, at the step "Aligning queries with candidate sequences". The last run got as far as:

Aligning queries with candidate sequences ... processing database part 1 (size ~1.00 GB): 47.50/100.00%

Since all the files had been created, I decided to run:

~/sift4g/bin/sift4g -d /full_path/scripts_to_build_SIFT_db/GCF_009829145.1/protein.faa -q /full_path/scripts_to_build_SIFT_db/all_prot.fasta --subst /full_path/scripts_to_build_SIFT_db/subst --out /full_path/scripts_to_build_SIFT_db/SIFT_predictions --sub-results

But the alignment does not advance beyond 47.50% and stops with 'Segmentation fault (core dumped)'. Although it looks like a memory problem, the job is using less memory than I allocated for it. Any suggestions as to what could be happening?
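One way to narrow down a crash like this is to run sift4g on smaller query files, one at a time, and record how each run ends. The sketch below is only an illustration: it reuses the flags from the command above (-d, -q, --subst, --out, --sub-results), while the helper names (build_command, describe_returncode, run_one) are made up here and are not part of sift4g. It assumes the query files and output directories already exist.

```python
import signal
import subprocess
from typing import List


def build_command(sift4g_bin: str, database: str, query: str,
                  subst_dir: str, out_dir: str) -> List[str]:
    """Assemble the same sift4g invocation as above for one query file."""
    return [sift4g_bin, "-d", database, "-q", query,
            "--subst", subst_dir, "--out", out_dir, "--sub-results"]


def describe_returncode(code: int) -> str:
    """Translate a subprocess return code into a short description.

    On POSIX systems a negative return code means the process was
    terminated by that signal number (for example -11 for SIGSEGV).
    """
    if code == 0:
        return "ok"
    if code < 0:
        try:
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            return f"killed by signal {-code}"
    return f"exited with status {code}"


def run_one(cmd: List[str]) -> str:
    """Run one command, discard its output, and describe how it ended."""
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return describe_returncode(result.returncode)
```

Calling run_one(build_command(...)) once per query file would show which inputs end with "killed by SIGSEGV" and which finish normally.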

Following a previous issue, I am sharing all_prot.fasta and the config file I used for make-SIFT-db-all.pl at the following link.

Thank you very much for your help!

Best wishes, Clarissa

ChandlerJun commented 3 months ago

Hello,

I encountered the same problem when running the program under Slurm. I had already removed all non-standard residue codes (e.g., X) from the protein sequences beforehand.

I tried:

  1. Increasing memory to 1 TB (same error)
  2. Removing proteins with sequence lengths over 35,000 from all_prot.fasta (same error)
  3. Removing proteins with sequence lengths over 15,000 from all_prot.fasta (no error)
  4. Testing sequences with lengths greater than 35,000 individually (same error)
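The length-based filtering in steps 2 and 3 can be sketched roughly as follows. This is a minimal standalone Python example, not part of sift4g or make-SIFT-db-all.pl; the function names (read_fasta, split_by_length, write_fasta) are hypothetical, and the 15,000 cutoff is simply the value that avoided the error in the tests above.

```python
from typing import Iterable, Iterator, List, TextIO, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records: Iterable[Record],
                    max_len: int) -> Tuple[List[Record], List[Record]]:
    """Separate records into those at most max_len long and the rest."""
    kept: List[Record] = []
    dropped: List[Record] = []
    for header, seq in records:
        (kept if len(seq) <= max_len else dropped).append((header, seq))
    return kept, dropped


def write_fasta(records: Iterable[Record], handle: TextIO,
                width: int = 60) -> None:
    """Write records as FASTA, wrapping sequences at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

For example, opening all_prot.fasta, passing the file handle to read_fasta, and calling split_by_length(records, 15000) gives the short proteins to write back out with write_fasta, plus a list of the long ones that can be tested separately.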

My protein sequence length distribution was:

  Length range      Number of proteins
  0-8,999           67,873
  15,000-15,999     1
  26,000-26,999     2
  35,000-35,999     1
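A length distribution like the one above could be computed with a short script along these lines. This is only a sketch with hypothetical function names (fasta_lengths, length_histogram, format_histogram); it bins every length into fixed 1,000-residue ranges, so the 0-8,999 range would appear as nine separate rows rather than one merged row.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def length_histogram(lengths: Iterable[int],
                     bin_width: int = 1000) -> List[Tuple[int, int]]:
    """Count sequences per bin; each bin is identified by its lower bound."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return sorted(counts.items())


def format_histogram(hist: List[Tuple[int, int]],
                     bin_width: int = 1000) -> List[str]:
    """Render bins as 'start-end: count' lines with thousands separators."""
    return [f"{start:,}-{start + bin_width - 1:,}: {count:,}"
            for start, count in hist]
```

Passing an open all_prot.fasta handle to fasta_lengths and printing the lines from format_histogram(length_histogram(lengths)) makes it easy to spot the few very long proteins.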

My guess is that a chunk is exceeding its memory allocation. I hope this helps the developers either suggest a workaround for proteins longer than 15,000 residues or fix the bug.

Thank you.

Best wishes, Chandler