pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

The database was not created successfully by SIFT4G_Create_Genomic_DB #74

Closed danrans123 closed 1 year ago

danrans123 commented 1 year ago

I can't successfully create the database with the test files. The error log is attached: run.sift4g.err.txt

The out log is detailed below: run.sift4g.out.txt

The error log is detailed below: image

The directories ASM1036v1.34, SIFT_predictions and singleRecords_with_scores are all empty.

pauline-ng commented 1 year ago

I don't see any errors. It looks like it was in the middle of running SIFT4G. Did you happen to terminate the program? The database may take a few days to build, depending on your hardware.

To make sure intermediate files are kept, please comment out line 157 of make-SIFT-db-all.pl

Original: system ("rm $rm_dir");

New: #system ("rm $rm_dir");

and then re-run. You should see files in SIFT_predictions being generated
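The edit described above can also be applied from the shell. This is a minimal sketch, not part of the repository: `keep_intermediate_files` is a hypothetical helper name, and it matches the statement by its text rather than by line number, on the assumption that line 157 may differ between versions of make-SIFT-db-all.pl.

```shell
# Sketch: comment out the cleanup call so intermediate files survive a run.
# keep_intermediate_files is a hypothetical helper, not part of the repo.
keep_intermediate_files() {
    local script="$1"
    # Keep a backup so the change is easy to undo.
    cp "$script" "$script.bak"
    # Prefix the statement with '#', preserving indentation. A line that is
    # already commented out does not match, so re-running is harmless.
    sed 's/^\([[:space:]]*\)\(system ("rm \$rm_dir");\)/\1#\2/' \
        "$script.bak" > "$script"
}

# Usage (run from the directory containing the script):
#   keep_intermediate_files make-SIFT-db-all.pl
```

The backup copy is written to make-SIFT-db-all.pl.bak, so the original can be restored once debugging is done.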

pauline-ng commented 1 year ago

Actually, looking a bit closer, I think your protein database is wrong.

Did you set your database to Homo_sapiens.GRCh38.pep.all.fa?

It should be a protein database like SwissProt or Uniref90 - see link. SIFT needs protein homologues from all organisms, not just human.

danrans123 commented 1 year ago

I have replaced

system ("rm $rm_dir"); with

#system ("rm $rm_dir");

and used Uniref90 as the protein database, but the program still terminates with an error. The error log is attached: run.sift4g.err.txt. The error message is shown below: image

The contents of the gene-annotation-src folder are as follows: image

The contents of the chr-src folder are as follows: image

The contents of the dbSNP folder are as follows: image

The contents of the singleRecords folder are as follows: image

The contents of the fasta folder are as follows: image

The contents of the subst folder are as follows: image

But my SIFT_predictions and singleRecords_with_scores folders are empty.

pauline-ng commented 1 year ago

Hi,

I looked at the run.sift4g.err.txt file and there's no error.

SIFT4G_PATH -d PROTEIN_DB -q all_prot_fasta --subst subst --out SIFT_predictions --sub-results

Can you try running SIFT4G directly from the command line? You'll have to fill in SIFT4G_PATH and PROTEIN_DB.

Files should be generated in SIFT_predictions.

Also, on my computer, a human database takes 3 days to complete.
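A hedged sketch of that direct run, wrapped in a shell function. The flags are exactly the ones in the command quoted in this comment; the function name `run_sift4g_direct` and its argument order are made up for illustration, and the paths in the usage line are placeholders to be replaced with your own full paths.

```shell
# Sketch: run SIFT4G by hand and report whether any predictions appeared.
# run_sift4g_direct is a hypothetical wrapper; only the sift4g flags come
# from the command quoted in this thread.
run_sift4g_direct() {
    local sift4g_bin="$1" protein_db="$2" query_fasta="$3"
    local subst_dir="$4" out_dir="$5"
    local status count

    mkdir -p "$out_dir"
    "$sift4g_bin" -d "$protein_db" -q "$query_fasta" \
        --subst "$subst_dir" --out "$out_dir" --sub-results
    status=$?

    # Count whatever files SIFT4G wrote into the output directory.
    count=$(find "$out_dir" -type f | wc -l | tr -d ' ')
    echo "sift4g exit status: $status"
    echo "prediction files: $count"
    return "$status"
}

# Usage (placeholder paths):
#   run_sift4g_direct /full/path/to/sift4g /full/path/to/uniref90.fasta \
#       /full/path/to/all_prot_fasta /full/path/to/subst \
#       /full/path/to/SIFT_predictions
```

A non-zero exit status together with a count of 0 suggests SIFT4G itself is failing, rather than the Perl wrapper around it.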

danrans123 commented 1 year ago

Hi, the last line in my run.sift4g.err.txt file is shown below: image. Is that okay?

pauline-ng commented 1 year ago

You are trying to run the test file? What are you using for your protein database?

danrans123 commented 1 year ago

config file: homo_sapiens-test.txt. I debugged a few more times, and it seems that the run fails to continue at the "start siftsharp, getting the alignments" step, because the output log is as follows.

image

danrans123 commented 1 year ago

Yes, I'm trying to run the test file. I'm using uniref90.fasta.gz as the protein database, left as uniref90.fasta.gz.

pauline-ng commented 1 year ago

In the command line above, your database should be uniref90.fasta, not Homo_sapiens.GRCh38.pep.all.fa.

Try uncompressing uniref90.fasta.gz so that it becomes uniref90.fasta.
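One way to do the uncompress step while keeping the original archive, as a small sketch. `decompress_protein_db` is a hypothetical helper name; the only check it makes is that the result starts with a FASTA header character.

```shell
# Sketch: decompress a gzipped protein database next to the original file.
# decompress_protein_db is a hypothetical helper name.
decompress_protein_db() {
    local gz="$1"
    local fasta="${gz%.gz}"   # uniref90.fasta.gz -> uniref90.fasta

    # -d decompresses, -c writes to stdout so the .gz file is left in place.
    gzip -dc "$gz" > "$fasta" || return 1

    # A FASTA file begins with a '>' header line.
    if [ "$(head -c 1 "$fasta")" != ">" ]; then
        echo "warning: $fasta does not start with '>'" >&2
        return 1
    fi
    echo "$fasta"
}

# Usage (placeholder path):
#   decompress_protein_db /full/path/to/uniref90.fasta.gz
```

The function prints the path of the uncompressed file, which is the value to use for PROTEIN_DB. The uncompressed database is much larger than the archive, so check free disk space first.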

danrans123 commented 1 year ago

After making the changes you suggested, I get this: image. What is the reason for this?

pauline-ng commented 1 year ago

Can you paste the entire screen? What does it say below "Illegal instruction"?

pauline-ng commented 1 year ago

Also, are you running this on a CPU or GPU, and how did you compile sift4g?

danrans123 commented 1 year ago

The following figure shows all the commands that the program runs, based on your "SIFT4G_PATH -d PROTEIN_DB -q all_prot_fasta --subst subst --out SIFT_predictions --sub-results": image

I'm running on a Linux server, and the program stopped automatically with "Illegal instruction" after two processes reached 100%. I don't know what "Illegal instruction" means; no further detail is given. About the SIFT4G installation: I downloaded the SIFT4G source code and compiled it manually.

danrans123 commented 1 year ago

Also, I don't know why, but even though I set PROTEIN_DB in the config .txt file to the path where uniprot_sprot.fasta is located, the output log still shows that -d is the Homo_sapiens.GRCh38.pep.all.fa in the gene-annotation-src folder, as shown below: image image. The Homo_sapiens.GRCh38.pep.all.fa file comes with test_files. When I manually delete this file and repeat the command, it reports the illegal instruction error.

pauline-ng commented 1 year ago

You must use full paths in the config file (not relative paths).

Your server sounds like a CPU machine, not GPU. When you compiled sift4g, did you just use the command make? How much memory do you have on your machine?
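A quick way to check the full-path requirement before a long run, as a sketch. It assumes the config file uses plain KEY=VALUE lines and only looks at SIFT4G_PATH and PROTEIN_DB, the two settings named in this thread; `check_config_paths` is a hypothetical helper name.

```shell
# Sketch: verify that SIFT4G_PATH and PROTEIN_DB are absolute, existing paths.
# Assumes KEY=VALUE lines in the config; check_config_paths is hypothetical.
check_config_paths() {
    local config="$1" key value bad=0
    for key in SIFT4G_PATH PROTEIN_DB; do
        # Take the last assignment of the key, everything after the first '='.
        value=$(grep -E "^${key}=" "$config" | tail -n 1 | cut -d= -f2-)
        case "$value" in
            "") echo "$key: not set"; bad=1 ;;
            /*) [ -e "$value" ] || { echo "$key: $value does not exist"; bad=1; } ;;
            *)  echo "$key: $value is not an absolute path"; bad=1 ;;
        esac
    done
    return "$bad"
}

# Usage (placeholder path):
#   check_config_paths /full/path/to/homo_sapiens-test.txt && echo "paths look fine"
```

The function prints one line per problem and returns non-zero if it finds any, so it can be chained in front of the database build command.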

danrans123 commented 1 year ago

I have used full paths in the config file, and sift4g compiles successfully, as shown below: image. Our server node has at least 122G of memory, as shown below: image

pauline-ng commented 1 year ago

Please post this issue at https://github.com/rvaser/sift4g

Perhaps @rvaser can help (He wrote the SIFT 4G algorithm)

rvaser commented 1 year ago

Can you connect directly to the computing node on which you are running sift4g and run make from there? The illegal instruction error usually means the computing node and the login node have incompatible CPU instruction sets (sift4g uses SIMD instructions in alignment).

Another approach is to replace -march=native with -msse4.1 inside sift4g/vendor/swsharp/swsharp/MakeFile, line 22. Run make afterwards.
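Both suggestions can be sketched in shell. The helper names `cpu_supports` and `use_sse41_instead_of_native` are hypothetical; the first reads the Linux /proc/cpuinfo flags line (where SSE4.1 is listed as sse4_1), and the second performs the flag replacement on whatever makefile path is passed in, rather than relying on the line number.

```shell
# Sketch: check a CPU feature flag on the current node (Linux only).
# cpu_supports is a hypothetical helper; the second argument exists so the
# check can be pointed at a saved copy of another node's cpuinfo.
cpu_supports() {
    local flag="$1" cpuinfo="${2:-/proc/cpuinfo}"
    # Split the first "flags" line into one word per line, match exactly.
    grep -m1 '^flags' "$cpuinfo" | tr ' \t' '\n\n' | grep -qx "$flag"
}

# Sketch: swap -march=native for -msse4.1 in a makefile, keeping a backup.
use_sse41_instead_of_native() {
    local makefile="$1"
    cp "$makefile" "$makefile.bak"
    sed 's/-march=native/-msse4.1/g' "$makefile.bak" > "$makefile"
}

# Usage, run on the computing node from the directory containing sift4g:
#   cpu_supports sse4_1 && echo "SSE4.1 available"
#   use_sse41_instead_of_native sift4g/vendor/swsharp/swsharp/MakeFile
#   (cd sift4g && make)
```

Running `cpu_supports` on both the login node and the computing node shows whether their instruction sets differ. After changing the flag, a clean rebuild may be needed so that objects compiled with -march=native are not reused.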

danrans123 commented 1 year ago

Hi, this issue has been resolved. The cause was that I did not pass --recursive when downloading SIFT4G, so some folders were empty. But I ran into another problem with SIFT4G Annotator: I don't know how to compile it on Linux, as shown below. image Do you have any suggestions?
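For anyone hitting the same empty-folder problem, a sketch of how to detect and repair it. It assumes the bundled dependencies are git submodules under the vendor directory of the sift4g checkout (the thread only confirms sift4g/vendor/swsharp); `empty_vendor_dirs` is a hypothetical helper name.

```shell
# Sketch: list empty directories directly under <repo>/vendor, which is what
# a clone without --recursive leaves behind (assumed layout, see above).
empty_vendor_dirs() {
    local repo="$1"
    find "$repo/vendor" -mindepth 1 -maxdepth 1 -type d -empty
}

# Fresh download with submodules included:
#   git clone --recursive https://github.com/rvaser/sift4g.git
#
# Repairing an existing checkout instead of re-downloading:
#   git -C sift4g submodule update --init --recursive
#
# Check, then rebuild:
#   empty_vendor_dirs sift4g    # prints nothing when all folders are populated
#   (cd sift4g && make)
```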

pauline-ng commented 1 year ago

Java files should run on Linux directly.