pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

Error building my own database #24

Closed sunshichao0916 closed 3 years ago

sunshichao0916 commented 3 years ago

Hi SIFT4G team, I have a problem building my own database with the sift4g tools. When I execute the command perl make-SIFT-db-all.pl -config Glymax_config.txt, the all_prot.fasta file is generated in the parentDir folder, but no database is generated and no error is returned. I don't know where the problem is. I hope you can help, thank you.

My input files are shown below:

  1. Glymax_config.txt

GENE_DOWNLOAD_SITE=/vol3/agis/wangli_group/sunshichao/soybean/P101SC17040637-01-F004/SIFT4G/Glycine_max/gene-annotation-src/Glycine_max.gene.gtf.gz
PEP_FILE=/vol3/agis/wangli_group/sunshichao/soybean/P101SC17040637-01-F004/SIFT4G/Glycine_max/gene-annotation-src/soybean.pep.fa
CHR_DOWNLOAD_SITE=/vol3/agis/wangli_group/sunshichao/soybean/P101SC17040637-01-F004/SIFT4G/database/Glycine_max.Glycine_max_v2.1.dna.toplevel.fa.gz

GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard
MITO_GENETIC_CODE_TABLE=11
MITO_GENETIC_CODE_TABLENAME=Plant Plastid Code

PARENT_DIR=/vol3/agis/wangli_group/sunshichao/soybean/P101SC17040637-01-F004/SIFT4G/PARENT_DIR
ORG=Glycine_max
ORG_VERSION=Gma2.v1

#Running SIFT 4G
SIFT4G_PATH=/vol3/agis/wangli_group/sunshichao/miniconda3/bin/sift4g
PROTEIN_DB=/vol3/agis/wangli_group/sunshichao/soybean/P101SC17040637-01-F004/SIFT4G/database/uniref90.fasta

#Sub-directories, don't need to change
LOGFILE=Log.txt
ZLOGFILE=Log2.txt
GENE_DOWNLOAD_DEST=gene-annotation-src
CHR_DOWNLOAD_DEST=chr-src
FASTA_DIR=fasta
SUBST_DIR=subst
SIFT_SCORE_DIR=SIFT_predictions
SINGLE_REC_BY_CHR_DIR=singleRecords/
SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores
DBSNP_DIR=dbSNP

#Doesn't need to change
FASTA_LOG=fasta.log
INVALID_LOG=invalid.log
PEPTIDE_LOG=peptide.log
ENS_PATTERN=ENS
SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord

  2. */chr-src/Glycine_max.Glycine_max_v2.1.dna.toplevel.fa [image]
  3. */gene-annotation-src/Glycine_max.gene.gtf.gz [image]
  4. */gene-annotation-src/soybean.pep.fa [image]
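
Editor's aside: since the problem later turned out to involve the sift4g executable, a quick sanity check of the paths in a config like the one above can save time. The following is only a minimal sketch, not part of the SIFT4G tooling. The helper names `parse_config` and `missing_paths` are hypothetical, and the list of keys treated as local paths is an assumption based on this particular config (if your download sites are URLs, drop them from the list).

```python
from pathlib import Path

# Keys assumed to hold local files or directories in the config from this thread.
# This list is an assumption, not something defined by SIFT4G itself.
PATH_KEYS = ("GENE_DOWNLOAD_SITE", "PEP_FILE", "CHR_DOWNLOAD_SITE",
             "PARENT_DIR", "SIFT4G_PATH", "PROTEIN_DB")


def parse_config(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_paths(settings, keys=PATH_KEYS):
    """Return (key, value) pairs that are unset or do not exist on disk."""
    problems = []
    for key in keys:
        value = settings.get(key, "")
        if not value or not Path(value).exists():
            problems.append((key, value))
    return problems
```

Typical use would be `missing_paths(parse_config(Path("Glymax_config.txt").read_text()))`; an empty list means every listed path exists. Note that this only checks existence, not whether the sift4g binary actually runs.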
pauline-ng commented 3 years ago

Hi,

Can you go to: https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB#monitoring-the-database-creation-process

and run the commands above to check what's been created and what hasn't?

That's very odd that the protein .fa file is created but the database is not.

Thanks, Pauline

sunshichao0916 commented 3 years ago

Hi, I reran the perl make-SIFT-db-all.pl -config command following the tutorial on GitHub. The following files were generated in the PARENT_DIR folder:

  1. /chr-src/directory.index.dir, /chr-src/directory.index.pag

  2. /fasta/.fasta [image]

  3. /gene-annotation-src/noncoding.txt [image], /gene-annotation-src/protein_coding_genes.txt [image]

  4. */singleRecords/ [image]

  5. */subst [image]

The following files were not generated:

  1. SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores
  2. DBSNP_DIR=dbSNP

Thank you, Sun Shichao

pauline-ng commented 3 years ago

Hi Sun,

This looks correct. Just to double check, can you confirm your /singleRecords/ and /subst are not empty?

If those directories are not empty, then everything looks right except for the call to the SIFT 4G algorithm (the algorithm that actually makes the predictions). Can you confirm that the path to the sift4g executable is correct? Also, when you run the test files, are you able to make predictions with SIFT 4G?

Thanks, Pauline
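
Editor's aside: the check requested above (which output sub-directories exist and which are empty) can be sketched as follows. This is an illustration only; `dir_status` is a hypothetical helper, and the sub-directory names are simply copied from the config posted earlier in this thread.

```python
from pathlib import Path

# Sub-directory names taken from the config posted in this thread.
SUBDIRS = ("chr-src", "gene-annotation-src", "fasta", "subst",
           "SIFT_predictions", "singleRecords",
           "singleRecords_with_scores", "dbSNP")


def dir_status(parent_dir, subdirs=SUBDIRS):
    """Map each sub-directory to 'missing', 'empty', or its entry count."""
    status = {}
    for name in subdirs:
        path = Path(parent_dir) / name
        if not path.is_dir():
            status[name] = "missing"
            continue
        count = sum(1 for _ in path.iterdir())
        status[name] = "empty" if count == 0 else count
    return status
```

Calling `dir_status` on the directory where the pipeline writes its output gives a one-glance summary; in the situation described here one would expect `subst` and `singleRecords` to show counts while `singleRecords_with_scores` and `dbSNP` show `missing`.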

sunshichao0916 commented 3 years ago

Hi Pauline,

First, I can confirm that my /singleRecords/ and /subst directories are not empty. Then I ran the test files, but that failed. Finally, I checked the sift4g executable and found that it does not work properly, so I reinstalled SIFT4G, but the following error is displayed during make. Is there a solution?

[image] [image]

Sorry to trouble you so many times. Thank you, Sun Shichao

pauline-ng commented 3 years ago

Hi Sun,

Robert maintains the sift4g algorithm. Please post this issue (the error you wrote just above) on

https://github.com/rvaser/sift4g

and Robert will probably be able to help.

Best, Pauline

rvaser commented 3 years ago

Hi Sun, the error indicates that your compiler does not support the C++11 standard. Try updating both the gcc and g++ compilers.

Best regards, Robert
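
Editor's aside: a rough way to check whether an installed g++ is new enough is to compare its version against GCC 4.8.1, which the GCC project documents as its first feature-complete C++11 release. The sketch below is an assumption-laden illustration: the function names are hypothetical, the version is scraped from the text of `g++ --version`, and sift4g may in practice need a newer compiler than this threshold.

```python
import re
import subprocess


def parse_gcc_version(version_text):
    """Extract the first X.Y or X.Y.Z version number from compiler output."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError("no version number found in compiler output")
    return tuple(int(part) if part else 0 for part in match.groups())


def supports_cxx11(version, minimum=(4, 8, 1)):
    """True if the version is at least the assumed C++11-complete GCC release."""
    return version >= minimum


def installed_gxx_version(compiler="g++"):
    """Run '<compiler> --version' and parse the version it reports."""
    result = subprocess.run([compiler, "--version"],
                            capture_output=True, text=True, check=True)
    return parse_gcc_version(result.stdout)
```

For example, `supports_cxx11(installed_gxx_version())` returning `False` would be consistent with the make error described above.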

sunshichao0916 commented 3 years ago

Hi Robert,

I upgraded the software, but the database still failed to build. The alignment folder was empty, and the log file is as follows:

[image]

When I run the test files, the run also stops at the alignment step.

[image]

Thanks !

Best regards, Sun Shichao

rvaser commented 3 years ago

What is printed after the sift4g command?

sunshichao0916 commented 3 years ago

Hi Robert and Pauline, no error information was displayed, but the alignment step did not proceed. After two days of troubleshooting, I still couldn't find the problem.

The last lines printed to the nohup.out file are shown in the following figure:

[image]

Thanks. Sun

pauline-ng commented 3 years ago

Hi Sun,

Can you confirm that the all_prot.fasta file contains protein sequences?

Also, were you able to run the test files OK?

Pauline

sunshichao0916 commented 3 years ago

Hi Pauline,

The all_prot.fasta file contains protein sequences. [image]

The test files do not run normally; the run stops at the alignment step. [image]

Thanks for your answers.

Sun

sunshichao0916 commented 3 years ago

Hi Pauline and Robert,

Thank you for your patience. Unfortunately, for personal reasons, I am still unable to solve this problem.

Can I invite you or a member of the sift4g team to build the SIFT database for me?

Sincerely, Sun

pauline-ng commented 3 years ago

Hi Sun,

I might be able to ask a former post-doc to build it for you.
For academia, we ask that the person who builds the SIFT database be added as an author on the paper. For industry, it's a service and a fee would be charged.

Thanks, Pauline

sunshichao0916 commented 3 years ago

Hi Pauline,

We are building the SIFT database for academic purposes, and we agree to add the person who builds the SIFT database as an author on the paper. How can I contact you?

Thank you Pauline.
Sun

pauline-ng commented 3 years ago

@sunshichao0916

I tried looking up your email address. Is it something like 3........33@qq.com ?

If yes, please check your inbox and respond to me, and we can get the ball rolling.

Thanks, Pauline

sunshichao0916 commented 3 years ago

Update: First, I removed the protein sequences containing "XXX", and then I divided the large file into several separate files by chromosome number. When I ran the database build command again, the database was built successfully.
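
Editor's aside: the two steps in this update can be sketched roughly as below. The thread does not say which file was filtered or which file was split, nor how chromosomes were identified, so everything here is an assumption: the function names are hypothetical, the filter is a plain substring test for "XXX", and the caller must supply `chrom_of`, a function that maps a FASTA header to a chromosome label.

```python
from collections import defaultdict


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_records_with_run(records, run="XXX"):
    """Keep only records whose sequence does not contain the given run."""
    return [(h, s) for h, s in records if run not in s.upper()]


def group_by_chromosome(records, chrom_of):
    """Group records by the label returned by the caller-supplied chrom_of."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[chrom_of(header)].append((header, seq))
    return dict(groups)


def to_fasta(records, width=60):
    """Serialize records back to FASTA text with wrapped sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)
```

Each group returned by `group_by_chromosome` could then be written to its own file with `to_fasta`. Whether the build needs matching per-chromosome genome and annotation files is not stated in this thread.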

cfz1998 commented 2 years ago

Update: First, I removed the protein sequences containing "XXX", and then I divided the large file into several separate files by chromosome number. When I ran the database build command again, the database was built successfully.

Thanks a lot, miss!

pauline-ng commented 2 years ago

Great, thanks for figuring it out!

cfz1998 commented 2 years ago

Hey Pauline! It works fine when I split the run by chromosome. But I don't understand the "PROTEIN_DB=" setting. When I build a database for a species, do I have to use the proteome of that species, or is this option for selecting species related to it?

Thank you! dcf

pauline-ng commented 2 years ago

Hi Dcf,

PROTEIN_DB should be a database of protein sequences such as NCBI non-redundant (nr), SWISS-PROT/TrEMBL, etc. SIFT will search this database for homologous sequences.

Thanks, Pauline

cfz1998 commented 2 years ago

Hi Pauline, I got it! Thank you very much!