pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

The build database ran successfully but there was no results file #85

Closed abcdefghijklmn97 closed 9 months ago

abcdefghijklmn97 commented 9 months ago

Hi, when I tried to build a database using the Partial Homo sapiens example, I ran into a strange problem. The run finished without errors, but I did not get any results in the / folder.

The command line and the screen output at the end of the run look like this:

(sift4g) [root@localhost SIFT4G_Create_Genomic_DB-master]# perl make-SIFT-db-all.pl -config test_files/homo_sapiens-test.txt
entered mkdir ./test_files/homo_sapiens_small/GRCh38.83
converting gene format to use-able input
done converting gene format
making single records file
done making single records template
making noncoding records file
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
sift4g -d /nfs/LJH/zhushi/plants/nr.plant.fa -q ./test_files/homo_sapiens_small/all_prot.fasta --subst ./test_files/homo_sapiens_small/subst --out ./test_files/homo_sapiens_small/SIFT_predictions --sub-results
Checking query data and substitutions files

Searching database for candidate sequences

This is the result file structure: [image: file structure]

In addition to the examples, I had the same problem running my own data.

This is the homo_sapiens-test.txt

GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard
MITO_GENETIC_CODE_TABLE=2
MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial

PARENT_DIR=./test_files/homo_sapiens_small
ORG=homo_sapiens
ORG_VERSION=GRCh38.83
DBSNP_VCF_FILE=Homo_sapiens.vcf.gz

#Running SIFT 4G
SIFT4G_PATH=sift4g
PROTEIN_DB=/nfs/LJH/zhushi/plants/nr.plant.fa

# Sub-directories, don't need to change
GENE_DOWNLOAD_DEST=gene-annotation-src
CHR_DOWNLOAD_DEST=chr-src
LOGFILE=Log.txt
ZLOGFILE=Log2.txt
FASTA_DIR=fasta
SUBST_DIR=subst
ALIGN_DIR=SIFT_alignments
SIFT_SCORE_DIR=SIFT_predictions
SINGLE_REC_BY_CHR_DIR=singleRecords
SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores
DBSNP_DIR=dbSNP

# Doesn't need to change
FASTA_LOG=fasta.log
INVALID_LOG=invalid.log
PEPTIDE_LOG=peptide.log
ENS_PATTERN=ENS
SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord

This is the chr1D.txt of my data

GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard

PARENT_DIR=/nfs/LJH/TEST/sift/JYM/Chr1D
ORG=JYM
ORG_VERSION=chr1D

#Running SIFT 4G
SIFT4G_PATH=sift4g
PROTEIN_DB=/mnt/SOFT/nr

# Sub-directories, don't need to change
GENE_DOWNLOAD_DEST=gene-annotation-src
CHR_DOWNLOAD_DEST=chr-src
LOGFILE=Log.txt
ZLOGFILE=Log2.txt
FASTA_DIR=fasta
SUBST_DIR=subst
ALIGN_DIR=SIFT_alignments
SIFT_SCORE_DIR=SIFT_predictions
SINGLE_REC_BY_CHR_DIR=singleRecords
SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores

# Doesn't need to change
FASTA_LOG=fasta.log
INVALID_LOG=invalid.log
PEPTIDE_LOG=peptide.log

I would greatly appreciate your assistance in resolving this issue. If possible, I would like to ask you to check where I might have made a mistake or why I am unable to obtain the result files. If you could provide some guidance or suggestions, I would be very grateful.

Thank you very much!

Best wishes! Jinhua Long

pauline-ng commented 9 months ago

Use full paths, not relative paths. (The config file DIR variables should not have ".")
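
The advice above can be checked mechanically before a long run. The following is a minimal sketch, not part of SIFT4G_Create_Genomic_DB: the function name `parent_dir_is_absolute` is hypothetical, and it only assumes the `KEY=value` config layout shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with SIFT4G_Create_Genomic_DB).
# Succeeds only if the PARENT_DIR value in the given config file is an absolute path.
parent_dir_is_absolute() {
    local config_file="$1"
    local value
    # Print the value of every PARENT_DIR= line and keep the last one.
    value="$(sed -n 's/^PARENT_DIR=//p' "$config_file" | tail -n 1)"
    # An absolute path starts with "/"; "./test_files/..." or an empty value does not.
    case "$value" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}
```

For example, `parent_dir_is_absolute test_files/homo_sapiens-test.txt || echo "PARENT_DIR is relative"` would flag the example config as posted in this thread.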

abcdefghijklmn97 commented 9 months ago

After I changed my config file to PARENT_DIR=/nfs/LJH/TEST/sift/JYM/SIFT4G_Create_Genomic_DB-master/test_files/homo_sapiens_small, the outcome is still the same: there is no result file.

pauline-ng commented 9 months ago

What happens when you run

sift4g -d /nfs/LJH/zhushi/plants/nr.plant.fa -q ./test_files/homo_sapiens_small/all_prot.fasta --subst ./test_files/homo_sapiens_small/subst --out ./test_files/homo_sapiens_small/SIFT_predictions --sub-results

(using full paths and not relative paths)? Are there any result files in /test_files/homo_sapiens_small/SIFT_predictions?
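
One way to answer that question without inspecting the directory by hand is to count the files in the `--out` directory. This is a sketch only: `count_result_files` is a hypothetical helper, and it counts every regular file rather than assuming any particular output file extension.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many regular files sit directly inside a directory.
count_result_files() {
    local out_dir="$1"
    # A missing directory is reported as zero results.
    [ -d "$out_dir" ] || { echo 0; return 0; }
    # List regular files one level deep and count the lines; tr strips wc padding.
    find "$out_dir" -maxdepth 1 -type f | wc -l | tr -d ' '
}
```

Running `count_result_files /full/path/to/SIFT_predictions` after the sift4g command and getting `0` corresponds to the "no result file" symptom described in this issue.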

abcdefghijklmn97 commented 9 months ago

(sift4g) [root@localhost SIFT4G_Create_Genomic_DB-master]# sift4g -d /nfs/SOFT/nr -q /nfs/LJH/TEST/sift/JYM/SIFT4G_Create_Genomic_DB-master/test_files/homo_sapiens_small/all_prot.fasta --subst /nfs/LJH/TEST/sift/JYM/SIFT4G_Create_Genomic_DB-master/test_files/homo_sapiens_small/subst --out /nfs/LJH/TEST/sift/JYM/SIFT4G_Create_Genomic_DB-master/test_files/homo_sapiens_small/SIFT_predictions --sub-results
Checking query data and substitutions files

Searching database for candidate sequences

There is no file in /nfs/LJH/TEST/sift/JYM/SIFT4G_Create_Genomic_DB-master/test_files/homo_sapiens_small/SIFT_predictions

abcdefghijklmn97 commented 9 months ago

My own data is a plant genome, so I used a plant protein database; that run also ended without any results.

The protein database I used with the Partial Homo sapiens example is plant-based as well. Could that have an effect?

Also, could the nuclear genome be related to getting no results?

pauline-ng commented 9 months ago

Use UniRef90 fasta

https://www.uniprot.org/help/downloads

The SIFT4G algorithm will find the homologous sequences. If your plant protein database is too small and has no homologues, it won't work.
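
As a rough sanity check on the protein database, the number of sequences in the FASTA file can be counted. This is a sketch under assumptions: `count_fasta_records` is a hypothetical helper, and a large count does not prove that homologues of your proteins are present; it only catches an empty or unexpectedly tiny file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of sequence records in a FASTA file.
count_fasta_records() {
    local fasta="$1"
    # Each record starts with a ">" header line. grep -c prints the count but exits
    # non-zero when the count is 0, so "|| true" keeps this usable under "set -e".
    grep -c '^>' "$fasta" || true
}
```

For example, `count_fasta_records /path/to/uniref90.fasta` prints the record count of the downloaded database (the path here is a placeholder).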

abcdefghijklmn97 commented 9 months ago

When I downloaded UniRef90 fasta and used it as the protein database, the operation ended successfully but without any result file, just like before.

stella-huynh commented 9 months ago

Hi,

I got the same issue as you with the test data set and I may have found the solution.

I saw this reply from another issue opened here: https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB/issues/81#issuecomment-1595348807

I initially used the sift4g that was already installed on my cluster. I checked my gcc version and it was v4.8.5 by default, so I changed it to the newest version (v11.2.0 for me). I don't remember whether that made things better or left them the same, but there were still some issues. So, with gcc v11.2.0 loaded, I installed and compiled sift4g directly from GitHub (here: https://github.com/rvaser/sift4g). I ran the test data set again with this newly compiled sift4g and I now get the chromosome files in the database folder. I also realized that the output message shown on the screen while running the perl script had actually been incomplete before! This is the full output message I now get on the screen:

[shuynh@core-login1 scripts_to_build_SIFT_db]$ perl make-SIFT-db-all.pl -config test_files/homo_sapiens-test.txt
converting gene format to use-able input
done converting gene format
making single records file
done making single records template
making noncoding records file
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
/shared/projects/domisol/scripts/sift4g/bin/sift4g -d /shared/projects/domisol/scripts/SIFT/uniprot_sprot.fasta -q /shared/projects/domisol/scripts/SIFT/scripts_to_build_SIFT_db/test_files/homo_sapiens_small/all_prot.fasta --subst /shared/projects/domisol/scripts/SIFT/scripts_to_build_SIFT_db/test_files/homo_sapiens_small/subst --out /shared/projects/domisol/scripts/SIFT/scripts_to_build_SIFT_db/test_files/homo_sapiens_small/SIFT_predictions --sub-results 
** Checking query data and substitutions files **
* processing queries: 100.00/100.00% *

** Searching database for candidate sequences **
* processing database part 2 (size ~0.25 GB): 100.00/100.00% *

** Aligning queries with candidate sequences **
* processing database part 1 (size ~1.00 GB): 100.00/100.00% *

** Selecting alignments with median threshold: 2.75 **
* processing queries: 100.00/100.00% *

** Generating SIFT predictions with sequence identity: 100.00% **
* processing queries: 100.00/100.00% *

done getting all the scores
populating databases
checking the databases
zipping up /shared/projects/domisol/scripts/SIFT/scripts_to_build_SIFT_db/test_files/homo_sapiens_small/chr-src/*
All done!

[shuynh@core-login1 scripts_to_build_SIFT_db]$ ll test_files/homo_sapiens_small/GRCh38.83/
total 225748
-rw-rw----+ 1 shuynh shuynh 230776344 14 sept. 02:18 21.gz
-rw-rw----+ 1 shuynh shuynh    117688 14 sept. 02:16 21.regions
-rw-rw----+ 1 shuynh shuynh       507 14 sept. 02:21 21_SIFTDB_stats.txt
-rw-rw----+ 1 shuynh shuynh       240 14 sept. 02:19 CHECK_GENES.LOG
-rw-rw----+ 1 shuynh shuynh      1049 14 sept. 02:22 homo_sapiens-test.txt
-rw-rw----+ 1 shuynh shuynh    230219 14 sept. 02:18 MT.gz
-rw-rw----+ 1 shuynh shuynh       444 14 sept. 02:18 MT.regions
-rw-rw----+ 1 shuynh shuynh       480 14 sept. 02:21 MT_SIFTDB_stats.txt

Although sift4g has been compiled with gcc v11.2.0, I still have to load gcc v11.2.0 again whenever I open a new session; otherwise sift4g shows some error messages. I started running it on my real dataset today. It takes some time, but it seems to be running fine so far.

I hope this will help.

Stella
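
The gcc check described above can be scripted so it runs at the start of every session. This is a minimal sketch: both function names are hypothetical, and the minimum major version is left to the caller because the thread only reports that v4.8.5 failed and v11.2.0 worked, not an official requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the gcc currently on PATH.

# Print the major version of a version string such as "11.2.0".
# With no argument, ask the active gcc via "gcc -dumpversion".
gcc_major_version() {
    local version="${1:-$(gcc -dumpversion)}"
    # Keep only the part before the first dot.
    echo "${version%%.*}"
}

# Succeed if the gcc major version is at least the required one.
# Usage: gcc_at_least REQUIRED_MAJOR [VERSION_STRING]
gcc_at_least() {
    local required_major="$1"
    local version="${2:-}"
    local major
    major="$(gcc_major_version "$version")"
    [ "$major" -ge "$required_major" ]
}
```

A possible use is `gcc_at_least 11 || echo "load a newer gcc before building or running sift4g"`, where 11 is only the version that worked for the poster above. How a newer gcc is loaded is site-specific, and sift4g would then be rebuilt following the instructions in https://github.com/rvaser/sift4g.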

abcdefghijklmn97 commented 9 months ago

Thank you, it worked fine after I changed the gcc version.

Best wishes!

pauline-ng commented 9 months ago

Thanks @stella-huynh . Appreciate it!

ksolari commented 9 months ago

I'm having this issue as well. I've updated my gcc (gcc (GCC) 10.3.0), reinstalled sift4g, and have made sure that I have loaded the updated gcc version before running, but it still does not create a results file.

Here is the command I'm running: perl make-SIFT-db-all.pl -config test_files/homo_sapiens-test.txt

The homo_sapiens-test.txt file:

GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard
MITO_GENETIC_CODE_TABLE=2
MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial

PARENT_DIR=./test_files/homo_sapiens_small
ORG=homo_sapiens
ORG_VERSION=GRCh38.83
DBSNP_VCF_FILE=Homo_sapiens.vcf.gz

#Running SIFT 4G
SIFT4G_PATH=/oak/stanford/groups/dpetrov/ksolari/SIFT/sift4g/bin/sift4g
PROTEIN_DB=/oak/stanford/groups/dpetrov/ksolari/SIFT/uniref90.fasta

# Sub-directories, don't need to change
GENE_DOWNLOAD_DEST=gene-annotation-src
CHR_DOWNLOAD_DEST=chr-src
LOGFILE=Log.txt
ZLOGFILE=Log2.txt
FASTA_DIR=fasta
SUBST_DIR=subst
ALIGN_DIR=SIFT_alignments
SIFT_SCORE_DIR=SIFT_predictions
SINGLE_REC_BY_CHR_DIR=singleRecords
SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores
DBSNP_DIR=dbSNP

# Doesn't need to change
FASTA_LOG=fasta.log
INVALID_LOG=invalid.log
PEPTIDE_LOG=peptide.log
ENS_PATTERN=ENS
SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord

The screen output:

converting gene format to use-able input
done converting gene format
making single records file
done making single records template
making noncoding records file
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
/oak/stanford/groups/dpetrov/ksolari/SIFT/sift4g/bin/sift4g -d /oak/stanford/groups/dpetrov/ksolari/SIFT/uniref90.fasta -q ./test_files/homo_sapiens_small/all_prot.fasta --subst ./test_files/homo_sapiens_small/subst --out ./test_files/homo_sapiens_small/SIFT_predictions --sub-results
** Checking query data and substitutions files **
* processing queries: 100.00/100.00% *

** Searching database for candidate sequences **

Any suggestions that anyone can offer will be much appreciated! Thank you!!!

Katie

pauline-ng commented 9 months ago

@ksolari It would be better if you opened up a new issue. After opening up a new issue, please list your directory contents and file sizes.

ksolari commented 9 months ago

Thank you! Will do!

abcdefghijklmn97 commented 9 months ago

@ksolari Before installing sift4g, run gcc -v to make sure gcc is 10.3.0, then run the command.

pauline-ng commented 9 months ago

This issue is closed. Thank you @abcdefghijklmn97 for your help.

@ksolari has opened a new separate thread. (It's very hard to keep track of issues when the same person is posting the same issue multiple times)