pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

Possible fasta related errors with creating a database #81

Closed hclendenin closed 1 year ago

hclendenin commented 1 year ago

Hello and thank you in advance for your time.

I've been working to create a database for Ursus americanus and have been getting error messages that seem to be related to fasta files. I will include my config file and the messages below.

This is what is contained in my config file:

GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard
MITO_GENETIC_CODE_TABLE=2
MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial

PARENT_DIR=/home/hrclndnn/SIFT/SIFT_databases/uamer
ORG=Ursus_americanus
ORG_VERSION=ASM334442v1

Running SIFT 4G

SIFT4G_PATH=/home/hrclndnn/SIFT/sift4g/bin/sift4g
PROTEIN_DB=/home/hrclndnn/SIFT/SIFT_databases/uamer/protein-db/uniref90.fasta

Sub-directories, don't need to change

GENE_DOWNLOAD_DEST=gene-annotation-src
CHR_DOWNLOAD_DEST=chr-src
LOGFILE=Log.txt
ZLOGFILE=Log2.txt
FASTA_DIR=fasta
SUBST_DIR=subst
ALIGN_DIR=SIFT_alignments
SIFT_SCORE_DIR=SIFT_predictions
SINGLE_REC_BY_CHR_DIR=singleRecords
SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores
DBSNP_DIR=dbSNP

Doesn't need to change

FASTA_LOG=fasta.log
INVALID_LOG=invalid.log
PEPTIDE_LOG=peptide.log
ENS_PATTERN=ENS
SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord

An abbreviated version of the messages when I run perl /home/hrclndnn/SIFT/scripts_to_build_SIFT_db/make-SIFT-db-all.pl -config /home/hrclndnn/SIFT/scripts_to_build_SIFT_db/test_files/uamer_config.txt:

converting gene format to use-able input
done converting gene format
making single records file
Use of uninitialized value $fasta_subseq in concatenation (.) or string at make-single-records-BIOPERL.pl line 210, line 1.
... (this repeats with additional line numbers following) ...
Use of uninitialized value $fasta_subseq in concatenation (.) or string at generate-fasta-subst-files-BIOPERL.pl line 446, line 58026.
... (similarly, this message is repeated with other line numbers following) ...
done making the fasta sequences
start siftsharp, getting the alignments
cat: /home/hrclndnn/SIFT/SIFT_databases/uamer/fasta/*.fasta: No such file or directory
/home/hrclndnn/SIFT/sift4g/bin/sift4g -d /home/hrclndnn/SIFT/SIFT_databases/uamer/protein-db/uniref90.fasta -q /home/hrclndnn/SIFT/SIFT_databases/uamer/all_prot.fasta --subst /home/hrclndnn/SIFT/SIFT_databases/uamer/subst --out /home/hrclndnn/SIFT/SIFT_databases/uamer/SIFT_predictions --sub-results
Checking query data and substitutions files

EXITING! No valid queries to process.

Please let me know if there's any additional information you need or if you have any suggestions for resolving this issue.

Thank you.
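Editorial note: in the log above, the `cat` step finds no `*.fasta` files under the `fasta` sub-directory, which is consistent with sift4g then reporting no valid queries. A minimal shell sketch for checking that directory before launching sift4g follows; `count_fasta_files` and `check_fasta_dir` are hypothetical helper names, not part of the SIFT4G scripts, and the sketch assumes a `find` that supports `-maxdepth` (GNU or BSD).

```shell
# Count regular files ending in .fasta directly inside the given directory.
# Prints 0 when the directory is empty or does not exist.
count_fasta_files() {
    find "$1" -maxdepth 1 -type f -name '*.fasta' 2>/dev/null | wc -l | tr -d ' '
}

# Report the count; return non-zero when there is nothing for sift4g to process.
check_fasta_dir() {
    local n
    n=$(count_fasta_files "$1")
    if [ "$n" -eq 0 ]; then
        echo "no .fasta files in $1; sift4g would have no queries" >&2
        return 1
    fi
    echo "$n query fasta file(s) in $1"
}
```

For the configuration in this thread, the call would be `check_fasta_dir /home/hrclndnn/SIFT/SIFT_databases/uamer/fasta`. An empty result points at the earlier fasta-generation step rather than at sift4g itself.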

pauline-ng commented 1 year ago

Did the test examples work for you?

Also, you don't have to worry about this warning: Use of uninitialized value $fasta_subseq in concatenation (.) or string at make-single-records-BIOPERL.pl l

hclendenin commented 1 year ago

Thank you for getting back to me so quickly.

I ran the candidatus test files previously and had thought it worked correctly due to the new directories and files that were generated, but I've found missing files upon closer inspection. When I reran the code to make that database, I got the following output:

converting gene format to use-able input
done converting gene format
making single records file
done making single records template
making noncoding records file
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
/home/hrclndnn/SIFT/sift4g/bin/sift4g -d /home/hrclndnn/SIFT/SIFT_databases/uamer/protein-db/uniref90.fasta -q /home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/all_prot.fasta --subst /home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/subst --out /home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/SIFT_predictions --sub-results
Checking query data and substitutions files
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error

I also set up and ran the homo sapiens test data with similar output:

converting gene format to use-able input
done converting gene format
making single records file
done making single records template
making noncoding records file
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
/home/hrclndnn/SIFT/sift4g/bin/sift4g -d /home/hrclndnn/SIFT/SIFT_databases/uamer/protein-db/uniref90.fasta -q ./test_files/homo_sapiens_small/all_prot.fasta --subst ./test_files/homo_sapiens_small/subst --out ./test_files/homo_sapiens_small/SIFT_predictions --sub-results
Checking query data and substitutions files
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error

pauline-ng commented 1 year ago

That's odd. The sample file should work.

What do the fasta names of your uniref90.fasta look like?

Can you do grep "^>" /home/hrclndnn/SIFT/SIFT_databases/uamer/protein-db/uniref90.fasta | head -20

and show me the result?

hclendenin commented 1 year ago

Here is the information you requested: grep "^>" /home/hrclndnn/SIFT/SIFT_databases/uamer/protein-db/uniref90.fasta | head -20

UniRef90_A0A5A9P0L4 peptidylprolyl isomerase n=1 Tax=Triplophysa tibetana TaxID=1572043 RepID=A0A5A9P0L4_9TELE
UniRef90_A0A410P257 Glycogen synthase n=2 Tax=Candidatus Velamenicoccus archaeovorus TaxID=1930593 RepID=A0A410P257_9BACT
UniRef90_A0A8J3NBY6 Uncharacterized protein n=2 Tax=Actinocatenispora rupis TaxID=519421 RepID=A0A8J3NBY6_9ACTN
UniRef90_A0A6B0RPA5 Coiled-coil domain-containing protein 141 n=6 Tax=Pecora TaxID=35500 RepID=A0A6B0RPA5_9CETA
UniRef90_A0A401TRQ8 Ig-like domain-containing protein (Fragment) n=2 Tax=Chiloscyllium TaxID=34767 RepID=A0A401TRQ8_CHIPU
UniRef90_A0A672ZWI7 Ig-like domain-containing protein n=2 Tax=Sphaeramia orbicularis TaxID=375764 RepID=A0A672ZWI7_9TELE
UniRef90_A0A6P7YNV3 Titin n=1 Tax=Microcaecilia unicolor TaxID=1415580 RepID=A0A6P7YNV3_9AMPH
UniRef90_A0A4U5TZD8 Titin n=1 Tax=Collichthys lucidus TaxID=240159 RepID=A0A4U5TZD8_COLLU
UniRef90_UPI00074FFD9C LOW QUALITY PROTEIN: titin n=1 Tax=Gekko japonicus TaxID=146911 RepID=UPI00074FFD9C
UniRef90_A0A6P8RG40 Titin isoform X1 n=3 Tax=Geotrypetes seraphini TaxID=260995 RepID=A0A6P8RG40_GEOSA
UniRef90_A0A091IBA0 Titin (Fragment) n=14 Tax=Archelosauria TaxID=1329799 RepID=A0A091IBA0_CALAN
UniRef90_UPI000D72069A LOW QUALITY PROTEIN: titin n=1 Tax=Pelodiscus sinensis TaxID=13735 RepID=UPI000D72069A
UniRef90_A0A8J1Y0H7 Ofus.G111214 protein n=2 Tax=Owenia fusiformis TaxID=6347 RepID=A0A8J1Y0H7_OWEFU
UniRef90_UPI00202B74A9 titin-like n=1 Tax=Stegostoma fasciatum TaxID=378071 RepID=UPI00202B74A9
UniRef90_UPI00202717A7 LOW QUALITY PROTEIN: titin n=1 Tax=Sphaerodactylus townsendi TaxID=933632 RepID=UPI00202717A7
UniRef90_A0A6P8QZN4 Titin isoform X2 n=1 Tax=Geotrypetes seraphini TaxID=260995 RepID=A0A6P8QZN4_GEOSA
UniRef90_A0A6P8R4E5 Titin isoform X3 n=1 Tax=Geotrypetes seraphini TaxID=260995 RepID=A0A6P8R4E5_GEOSA
UniRef90_A0A6P8RJ11 Titin isoform X4 n=2 Tax=Geotrypetes seraphini TaxID=260995 RepID=A0A6P8RJ11_GEOSA
UniRef90_A0A6P8RG43 Titin isoform X6 n=1 Tax=Geotrypetes seraphini TaxID=260995 RepID=A0A6P8RG43_GEOSA
UniRef90_A0A6P8QZP2 Titin isoform X7 n=1 Tax=Geotrypetes seraphini TaxID=260995 RepID=A0A6P8QZP2_GEOSA

pauline-ng commented 1 year ago

Can you confirm these start with ">"?

Also, can you remove the string UniRef90. Something like:

sed 's/UniRef90_//g' /home/hrclndnn/SIFT/SIFT_databases/uamer/protein-db/uniref90.fasta > uniref90_clean.fasta

and use this database instead. It may be that SIFT has a character limit for the title, and removing "UniRef90_" creates a short unique title.
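Editorial note: the suggestion above relies on the shortened titles staying unique. A small sketch of how that could be checked is below. It restricts the substitution to header lines, and it assumes the prefix contains no regular-expression metacharacters (true for `UniRef90_`). `strip_fasta_prefix` and `duplicate_ids` are hypothetical helper names.

```shell
# Remove PREFIX only where it directly follows ">" at the start of a header line.
# Assumes PREFIX has no characters that are special to sed (no ".", "/", "*", ...).
strip_fasta_prefix() {
    local prefix="$1" in="$2" out="$3"
    sed "s/^>${prefix}/>/" "$in" > "$out"
}

# Print every sequence ID (first word of a header line) that occurs more than once.
# Prints nothing when all IDs are unique.
duplicate_ids() {
    grep '^>' "$1" | awk '{print $1}' | sort | uniq -d
}
```

A possible use would be `strip_fasta_prefix UniRef90_ uniref90.fasta uniref90_clean.fasta` followed by `duplicate_ids uniref90_clean.fasta`; an empty result means no two headers share an ID. On a full UniRef90 download the sort step can take a while.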

hclendenin commented 1 year ago

Hello and thank you again.

The lines do start with the ">" character.

I removed the UniRef90 string as you suggested and resubmitted the jobs (with edits to the config files so that the path was to the new fasta with shorter names). They still failed to run to completion and, once again, produced the following messages:

converting gene format to use-able input
done converting gene format
making single records file
done making single records template
making noncoding records file
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
/home/hrclndnn/SIFT/sift4g/bin/sift4g -d /home/hrclndnn/SIFT/SIFT_databases/uamer/protein-db/uniref90_clean.fasta -q /home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/all_prot.fasta --subst /home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/subst --out /home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/SIFT_predictions --sub-results
Checking query data and substitutions files
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error

And:

converting gene format to use-able input
done converting gene format
making single records file
done making single records template
making noncoding records file
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
/home/hrclndnn/SIFT/sift4g/bin/sift4g -d /home/hrclndnn/SIFT/SIFT_databases/uamer/protein-db/uniref90_clean.fasta -q ./test_files/homo_sapiens_small/all_prot.fasta --subst ./test_files/homo_sapiens_small/subst --out ./test_files/homo_sapiens_small/SIFT_predictions --sub-results
Checking query data and substitutions files
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error

I also contacted my HPC coordinator with a request to look over how I set things up in my account-- does it seem like I may have made mistakes when setting up dependencies?

ghost commented 1 year ago

Memphis HPC system administrator here!

This seems to be an issue with compiling sift4g using gcc < 4.9 (our default is quite old). @hclendenin , I went into the sift4g directory and did the following:

module load gcc/8.2.0
make clean
make

Then I was able to run the following command from the scripts_to_build_SIFT_db directory (I terminated it because it was taking a while, but you might resubmit make_candidatus.sh to be sure):

perl make-SIFT-db-all.pl -config test_files/candidatus_carsonella_ruddii_pv_config.txt --ensembl_download

You will have to add the following command to your submission scripts to run the sift4g program (and these scripts to build the DBs):

module load gcc/8.2.0

Make sure you add that after the SBATCH lines, but before running the sift4g scripts (you can add it before or after the module load perl line). You can try and make the same gcc module modification to any other submissions you've tried and gotten the same error with.
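Editorial note: the fix above depends on which compiler is active when sift4g is built and run. The `std::regex_error` at startup is consistent with the incomplete `<regex>` support in libstdc++ releases before GCC 4.9. Below is a minimal sketch of a version check that could run before `make`. `version_ge` is a hypothetical helper, and the sketch assumes a `sort` that supports `-V` (GNU coreutils).

```shell
# version_ge A B: succeed when version A is greater than or equal to version B.
# Relies on GNU sort's -V (version sort): if B sorts first (or ties), then A >= B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required=4.9   # threshold named by the HPC administrator above
if command -v gcc >/dev/null 2>&1; then
    found=$(gcc -dumpversion)   # e.g. "4.8.5", or just "8" on newer releases
    if version_ge "$found" "$required"; then
        echo "gcc $found is new enough to rebuild sift4g"
    else
        echo "gcc $found is older than $required; load a newer compiler and rebuild" >&2
    fi
else
    echo "gcc not found on PATH" >&2
fi
```

On a cluster that uses environment modules, this check would go after the `module load gcc/8.2.0` line, so it reports the compiler that the job actually sees.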

hclendenin commented 1 year ago

Hello again, I had success running the homo sapiens test data after the changes my HPC admin mentioned above. The candidatus test data is coming up with a new error (which I can share, though I'm not sure how much of the log file I should share). I tried running the Ursus americanus data again and am still getting the same error message as before (copied and pasted in previous comments here).

I'm going to go ahead and share the candidatus log file in full (sorry if this is excessive):

downloading gene annotation
--2023-06-16 19:10:46-- ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/gtf//bacteria_11_collection/candidatus_carsonella_ruddii_pv/Candidatus_carsonella_ruddii_pv.ASM1036v1.34.gtf.gz
=> ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/gene-annotation-src/Candidatus_carsonella_ruddii_pv.ASM1036v1.34.gtf.gz.6’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.193.141
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.193.141|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.
==> PWD ... done.
==> TYPE I ... done.
==> CWD (1) /pub/bacteria/release-34/gtf//bacteria_11_collection/candidatus_carsonella_ruddii_pv ... done.
==> SIZE Candidatus_carsonella_ruddii_pv.ASM1036v1.34.gtf.gz ... 14570
==> PASV ... done.
==> RETR Candidatus_carsonella_ruddii_pv.ASM1036v1.34.gtf.gz ... done.
Length: 14570 (14K) (unauthoritative)

 0K .......... ....                                       100% 13.2M=0.001s

2023-06-16 19:10:48 (13.2 MB/s) - ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/gene-annotation-src/Candidatus_carsonella_ruddii_pv.ASM1036v1.34.gtf.gz.6’ saved [14570]

--2023-06-16 19:10:48-- ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/pep/Candidatus_carsonella_ruddii_pv.ASM1036v1.34.pep.all.fa.gz
=> ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/gene-annotation-src/Candidatus_carsonella_ruddii_pv.ASM1036v1.34.pep.all.fa.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.193.141
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.193.141|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.
==> PWD ... done.
==> TYPE I ... done.
==> CWD (1) /pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/pep ... done.
==> SIZE Candidatus_carsonella_ruddii_pv.ASM1036v1.34.pep.all.fa.gz ... done.
==> PASV ... done.
==> RETR Candidatus_carsonella_ruddii_pv.ASM1036v1.34.pep.all.fa.gz ...
No such file ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.34.pep.all.fa.gz’.

done downloading gene annotation
downloading fasta files
--2023-06-16 19:10:49-- ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/dna/
=> ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/.listing’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.193.141
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.193.141|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.
==> PWD ... done.
==> TYPE I ... done.
==> CWD (1) /pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/dna ... done.
==> PASV ... done.
==> LIST ... done.

 0K                                                         981K=0.001s

2023-06-16 19:10:50 (981 KB/s) - ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/.listing’ saved [1009]

Removed ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/.listing’.
Rejecting ‘CHECKSUMS’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.toplevel.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_rm.chromosome.Chromosome.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_rm.toplevel.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_sm.chromosome.Chromosome.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_sm.toplevel.fa.gz’.
Rejecting ‘README’.
--2023-06-16 19:10:50-- ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/dna/Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa.gz
=> ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa.gz’
==> CWD not required.
==> PASV ... done.
==> RETR Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa.gz ... done.
Length: 45216 (44K)

 0K .......... .......... .......... .......... ....      100%  177K=0.2s

2023-06-16 19:10:51 (177 KB/s) - ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa.gz’ saved [45216]

FINISHED --2023-06-16 19:10:51--
Total wall clock time: 1.9s
Downloaded: 1 files, 44K in 0.3s (176 KB/s)
--2023-06-16 19:10:51-- ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/dna/
=> ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/.listing’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.193.141
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.193.141|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.
==> PWD ... done.
==> TYPE I ... done.
==> CWD (1) /pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/dna ... done.
==> PASV ... done.
==> LIST ... done.

 0K                                                        78.3M=0s

2023-06-16 19:10:52 (78.3 MB/s) - ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/.listing’ saved [1009]

Removed ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/.listing’.
Rejecting ‘CHECKSUMS’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.toplevel.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_rm.chromosome.Chromosome.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_rm.toplevel.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_sm.chromosome.Chromosome.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_sm.toplevel.fa.gz’.
Rejecting ‘README’.
--2023-06-16 19:10:52-- ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/dna/
=> ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/index.html’
==> CWD not required.
==> SIZE ... done.
==> PASV ... done.
==> RETR ...
No such file ‘’.

--2023-06-16 19:10:53-- ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/dna/
=> ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/.listing’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.193.141
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.193.141|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.
==> PWD ... done.
==> TYPE I ... done.
==> CWD (1) /pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/dna ... done.
==> PASV ... done.
==> LIST ... done.

 0K                                                        87.2M=0s

2023-06-16 19:10:54 (87.2 MB/s) - ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/.listing’ saved [1009]

Removed ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/.listing’.
Rejecting ‘CHECKSUMS’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.toplevel.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_rm.chromosome.Chromosome.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_rm.toplevel.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_sm.chromosome.Chromosome.fa.gz’.
Rejecting ‘Candidatus_carsonella_ruddii_pv.ASM1036v1.dna_sm.toplevel.fa.gz’.
Rejecting ‘README’.
--2023-06-16 19:10:54-- ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/dna/
=> ‘/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/index.html’
==> CWD not required.
==> SIZE ... done.
==> PASV ... done.
==> RETR ...
No such file ‘’.

done downloading DNA fasta sequencesdownload dbSNP files
Use of uninitialized value $src_site in concatenation (.) or string at download-dbSNP-files.pl line 55.
Use of uninitialized value $src_site in concatenation (.) or string at download-dbSNP-files.pl line 60.
wget: missing URL
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.
Use of uninitialized value $src_site in concatenation (.) or string at download-dbSNP-files.pl line 64.
converting gene format to use-able input
done converting gene format
gzip: /home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa already exists; not overwritten
DNA files do not exist or did not unzip properly

pauline-ng commented 1 year ago

I can't tell from this what the error is.

Can you check you're using full paths in your config file? (Not relative paths)
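Editorial note: a quick way to run this check is sketched below. It reads a `KEY=VALUE` config file and prints every listed key whose value does not begin with `/`. `relative_path_keys` is a hypothetical helper name. When a key appears more than once, the sketch inspects the last occurrence; that is an assumption about how the build scripts resolve duplicates, not something confirmed in this thread.

```shell
# relative_path_keys CONFIG KEY...
# Print each KEY whose value in CONFIG is missing, empty, or not an absolute path.
relative_path_keys() {
    local config="$1"
    shift
    local key value
    for key in "$@"; do
        # Take the last KEY=... line and keep everything after the first "=".
        value=$(grep "^${key}=" "$config" | tail -n 1 | cut -d= -f2-)
        case "$value" in
            /*) ;;                # absolute path: nothing to report
            *)  echo "$key" ;;    # relative, empty, or not set
        esac
    done
}
```

For the config files in this thread, a call such as `relative_path_keys uamer_config.txt PARENT_DIR SIFT4G_PATH PROTEIN_DB` would print nothing if all three are full paths. Sub-directory keys such as `FASTA_DIR` are names relative to `PARENT_DIR` in the sample configs, so they are left out of the check.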

hclendenin commented 1 year ago

Thanks again. I'm using full paths in the config files.

pauline-ng commented 1 year ago

Check this file is actually DNA:

/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa

it's the last line in the message you sent me

hclendenin commented 1 year ago

It looks like DNA in a fasta to me. Here is the head & tail output (*I added a " before the > in the first line of the head output to prevent auto-indenting in this comment):

head Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa
" >Chromosome dna:chromosome chromosome:ASM1036v1:Chromosome:1:159662:1 REF
ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATGGTACGTTATGTGTTATAAGA
ATTTCTGGAAAAAATGTAAAATTTTTAATACAAAAAATTGTAAAAAAAAATATAAAAGAA
AAAATAGCTACTTTTTCTAAATTATTTTTAGATAAAGAATGTGTAGATTATGCAATGATT
ATTTTTTTTAAAAAACCAAATACGTTCACTGGAGAAGATATAATCGAATTTCATATTCAC
AATAATGAAACTATTGTAAAAAAAATAATTAATTATTTATTATTAAATAAAGCAAGATTT
GCAAAAGCTGGCGAATTTTTAGAAAGACGATATTTAAATGGAAAAATTTCTTTAATAGAA
TGCGAATTAATAAATAATAAAATTTTATATGATAATGAAAATATGTTTCAATTAACAAAA
AATTCTGAAAAAAAAATATTTTTATGTATAATTAAAAATTTAAAATTTAAAATAAATTCT
TTAATAATTTGTATTGAAATCGCAAATTTTAATTTTAGTTTTTTTTTTTTTAATGATTTT

tail Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa
TCTTCTGTTAATTTTAATTTTAAAAAAAAGAATGTTTTCTTATTAATAATTATTGATTTT
AAAACTAAAATATTTTCGTTATAATTTATTAATAAAATGCTATTTTTTAAAAAAAAAAGT
TCAATTAAAATTGTTAGTTCTAATAACTTTATATTATTATATTCGTATATTAACAATTTA
TTTAAAATTAATTTTTTGATAAAAAAATAAAAATTTTTATAATTAAGTTTTAAAAAAAAA
TATTTATTATTGTTATAAAAAGTAATCACTTGATTCTTATTTTTCAAAATTAACTTATTC
TTATTAATAAAATAATTTACAAAATTATTAAAGTTAAAAAAAGTTAATATTTTTTTTAAA
TCGATTATAAACAAACTATTTATTTCAGTAATAAAATTTTGATTAAATAAATAAATTATA
TTTTTAAATTTTGTAAAAAATAAAATTTTTTTTTTATAAATTTCAATATAAATTTTTTTG
TTACAGGAGTTTGATAAAAATAAAATTTTATTTAAAAATAATAAATTGTTTAAATTAACC
AT

pauline-ng commented 1 year ago

Can you

grep "^>" /home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src/Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa

Also, send me first 10 lines of your gene file (.gtf or .gff)

I'm trying to check if the chromosome names match between the genome file and the gene file.
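Editorial note: the comparison described above can be sketched in a few lines of shell. The helper names (`fasta_seq_names`, `gtf_seq_names`, `names_missing_from_fasta`) are hypothetical. The sketch assumes the annotation file is tab-separated with `#`-prefixed header lines, as in standard GTF, and that the sequence name in the FASTA file is the first word after `>`.

```shell
# Sequence names in a FASTA file: first word of each header line, without ">".
fasta_seq_names() {
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort -u
}

# Sequence names in a GTF file: first tab-separated column of non-comment lines.
gtf_seq_names() {
    grep -v '^#' "$1" | cut -f1 | sort -u
}

# names_missing_from_fasta FASTA GTF
# Print names used in the GTF that have no matching sequence in the FASTA.
names_missing_from_fasta() {
    local tmpdir
    tmpdir=$(mktemp -d)
    fasta_seq_names "$1" > "$tmpdir/fa"
    gtf_seq_names "$2" > "$tmpdir/gtf"
    comm -13 "$tmpdir/fa" "$tmpdir/gtf"   # lines that appear only in the GTF list
    rm -rf "$tmpdir"
}
```

An empty result means every sequence name in the annotation has a same-named record in the genome file. Any name printed is a candidate mismatch such as `chr1` versus `1`.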

hclendenin commented 1 year ago

This is the output to that grep command: ">Chromosome dna:chromosome chromosome:ASM1036v1:Chromosome:1:159662:1 REF"

And the gtf head:

!genome-build ASM1036v1
!genome-version ASM1036v1
!genome-date 2008-12
!genome-build-accession GCA_000010365.1
!genebuild-last-updated 2008-12
Chromosome ena gene 1 1317 . + . gene_id "CRP_001"; gene_source "ena"; gene_biotype "protein_coding";
Chromosome ena transcript 1 1317 . + . gene_id "CRP_001"; transcript_id "BAF35032"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Chromosome ena exon 1 1317 . + . gene_id "CRP_001"; transcript_id "BAF35032"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "BAF35032-1";
Chromosome ena CDS 1 1314 . + 0 gene_id "CRP_001"; transcript_id "BAF35032"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "BAF35032";
Chromosome ena start_codon 1 3 . + 0 gene_id "CRP_001"; transcript_id "BAF35032"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";

pauline-ng commented 1 year ago

Can you tell me what files are in CHR_DOWNLOAD_DEST?

ls CHR_DOWNLOAD_DEST

And also please paste your config file here

hclendenin commented 1 year ago

Thank you once again for your time.

My CHR_DOWNLOAD_DEST is set to /home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv/chr-src

ls chr-src
Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa
directory.index
Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa.gz

This is my config file for candidatus:

GENE_DOWNLOAD_SITE=ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/gtf//bacteria_11_collection/candidatus_carsonella_ruddii_pv/Candidatus_carsonella_ruddii_pv.ASM1036v1.34.gtf.gz
PEP_FILE=ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/pep/Candidatus_carsonella_ruddii_pv.ASM1036v1.34.pep.all.fa.gz
CHR_DOWNLOAD_SITE=ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/fasta//bacteria_11_collection/candidatus_carsonella_ruddii_pv/dna/

GENETIC_CODE_TABLE=11
GENETIC_CODE_TABLENAME=11
MITO_GENETIC_CODE_TABLE=0
MITO_GENETIC_CODE_TABLENAME=Unspecified

PARENT_DIR=/home/hrclndnn/SIFT/SIFT_databases/candidatus_carsonella_ruddii_pv
ORG=candidatus_carsonella_ruddii_pv
ORG_VERSION=ASM1036v1.34

Running SIFT 4G

SIFT4G_PATH=/home/hrclndnn/SIFT/sift4g/bin/sift4g

PROTEIN_DB=/home/hrclndnn/SIFT/SIFT_databases/uniprot_sprot.fasta

PROTEIN_DB=/home/hrclndnn/SIFT/SIFT_databases/uamer/protein-db/uniref90_clean.fasta

Sub-directories, don't need to change

LOGFILE=Log.txt
ZLOGFILE=Log2.txt
GENE_DOWNLOAD_DEST=gene-annotation-src
CHR_DOWNLOAD_DEST=chr-src
FASTA_DIR=fasta
SUBST_DIR=subst
SIFT_SCORE_DIR=SIFT_predictions
SINGLE_REC_BY_CHR_DIR=singleRecords/
SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores
DBSNP_DIR=dbSNP

Doesn't need to change

FASTA_LOG=fasta.log
INVALID_LOG=invalid.log
PEPTIDE_LOG=peptide.log
ENS_PATTERN=ENS
SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord

pauline-ng commented 1 year ago

I just released a Dockerfile to help users with installation problems. Please see if this helps.