pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0
24 stars 7 forks source link

Use of uninitialized value in concatenation (.) or string at make-single-records-BIOPERL.pl line 301 #8

Closed pauline-ng closed 5 years ago

pauline-ng commented 5 years ago

@pauline-ng Hi Pauline, I encountered similar problem like Sian. A big log file with repetitive warnings like "Use of uninitialized value in concatenation (.) or string at make-single-records-BIOPERL.pl line 301."

All input data are from ensemblplant(ftp://ftp.ensemblgenomes.org/pub/plants/release-42/gtf/triticum_aestivum/ and ftp://ftp.ensemblgenomes.org/pub/plants/release-42/fasta/triticum_aestivum/dna/)

Any advice?

Thanks, Lipeng

_Originally posted by @LipengKang in https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB/issues/7#issuecomment-475228113_

pauline-ng commented 5 years ago

@LipengKang

1) When you run the sample files, are SIFT4G databases created correctly? 2) When you look in the fasta directory for Triticum aestivum, do you see protein sequences?

Thank you, Pauline

LipengKang commented 5 years ago

@pauline-ng 1.Sample file worked well! 2.proteinDB was download(uniref90.fasta) which also used in sample test pep.fa (chrftp://ftp.ensemblgenomes.org/pub/plants/release-42/fasta/triticum_aestivum/pep/) This is my file tree: sift/ |-- chr-src | -- ensemblWheatv1.0.fa.gz |-- dbSNP -- gene-annotation-src |-- Triticum_aestivum.IWGSC.42.gtf.gz `-- Triticum_aestivum.IWGSC.pep.all.fa.gz

I test a suset gtf, it seems errors was always found in make-single-records-BIOPERL.pl line 270, 276, 301, 547; generate-fasta-subst-files-BIOPERL.pl line 615, 620, 888 Thank you, Lipeng

pauline-ng commented 5 years ago

Hi Lipeng,

For item 2, there should be a directory named fasta/
pep.fa downloaded from Ensembl is just used for checking.

Can you send me your config file?

(still on vacation, but this should make it faster to debug when I get back.)

Thanks, Pauline

LipengKang commented 5 years ago

Hi Pauline, For item 2 , the fasta/ directory is empty.

Here are the config file and log file. I also make a new GTF which contain gene_biotype"protein_coding" and labeled as exon, CDS, start_codon and stop_codon for sift4g. But failed to call sift with similar error. This is partial example. chr1A IWGSC_v1.1_201706 exon 40098 40731 . - . gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 exon 58474 58897 . - . gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 exon 70089 70338 . - . gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 CDS 58511 58768 . - 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 start_codon 58766 58768 . - 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 stop_codon 58508 58510 . - 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 exon 70239 70650 . + . gene_biotype "protein_coding";transcript_id "TraesCS1A02G000200.1"; gene_id "TraesCS1A02G000200"; gene_name "TraesCS1A02G000200"; chr1A IWGSC_v1.1_201706 exon 88242 89245 . + . gene_biotype "protein_coding";transcript_id "TraesCS1A02G000200.1"; gene_id "TraesCS1A02G000200"; gene_name "TraesCS1A02G000200"; chr1A IWGSC_v1.1_201706 CDS 70239 70553 . + 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000200.1"; gene_id "TraesCS1A02G000200"; gene_name "TraesCS1A02G000200"; chr1A IWGSC_v1.1_201706 start_codon 70239 70241 . + 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000200.1"; gene_id "TraesCS1A02G000200"; gene_name "TraesCS1A02G000200"; chr1A IWGSC_v1.1_201706 stop_codon 70554 70556 . + 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000200.1"; gene_id "TraesCS1A02G000200"; gene_name "TraesCS1A02G000200"; config.txt log.txt

Thanks lipeng

pauline-ng commented 5 years ago

Hi Lipeng,

I think your config file is missing a few lines. Since you are running it from Ensembl, you'll need to set the paths in the config file. Specifically, please set GENE_DOWNLOAD_SITE, PEP_FILE, CHR_DOWNLOAD_SITE, DBSNP_ORGANISM_DOWNLOAD_SITE, DBSNP_VCF_FILE

Here's the example from the saccharomyces_cerevisiae-template.txt file.

GENE_DOWNLOAD_SITE=ftp://ftp.ensembl.org/pub/release-74//gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.74.gtf.gz PEP_FILE=ftp://ftp.ensembl.org/pub/release-74//fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.EF4.74.pep.all.ga.gz CHR_DOWNLOAD_SITE=ftp://ftp.ensembl.org/pub/release-74//fasta/saccharomyces_cerevisiae/dna/ DBSNP_ORGANISM_DOWNLOAD_SITE=ftp://ftp.ensembl.org/pub/release-74///variation/vcf//saccharomyces_cerevisiae/ DBSNP_VCF_FILE=Saccharomyces_cerevisiae.vcf.gz

so it'd be GENE_DOWNLOAD_SITE=ftp://ftp.ensemblgenomes.org/pub/release-42/plants/gtf/triticum_aestivum/Triticum_aestivum.IWGSC.42.gtf.gz CHR_DOWNLOAD_SITE=ftp://ftp.ensemblgenomes.org/pub/release-42/plants/fasta/triticum_aestivum/dna/ etc.....

Thanks, Pauline

LipengKang commented 5 years ago

Hi, Pauline It also didn't work well but could find many file in fasta directory this time. The similar error in log file. Here list the errors. Use of uninitialized value $aa in string eq at make-single-records-BIOPERL.pl line 270 Use of uninitialized value $aa in concatenation (.) or string at make-single-records-BIOPERL.pl line 276 Use of uninitialized value $mutated_aa in string eq at make-single-records-BIOPERL.pl line 547. Use of uninitialized value in concatenation (.) or string at make-single-records-BIOPERL.pl line 301. Use of uninitialized value $aa2 in string eq at generate-fasta-subst-files-BIOPERL.pl line 615. Use of uninitialized value $aa2 in string ne at generate-fasta-subst-files-BIOPERL.pl line 620. Use of uninitialized value $mutated_aa in string eq at generate-fasta-subst-files-BIOPERL.pl line 888.

My config-file: GENE_DOWNLOAD_SITE=ftp://ftp.ensemblgenomes.org/pub/plants/release-42/gtf/triticum_aestivum/Triticum_aestivum.IWGSC.42.gtf.gz PEP_FILE=ftp://ftp.ensemblgenomes.org/pub/plants/release-42/fasta/triticum_aestivum/pep/Triticum_aestivum.IWGSC.pep.all.fa.gz CHR_DOWNLOAD_SITE=ftp://ftp.ensemblgenomes.org/pub/plants/release-42/fasta/triticum_aestivum/dna/ GENETIC_CODE_TABLE=1 GENETIC_CODE_TABLENAME=Standard MITO_GENETIC_CODE_TABLE=0 MITO_GENETIC_CODE_TABLENAME=Unspecified PARENT_DIR=./test_files/wheat ORG=Tritivum_aestivum ORG_VERSION=v1.0

Running SIFT 4G

SIFT4G_PATH=/data1/home/lipeng/sift4g/bin/sift4g PROTEIN_DB=/data1/publicData/protein_db/uniprot90_Jul2018/uniref90.fasta COMPUTER=GIS-KATNISS

Thanks for your reply! lipeng

pauline-ng commented 5 years ago

There will be some warnings about uninitialized values if SIFT can't translate the DNA to protein properly (like if there are X's in the genomic file.) You can ignore them as long as the most of your proteins are in the database and have predictions.

Were predictions made? Are there files in the folder SIFT_predictions ?

ls SIFT_predictions/* Inspect the SIFT prediction files

Was the database made?

(Please follow https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB/

LipengKang commented 5 years ago

Hi, pauline Sorry late to reply, because it runned for two days. The pipeline stopped when " Selecting alignments with median threshold". SIFT_prediction consits of a file named alignments.txt but nothing in / . More tests are doing. Once complete or find other questions, I will consult you.

Thanks, Lipeng

LipengKang commented 5 years ago

Hi,pauline I checked CDS.fa and as you say, many NNNN in sequence leads to translate DNA to protein improperly,such as NNN sequence to amino acid X. I ignored these errors. However, three times stopped in the last two step. Here are the last lines in the log file

` Aligning queries with candidate sequences

Selecting alignments with median threshold: 2.75 `

I don't know why and nothing in /. But Former steps seems ok. fasta/ and SIFT_prediction directory are unempty. How can I continue the step "Selecting alignments with median threshold"?

Thanks, lipeng

pauline-ng commented 5 years ago

Hi Lipeng,

At this point, it sounds like there me a problem with memory or SIFT 4G (in which case I'd divert you to Robert Vaser's repository). Before I put you in contact with Robert --

1) How many protein sequences do you have? (Please run command below)

cat all_prot.fasta | grep ">" | wc -l

2) How many SIFT prediction files do you have? (Please run command below)

ls SIFT_prediction/*prediction | wc -l

3) Can you attach and send the all_prot.fasta file.

Thank you, Pauline

LipengKang commented 5 years ago

Hi, pauline

  1. 132954 protein sequences in all_prot.fasta
  2. no *prediction files in SIFT_prediction. But only one 2.5G file named alignments.txt inside. By the way, compared with homo example, it also has no ENS_peptide_check.txt and tmp directory
  3. here are all_prot.fasta

all_prot.fasta.gz

Thank you lipeng

pauline-ng commented 5 years ago

Hi Lipeng,

Sorry, I got caught up with other things. This is out of my scope. Please contact @rvaser --he is the creator and first author of SIFT4G (there is a github repository called sift4g which he owns)

Thanks, Pauline

rvaser commented 5 years ago

Hi Lipeng, please open an issue at https://github.com/rvaser/sift4g where we can continue the conversation regarding missing SIFT4G predictions.

Best regards, Robert

LipengKang commented 5 years ago

Thanks to pauline and Robert. I will continue this issue to github sift4g. This issue can close here.