Closed pauline-ng closed 5 years ago
@LipengKang
1) When you run the sample files, are SIFT4G databases created correctly? 2) When you look in the fasta directory for Triticum aestivum, do you see protein sequences?
Thank you, Pauline
@pauline-ng
1.Sample file worked well!
2.proteinDB was download(uniref90.fasta) which also used in sample test
pep.fa (chrftp://ftp.ensemblgenomes.org/pub/plants/release-42/fasta/triticum_aestivum/pep/)
This is my file tree:
sift/
|-- chr-src
| -- ensemblWheatv1.0.fa.gz |-- dbSNP
-- gene-annotation-src
|-- Triticum_aestivum.IWGSC.42.gtf.gz
`-- Triticum_aestivum.IWGSC.pep.all.fa.gz
I test a suset gtf, it seems errors was always found in make-single-records-BIOPERL.pl line 270, 276, 301, 547; generate-fasta-subst-files-BIOPERL.pl line 615, 620, 888 Thank you, Lipeng
Hi Lipeng,
For item 2, there should be a directory named fasta/
pep.fa downloaded from Ensembl is just used for checking.
Can you send me your config file?
(still on vacation, but this should make it faster to debug when I get back.)
Thanks, Pauline
Hi Pauline, For item 2 , the fasta/ directory is empty.
Here are the config file and log file.
I also make a new GTF which contain gene_biotype"protein_coding" and labeled as exon, CDS, start_codon and stop_codon for sift4g. But failed to call sift with similar error. This is partial example.
chr1A IWGSC_v1.1_201706 exon 40098 40731 . - . gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 exon 58474 58897 . - . gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 exon 70089 70338 . - . gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 CDS 58511 58768 . - 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 start_codon 58766 58768 . - 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 stop_codon 58508 58510 . - 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000100.1"; gene_id "TraesCS1A02G000100"; gene_name "TraesCS1A02G000100"; chr1A IWGSC_v1.1_201706 exon 70239 70650 . + . gene_biotype "protein_coding";transcript_id "TraesCS1A02G000200.1"; gene_id "TraesCS1A02G000200"; gene_name "TraesCS1A02G000200"; chr1A IWGSC_v1.1_201706 exon 88242 89245 . + . gene_biotype "protein_coding";transcript_id "TraesCS1A02G000200.1"; gene_id "TraesCS1A02G000200"; gene_name "TraesCS1A02G000200"; chr1A IWGSC_v1.1_201706 CDS 70239 70553 . + 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000200.1"; gene_id "TraesCS1A02G000200"; gene_name "TraesCS1A02G000200"; chr1A IWGSC_v1.1_201706 start_codon 70239 70241 . + 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000200.1"; gene_id "TraesCS1A02G000200"; gene_name "TraesCS1A02G000200"; chr1A IWGSC_v1.1_201706 stop_codon 70554 70556 . + 0 gene_biotype "protein_coding";transcript_id "TraesCS1A02G000200.1"; gene_id "TraesCS1A02G000200"; gene_name "TraesCS1A02G000200";
config.txt
log.txt
Thanks lipeng
Hi Lipeng,
I think your config file is missing a few lines. Since you are running it from Ensembl, you'll need to set the paths in the config file. Specifically, please set GENE_DOWNLOAD_SITE, PEP_FILE, CHR_DOWNLOAD_SITE, DBSNP_ORGANISM_DOWNLOAD_SITE, DBSNP_VCF_FILE
Here's the example from the saccharomyces_cerevisiae-template.txt file.
GENE_DOWNLOAD_SITE=ftp://ftp.ensembl.org/pub/release-74//gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.EF4.74.gtf.gz PEP_FILE=ftp://ftp.ensembl.org/pub/release-74//fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.EF4.74.pep.all.ga.gz CHR_DOWNLOAD_SITE=ftp://ftp.ensembl.org/pub/release-74//fasta/saccharomyces_cerevisiae/dna/ DBSNP_ORGANISM_DOWNLOAD_SITE=ftp://ftp.ensembl.org/pub/release-74///variation/vcf//saccharomyces_cerevisiae/ DBSNP_VCF_FILE=Saccharomyces_cerevisiae.vcf.gz
so it'd be GENE_DOWNLOAD_SITE=ftp://ftp.ensemblgenomes.org/pub/release-42/plants/gtf/triticum_aestivum/Triticum_aestivum.IWGSC.42.gtf.gz CHR_DOWNLOAD_SITE=ftp://ftp.ensemblgenomes.org/pub/release-42/plants/fasta/triticum_aestivum/dna/ etc.....
Thanks, Pauline
Hi, Pauline It also didn't work well but could find many file in fasta directory this time. The similar error in log file. Here list the errors. Use of uninitialized value $aa in string eq at make-single-records-BIOPERL.pl line 270 Use of uninitialized value $aa in concatenation (.) or string at make-single-records-BIOPERL.pl line 276 Use of uninitialized value $mutated_aa in string eq at make-single-records-BIOPERL.pl line 547. Use of uninitialized value in concatenation (.) or string at make-single-records-BIOPERL.pl line 301. Use of uninitialized value $aa2 in string eq at generate-fasta-subst-files-BIOPERL.pl line 615. Use of uninitialized value $aa2 in string ne at generate-fasta-subst-files-BIOPERL.pl line 620. Use of uninitialized value $mutated_aa in string eq at generate-fasta-subst-files-BIOPERL.pl line 888.
My config-file: GENE_DOWNLOAD_SITE=ftp://ftp.ensemblgenomes.org/pub/plants/release-42/gtf/triticum_aestivum/Triticum_aestivum.IWGSC.42.gtf.gz PEP_FILE=ftp://ftp.ensemblgenomes.org/pub/plants/release-42/fasta/triticum_aestivum/pep/Triticum_aestivum.IWGSC.pep.all.fa.gz CHR_DOWNLOAD_SITE=ftp://ftp.ensemblgenomes.org/pub/plants/release-42/fasta/triticum_aestivum/dna/ GENETIC_CODE_TABLE=1 GENETIC_CODE_TABLENAME=Standard MITO_GENETIC_CODE_TABLE=0 MITO_GENETIC_CODE_TABLENAME=Unspecified PARENT_DIR=./test_files/wheat ORG=Tritivum_aestivum ORG_VERSION=v1.0
SIFT4G_PATH=/data1/home/lipeng/sift4g/bin/sift4g PROTEIN_DB=/data1/publicData/protein_db/uniprot90_Jul2018/uniref90.fasta COMPUTER=GIS-KATNISS
Thanks for your reply! lipeng
There will be some warnings about uninitialized values if SIFT can't translate the DNA to protein properly (like if there are X's in the genomic file.) You can ignore them as long as the most of your proteins are in the database and have predictions.
Were predictions made? Are there files in the folder SIFT_predictions ?
ls SIFT_predictions/* Inspect the SIFT prediction files
Was the database made?
(Please follow https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB/
Check the database
Monitoring the Database Creation process )
Hi, pauline
Sorry late to reply, because it runned for two days.
The pipeline stopped when " Selecting alignments with median threshold". SIFT_prediction consits of a file named alignments.txt but nothing in
Thanks, Lipeng
Hi,pauline I checked CDS.fa and as you say, many NNNN in sequence leads to translate DNA to protein improperly,such as NNN sequence to amino acid X. I ignored these errors. However, three times stopped in the last two step. Here are the last lines in the log file
` Aligning queries with candidate sequences
Selecting alignments with median threshold: 2.75 `
I don't know why and nothing in /
Thanks, lipeng
Hi Lipeng,
At this point, it sounds like there me a problem with memory or SIFT 4G (in which case I'd divert you to Robert Vaser's repository). Before I put you in contact with Robert --
1) How many protein sequences do you have? (Please run command below)
cat all_prot.fasta | grep ">" | wc -l
2) How many SIFT prediction files do you have? (Please run command below)
ls SIFT_prediction/*prediction | wc -l
3) Can you attach and send the all_prot.fasta file.
Thank you, Pauline
Hi, pauline
Thank you lipeng
Hi Lipeng,
Sorry, I got caught up with other things. This is out of my scope. Please contact @rvaser --he is the creator and first author of SIFT4G (there is a github repository called sift4g which he owns)
Thanks, Pauline
Hi Lipeng, please open an issue at https://github.com/rvaser/sift4g where we can continue the conversation regarding missing SIFT4G predictions.
Best regards, Robert
Thanks to pauline and Robert. I will continue this issue to github sift4g. This issue can close here.
@pauline-ng Hi Pauline, I encountered similar problem like Sian. A big log file with repetitive warnings like "Use of uninitialized value in concatenation (.) or string at make-single-records-BIOPERL.pl line 301."
All input data are from ensemblplant(ftp://ftp.ensemblgenomes.org/pub/plants/release-42/gtf/triticum_aestivum/ and ftp://ftp.ensemblgenomes.org/pub/plants/release-42/fasta/triticum_aestivum/dna/)
Any advice?
Thanks, Lipeng
_Originally posted by @LipengKang in https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB/issues/7#issuecomment-475228113_