pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0
22 stars 7 forks source link

Is the database well constructed? Got #27

Closed yywyaoyaowu closed 3 years ago

yywyaoyaowu commented 3 years ago

Hi SIFT4G_Create_Genomic_DB team.

When I construct the database, I met warning and error. I have checked the data as 'Check the Database pipeline'. But I'm not sure if the database is well constructed. I go to // and check the data, zcat .gz | more , zcat .gz | grep CDS | more , CHECK_GENES.LOG seem are all well produced. Then I checked the pre-files: *_ls -lt /singleRecords_with_scores/ no files_* ls -lt /singleRecords/ has files `ls fasta/ has files SIFT 4G Algorithm is running ls SIFT_predictions/ with files

is the 'singleRecords_with_scores' affect the database construction much???? Thanks very much! The warning and error I got is :

  1. 'Use of uninitialized value $fasta_subseq in concatenation (.) or string at make-single-records-BIOPERL.pl line 210, line 2132.'

  2. Argument "" isn't numeric in numeric eq (==) at generate-fasta-subst-files-BIOPERL.pl line 465, line 265.

  3. Unable to read from /workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26/singleRecords/chr01.singleRecords_noncoding File "make_regions_file.py", line 61 print 'check_SIFTDB.py ' SyntaxError: Missing parentheses in call to 'print'. Did you mean print('check_SIFTDB.py ')?

  4. cat: '/workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26/singleRecords/contig_1381:::fragment_1:::debris.singleRecords': Unable to read from /workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26/singleRecords/contig_1328.singleRecords_noncoding

  5. SyntaxError: Missing parentheses in call to 'print'. Did you mean print('check_SIFTDB.py ')? rm: cannot remove '/workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_E4_63/singleRecords_with_scores/contig_977:::debris_scores.Srecords': No such file or d checking the databases File "check_SIFTDB.py", line 44 if predicted_on > 0: ^ TabError: inconsistent use of tabs and spaces in indentation File "check_SIFTDB.py", line 44 if predicted_on > 0:

  6. TabError: inconsistent use of tabs and spaces in indentation zipping up /workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_E4_63/chr-src/* All done!

yywyaoyaowu commented 3 years ago

In the 'singleRecords' folder, I have file 'chr01.invalidRecords'; 'chr01.singleRecords', 'chr01.singleRecords_proteins.fa'. all with infor. With a empty file name as 'chr01.singleRecords_noncoding.with_dbSNPid'. And don't have file 'chr01.singleRecords_noncoding'. Is this why I can't run 'check_SIFTDB.py' and 'make_regions_file.py' sucessfully?

yywyaoyaowu commented 3 years ago

This the 'CHECK_GENES.LOG', the percentages are high for the last 3 different columns.

Chr Genes with SIFT Scores Pos with SIFT scores Pos with Confident Scores chr01 97 (4474/4598) 99 (11831765/11930057) 83(9784138/11831765) chr02 97 (3154/3253) 99 (8303321/8378083) 83(6886635/8303321) chr03 96 (2827/2930) 99 (7308413/7397183) 80(5876341/7308413) chr04 97 (2803/2880) 99 (6856636/6915256) 82(5593673/6856636) chr05 97 (3393/3484) 99 (9083595/9154269) 85(7732478/9083595) chr06 97 (2548/2624) 99 (6482815/6548980) 81(5250565/6482815) chr07 97 (2576/2648) 99 (6725641/6784461) 84(5621351/6725641) chr08 98 (3261/3336) 99 (8151815/8210929) 80(6493946/8151815) chr09 97 (2486/2566) 99 (6556604/6622526) 84(5477280/6556604) chr10 97 (2424/2501) 99 (6157096/6221446) 81(4980700/6157096) chr11 98 (2299/2357) 99 (6169867/6214210) 81(5006894/6169867) chr12 100 (2512/2524) 100 (7297827/7307424) 89(6516516/7297827) contig_1011 0 (0/0) 0 (0/0) 0(0/0) contig_1013:::fragment_1:::debris 0 (0/0) 0 (0/0) 0(0/0) contig_1079:::debris 0 (0/0) 0 (0/0) 0(0/0) contig_1088:::fragment_1:::debris 0 (0/0) 0 (0/0) 0(0/0) contig_10:::debris 0 (0/0) 0 (0/0) 0(0/0) contig_1133:::fragment_2:::debris 0 (0/0) 0 (0/0) 0(0/0)

pauline-ng commented 3 years ago

Hi,

The scripts for SIFT4G were written for python2 and are not compatible with python3. Please use python2.

Thanks, Pauline

yywyaoyaowu commented 3 years ago

Do I need to re-construct the database?

pauline-ng commented 3 years ago

No need to re-construct the database. Just run check_SIFTDB.py alone, it generates database stats to make sure it looks right.

yywyaoyaowu commented 3 years ago

what's the input parameter for check_SIFTDB.py?

pauline-ng commented 3 years ago

make a perl script for this check_SIFTDB_only.pl

In this file:

use strict;
use File::Basename;
use Cwd qw(abs_path);
my $directory_of_script = dirname(abs_path(__FILE__));
require $directory_of_script . '/common-utils.pl';

my $metafile=  # SET PATH TO YOUR METAFILE HERE
my $meta_href = readMeta($metafile);
my %meta_hash = %{$meta_href};

my @chromosomes =  getChr ($meta_hash{"PARENT_DIR"} . "/" . $meta_hash{"GENE_DOWNLOAD_DEST"});

foreach my $chr (@chromosomes) {

        my $db_file = $meta_hash{"PARENT_DIR"} . "/" . $meta_hash{"SINGLE_REC_WITH_SIFTSCORE_DIR"} . "/" . $chr .  "_scores.Srecords.with_dbSNPid.sorted";
        my $check_outfile = $meta_hash{"PARENT_DIR"}  .  "/" . $meta_hash{"ORG_VERSION"} . "/" . $chr .  "_SIFTDB_stats.txt";
        `python2 check_SIFTDB.py $db_file $check_outfile`;
}`

then run perl check_SIFTDB_only.pl

yywyaoyaowu commented 3 years ago

my PARENT_DIR=/workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26 When I set my $directory_of_script =/workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB my $metafile=Potato_A6_26/Potato_A6_26 and run perl check_SIFTDB_only.pl ,it report syntax error at check_SIFTDB_only.pl line 5, near "SIFT4G_Create_Genomic_DB # dirname(abs_path(FILE)); require " Global symbol "$directory_of_script" requires explicit package name (did you forget to declare "my $directory_of_script"?) at check_SIFTDB_only.pl line 5. syntax error at check_SIFTDB_only.pl line 8, near "Potato_A6_26 # SET PATH TO YOUR METAFILE HERE my "

pauline-ng commented 3 years ago

Please paste your entire code in code format here. There is no $directory_of_script -- you should be just setting "$metafile", which contains the directories.

yywyaoyaowu commented 3 years ago

This is my code use strict; use File::Basename; use Cwd qw(abs_path); my $directory_of_script =/workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB # dirname(abs_path(FILE)); require $directory_of_script . '/common-utils.pl';

my $metafile=Potato_A6_26/Potato_A6_26 # SET PATH TO YOUR METAFILE HERE my $meta_href = readMeta($metafile); my %meta_hash = %{$meta_href};

my @chromosomes = getChr ($meta_hash{"PARENT_DIR"} . "/" . $meta_hash{"GENE_DOWNLOAD_DEST"});

foreach my $chr (@chromosomes) {

    my $db_file = $meta_hash{"PARENT_DIR"} . "/" . $meta_hash{"SINGLE_REC_WITH_SIFTSCORE_DIR"} . "/" . $chr .  "_scores.Srecords.with_dbSNPid.sorted";
    my $check_outfile = $meta_hash{"PARENT_DIR"}  .  "/" . $meta_hash{"ORG_VERSION"} . "/" . $chr .  "_SIFTDB_stats.txt";
    `python2 check_SIFTDB.py $db_file $check_outfile`;

}`

yywyaoyaowu commented 3 years ago

This is the config file: GENETIC_CODE_TABLE=1 GENETIC_CODE_TABLENAME=Standard MITO_GENETIC_CODE_TABLE=2 MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial

PARENT_DIR=/workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26 ORG=Potato_A6_26 ORG_VERSION=Potato_A6_26

DBSNP_VCF_FILE=Homo_sapiens.vcf.gz

Running SIFT 4G

SIFT4G_PATH=/bigdrive/sift4g/bin/sift4g

PROTEIN_DB=/bigdrive/SIFT_databases/uniprot_sprot.fasta

SIFT4G_PATH=/workdir/yw2326/SIFT/sift4g/bin/sift4g PROTEIN_DB=/workdir/yw2326/SIFT/PROTEIN_DB/uniref90.fasta

Sub-directories, don't need to change

GENE_DOWNLOAD_DEST=gene-annotation-src CHR_DOWNLOAD_DEST=chr-src LOGFILE=Log.txt ZLOGFILE=Log2.txt FASTA_DIR=fasta SUBST_DIR=subst ALIGN_DIR=SIFT_alignments SIFT_SCORE_DIR=SIFT_predictions SINGLE_REC_BY_CHR_DIR=singleRecords SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores DBSNP_DIR=dbSNP

Doesn't need to change

FASTA_LOG=fasta.log INVALID_LOG=invalid.log PEPTIDE_LOG=peptide.log ENS_PATTERN=ENS SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord

pauline-ng commented 3 years ago

Should be:


use strict;
use File::Basename;
use Cwd qw(abs_path);
my $directory_of_script = dirname(abs_path(__FILE__));
require $directory_of_script . '/common-utils.pl';

my $metafile=  "Potato_A6_26/Potato_A6_26";  # CAN YOU WRITE THE FULL PATH TO YOUR CONFIG FILE HERE?
my $meta_href = readMeta($metafile);
my %meta_hash = %{$meta_href};

my @chromosomes =  getChr ($meta_hash{"PARENT_DIR"} . "/" . $meta_hash{"GENE_DOWNLOAD_DEST"});

foreach my $chr (@chromosomes) {

        my $db_file = $meta_hash{"PARENT_DIR"} . "/" . $meta_hash{"SINGLE_REC_WITH_SIFTSCORE_DIR"} . "/" . $chr .  "_scores.Srecords.with_dbSNPid.sorted";
        my $check_outfile = $meta_hash{"PARENT_DIR"}  .  "/" . $meta_hash{"ORG_VERSION"} . "/" . $chr .  "_SIFTDB_stats.txt";
        `python2 check_SIFTDB.py $db_file $check_outfile`;
}
  1. Modify line 7 only. It should be the full path to your config file.
  2. This script needs to be in the folder: /workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB
pauline-ng commented 3 years ago

The only thing to set is $metafile

yywyaoyaowu commented 3 years ago

What's $metafile? I tried /workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26/Potato_A6_26 and /workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26 , neither is ok. use strict; use File::Basename; use Cwd qw(abs_path); my $directory_of_script = dirname(abs_path(FILE)); require $directory_of_script . '/common-utils.pl';

my $metafile=/workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26/Potato_A6_26 # SET PATH TO YOUR METAFILE HERE my $meta_href = readMeta($metafile); my %meta_hash = %{$meta_href};

my @chromosomes = getChr ($meta_hash{"PARENT_DIR"} . "/" . $meta_hash{"GENE_DOWNLOAD_DEST"});

foreach my $chr (@chromosomes) {

    my $db_file = $meta_hash{"PARENT_DIR"} . "/" . $meta_hash{"SINGLE_REC_WITH_SIFTSCORE_DIR"} . "/" . $chr .  "_scores.Srecords.with_dbSNPid.sorted";
    my $check_outfile = $meta_hash{"PARENT_DIR"}  .  "/" . $meta_hash{"ORG_VERSION"} . "/" . $chr .  "_SIFTDB_stats.txt";
    `python2 check_SIFTDB.py $db_file $check_outfile`;

}`

pauline-ng commented 3 years ago

$metafile should be set to the path of your config file

yywyaoyaowu commented 3 years ago

it still report errror, attched is the code I use:syntax error at check_SIFTDB_only.pl line 8, near "txt # SET PATH TO YOUR METAFILE HERE my " Global symbol "$meta_href" requires explicit package name (did you forget to declare "my $meta_href"?) at check_SIFTDB_only.pl line 8. Global symbol "$metafile" requires explicit package name (did you forget to declare "my $metafile"?) at check_SIFTDB_only.pl line 8. Global symbol "$meta_href" requires explicit package name (did you forget to declare "my $meta_href"?) at check_SIFTDB_only.pl line 9. Can't find string terminator "`" anywhere before EOF at check_SIFTDB_only.pl line 18.

use strict; use File::Basename; use Cwd qw(abs_path); my $directory_of_script = dirname(abs_path(FILE)); require $directory_of_script . '/common-utils.pl';

my $metafile=/workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26/Potato_A6_26_config.txt # SET PATH TO YOUR METAFILE HERE my $meta_href = readMeta($metafile); my %meta_hash = %{$meta_href};

my @chromosomes = getChr ($meta_hash{"PARENT_DIR"} . "/" . $meta_hash{"GENE_DOWNLOAD_DEST"});

foreach my $chr (@chromosomes) {

    my $db_file = $meta_hash{"PARENT_DIR"} . "/" . $meta_hash{"SINGLE_REC_WITH_SIFTSCORE_DIR"} . "/" . $chr .  "_scores.Srecords.with_dbSNPid.sorted";
    my $check_outfile = $meta_hash{"PARENT_DIR"}  .  "/" . $meta_hash{"ORG_VERSION"} . "/" . $chr .  "_SIFTDB_stats.txt";
    `python2 check_SIFTDB.py $db_file $check_outfile`;

}`

yywyaoyaowu commented 3 years ago

I also try my $metafile=/workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26/, didn't work

pauline-ng commented 3 years ago

Quotes are needed. Also, end with a semicolon.

my $metafile= "/workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26/Potato_A6_26_config.txt" ;

yywyaoyaowu commented 3 years ago

I change to python 2.7, and run perl check_SIFTDB_only.pl. It seems chr01_scores.Srecords.with_dbSNPid.sorted is not well generated. Attached is error I got Traceback (most recent call last): File "check_SIFTDB.py", line 301, in process_chr_file (infile, DNA_base_change_syn_nonsyn_counts, DNA_base_change_aa_pred, all_nonsyn_changes, dbSNP_predictions, novel_predictions, ref_predictions, aa_change_pred) File "check_SIFTDB.py", line 200, in process_chr_file siftdb_fp = open (siftdb_infile, "r") IOError: [Errno 2] No such file or directory: '/workdir/yw2326/SIFT/SIFT4G_Create_Genomic_DB/Potato_A6_26/singleRecords_with_scores/chr01_scores.Srecords.with_dbSNPid.sorted' Traceback (most recent call last): File "check_SIFTDB.py", line 301, in process_chr_file (infile, DNA_base_change_syn_nonsyn_counts, DNA_base_change_aa_pred, all_nonsyn_changes, dbSNP_predictions, novel_predictions, ref_predictions, aa_change_pred) File "check_SIFTDB.py", line 200, in process_chr_file siftdb_fp = open (siftdb_infile, "r")

pauline-ng commented 3 years ago

Sorry Yaoyao, The files were deleted when the script failed.

Edit make-SIFT-db-all.pl

Comment out lines 150 & 151 by putting a "#" in front:

# my $rm_dir = $meta_hash{"PARENT_DIR"} . "/" . $meta_hash{"SINGLE_REC_WITH_SIFTSCORE_DIR"} . "/*";
# system ("rm $rm_dir");

Then you'll have to regenerate the database again so the files will still be there.

yywyaoyaowu commented 3 years ago

I have re-construct the database, I go to // and check the data, "SIFT_alignments" is empty zcat .gz | more , zcat .gz | grep CDS | more , CHECK_GENES.LOG seem are all well produced. Then I checked the pre-files: ls -lt /singleRecords/ has files `ls fasta/ has files SIFT 4G Algorithm is running ls q/* with files

I have some file in the directory singleRecords, there were chr.invalidRecords; 'chr.singleRecords'; and 'chr.singleRecords_proteins.fa', ''but the file '*.singleRecords_noncoding.with_dbSNPid' is empty.

And also I have file in 'singleRecords_with_scores', the file 'chr*scores.Srecords.with_dbSNPid.sorted'. The two file 'chr01_scores.Srecords.with_dbSNPid.ERRR and chr01_scores.Srecords.with_dbSNPid.LOG were there but empty file.

yywyaoyaowu commented 3 years ago

There were error or warning 'Use of uninitialized value $aa2 in string eq at generate-fasta-subst-files-BIOPERL.pl line 621.' Use of uninitialized value $aa in concatenation (.) or string at make-single-records-BIOPERL.pl line 280.

pauline-ng commented 3 years ago

The errors above you should ignore, and it should still make a database.

pauline-ng commented 3 years ago

Closing due to inactivity