pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0
21 stars 7 forks source link

Perl problem test data #86

Closed gubrins closed 9 months ago

gubrins commented 9 months ago

Hi,

I am running the test data to build a new database, this is the code I use: perl make-SIFT-db-all.pl -config test_files/homo_sapiens-test.txt

And this is my config file:

GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard
MITO_GENETIC_CODE_TABLE=2
MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial

PARENT_DIR=/mnt/DiscoA/sift4g_2/scripts_to_build_SIFT_db/test_files/homo_sapiens_small
ORG=homo_sapiens
ORG_VERSION=GRCh38.83
DBSNP_VCF_FILE=Homo_sapiens.vcf.gz

#Running SIFT 4G
SIFT4G_PATH=/mnt/DiscoA/sift4g_2/bin/sift4g
PROTEIN_DB=/home/goliath/software/funannotate/uniprot_sprot.fasta

# Sub-directories, don't need to change
GENE_DOWNLOAD_DEST=gene-annotation-src
CHR_DOWNLOAD_DEST=chr-src
LOGFILE=Log.txt
ZLOGFILE=Log2.txt
FASTA_DIR=fasta
SUBST_DIR=subst
ALIGN_DIR=SIFT_alignments
SIFT_SCORE_DIR=SIFT_predictions
SINGLE_REC_BY_CHR_DIR=singleRecords
SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores
DBSNP_DIR=dbSNP

# Doesn't need to change
FASTA_LOG=fasta.log
INVALID_LOG=invalid.log
PEPTIDE_LOG=peptide.log
ENS_PATTERN=ENS
SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord

Being this the output:

converting gene format to use-able input
done converting gene format
making single records file
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
done making single records template
making noncoding records file
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
done making noncoding records
make the fasta sequences
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
done making the fasta sequences
start siftsharp, getting the alignments
/mnt/DiscoA/sift4g_2/bin/sift4g -d /home/goliath/software/funannotate/uniprot_sprot.fasta -q /mnt/DiscoA/sift4g_2/scripts_to_build_SIFT_db/test_files/homo_sapiens_small/all_prot.fasta --subst /mnt/DiscoA/sift4g_2/scripts_to_build_SIFT_db/test_files/homo_sapiens_small/subst --out /mnt/DiscoA/sift4g_2/scripts_to_build_SIFT_db/test_files/homo_sapiens_small/SIFT_predictions --sub-results 
** Checking query data and substitutions files **
* processing queries: 100.00/100.00% *

** Searching database for candidate sequences **
* processing database part 2 (size ~0.25 GB): 100.00/100.00% *

** Aligning queries with candidate sequences **
* processing database part 1 (size ~1.00 GB): 100.00/100.00% *

** Selecting alignments with median threshold: 2.75 **
* processing queries: 100.00/100.00% *

** Generating SIFT predictions with sequence identity: 100.00% **
* processing queries: 100.00/100.00% *

done getting all the scores
populating databases
checking the databases
zipping up /mnt/DiscoA/sift4g_2/scripts_to_build_SIFT_db/test_files/homo_sapiens_small/chr-src/*
All done!

As you can see, it seems it finishes, but I am not sure about the previous messages about the flow operators. Also, what are exactly the main outputs of this? The ones found in the GRCh38.83 folder?

Thanks in advance!

pauline-ng commented 9 months ago

The previous messages are there to help debug if a problem exists.

The main output is what's found in the GRCh38.83 folder.

gubrins commented 9 months ago

Hi again, When trying with my own data, this is the logfile I get, however there are no files created in the cat9 folder:

Log file:

converting gene format to use-able input
done converting gene format
making single records file
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
Use of uninitialized value within @adj_ends in subtraction (-) at make-single-records-BIOPERL.pl line 202, <IN_TX> line 74946.
Use of uninitialized value $exon_end in numeric lt (<) at make-single-records-BIOPERL.pl line 218.
Use of uninitialized value $exon_end in subtraction (-) at make-single-records-BIOPERL.pl line 230.
done making single records template
making noncoding records file
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
done making noncoding records
make the fasta sequences
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
Use of uninitialized value within @adj_ends in subtraction (-) at generate-fasta-subst-files-BIOPERL.pl line 435, <IN_TX> line 74946.
Use of uninitialized value $exon_end in numeric lt (<) at generate-fasta-subst-files-BIOPERL.pl line 457.
Use of uninitialized value $exon_end in subtraction (-) at generate-fasta-subst-files-BIOPERL.pl line 471.
done making the fasta sequences
start siftsharp, getting the alignments
/mnt/DiscoA/sift4g/bin/sift4g -d /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/uniref90/uniref90.fasta -q /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/all_prot.fasta --subst /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/subst --out /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/SIFT_predictions --sub-results 
** Checking query data and substitutions files **
* processing queries: 100.00/100.00% *

** Searching database for candidate sequences **

And this is my config file:

GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard
MITO_GENETIC_CODE_TABLE=2
MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial

PARENT_DIR=/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome
ORG=felis_catus
ORG_VERSION=cat9
#DBSNP_VCF_FILE=Homo_sapiens.vcf.gz

#Running SIFT 4G
SIFT4G_PATH=/mnt/DiscoA/sift4g/bin/sift4g
PROTEIN_DB=/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/uniref90/uniref90.fasta

# Sub-directories, don't need to change
GENE_DOWNLOAD_DEST=gene-annotation-src
CHR_DOWNLOAD_DEST=chr-src
LOGFILE=Log.txt
ZLOGFILE=Log2.txt
FASTA_DIR=fasta
SUBST_DIR=subst
ALIGN_DIR=SIFT_alignments
SIFT_SCORE_DIR=SIFT_predictions
SINGLE_REC_BY_CHR_DIR=singleRecords
SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores
DBSNP_DIR=dbSNP

# Doesn't need to change
FASTA_LOG=fasta.log
INVALID_LOG=invalid.log
PEPTIDE_LOG=peptide.log
ENS_PATTERN=ENS
SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord

Any ideas why it is not working? Thanks!

pauline-ng commented 9 months ago

An internet search on this error

"Possible precedence issue with control flow operator"

suggests this is an issue with the bioperl version.

pauline-ng commented 9 months ago

Did the sample files work, but it's your own file that's not working?

gubrins commented 9 months ago

yes, the test data worked, as I got the message "All done!" For my data is not working... It is quite strange right?

pauline-ng commented 9 months ago

Is it possible you have special characters in your files? We test everything on linux, so if you run the command dos2unix on your input files, maybe that will get rid of special or hidden characters that are breaking the parser.

gubrins commented 9 months ago

as both genome and annotation as compressed, I cannot test it. In your tutorial it says that should be compressed, should I unzip them?

pauline-ng commented 9 months ago

The genome file should be uncompressed with just .fa That might be your problem. Just uncompress the genome file.

The gtf file should be compressed.

(Use the test smallhuman_config. as your template)

gubrins commented 9 months ago

I am rerunning everything again and seems to be working, I still have the same errors but this time is going further:

converting gene format to use-able input
done converting gene format
making single records file
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
Use of uninitialized value within @adj_ends in subtraction (-) at make-single-records-BIOPERL.pl line 202, <IN_TX> line 67568.
Use of uninitialized value $exon_end in numeric lt (<) at make-single-records-BIOPERL.pl line 218.
Use of uninitialized value $exon_end in subtraction (-) at make-single-records-BIOPERL.pl line 230.
done making single records template
making noncoding records file
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
done making noncoding records
make the fasta sequences
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
Use of uninitialized value within @adj_ends in subtraction (-) at generate-fasta-subst-files-BIOPERL.pl line 435, <IN_TX> line 67568.
Use of uninitialized value $exon_end in numeric lt (<) at generate-fasta-subst-files-BIOPERL.pl line 457.
Use of uninitialized value $exon_end in subtraction (-) at generate-fasta-subst-files-BIOPERL.pl line 471.
done making the fasta sequences
start siftsharp, getting the alignments
/mnt/DiscoA/sift4g/bin/sift4g -d /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/uniref90/uniref90.fasta -q /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/all_prot.fasta --subst /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/subst --out /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/SIFT_predictions --sub-results 
** Checking query data and substitutions files **
* processing queries: 100.00/100.00% *

** Searching database for candidate sequences **
* processing database part 322 (size ~0.25 GB): 100.00/100.00% *

** Aligning queries with candidate sequences **
* processing database part 1 (size ~1.00 GB): 50.00/100.00% *

Regarding the uncompressed genome file, in step 2b from Making a SIFT database from local genomic and gene annotation file (.gtf) it mentions the genome to be compressed. Should I be following that guideline or should I follow another one? Thanks!

pauline-ng commented 9 months ago

genome should be uncompressed.

gubrins commented 9 months ago

okey, I decompressed it and I ran it again. However, this is the output. It did finish with some errors:

done getting all the scores
populating databases
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365253.1.singleRecords_noncoding
Traceback (most recent call last):
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
    get_regions (chrom_file, out_file)
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
    pos = get_pos (first_line)
          ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
    return int (fields[0])
           ^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365285.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365292.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365356.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365383.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365393.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365409.1.singleRecords_noncoding
Traceback (most recent call last):
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
    get_regions (chrom_file, out_file)
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
    pos = get_pos (first_line)
          ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
    return int (fields[0])
           ^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365418.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365421.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365453.1.singleRecords_noncoding
Traceback (most recent call last):
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
    get_regions (chrom_file, out_file)
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
    pos = get_pos (first_line)
          ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
    return int (fields[0])
           ^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365497.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365551.1.singleRecords_noncoding
Traceback (most recent call last):
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
    get_regions (chrom_file, out_file)
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
    pos = get_pos (first_line)
          ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
    return int (fields[0])
           ^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365700.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365750.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365959.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019366105.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019366544.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019367653.1.singleRecords_noncoding
Traceback (most recent call last):
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
    get_regions (chrom_file, out_file)
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
    pos = get_pos (first_line)
          ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
    return int (fields[0])
           ^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019368997.1.singleRecords_noncoding
Traceback (most recent call last):
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
    get_regions (chrom_file, out_file)
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
    pos = get_pos (first_line)
          ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
    return int (fields[0])
           ^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019369694.1.singleRecords_noncoding
checking the databases
All done!

It finishes with "All done!" though, what do you think? The outputs found in the folder are files .gz and .regions files with the chromosome name, is that okay?

Thanks again!

pauline-ng commented 9 months ago

If you call python, does it call python3?

The SIFT scripts are expecting python3

gubrins commented 9 months ago

yes I have python3! Python 3.11.4

pauline-ng commented 9 months ago

And to be clear, when you type python, it calls python3 (and not python2?)

gubrins commented 9 months ago

Yes! Screenshot 2023-09-26 at 17 09 52

pauline-ng commented 9 months ago

Also, were you able to run the test files OK? (You checked that the databases were generated.)

gubrins commented 9 months ago

This is what I got from the test files: Screenshot 2023-09-26 at 17 10 48

pauline-ng commented 9 months ago

Wait -- if you said "The outputs found in the folder are files .gz and .regions files with the chromosome name, is that okay?"

If you have non-empty .gz and .regions files, you are good to go. If you like, you can paste contents from *_stats.txt file so I can check.

gubrins commented 9 months ago

yes, there are several of them, I imagine it did not find some small scaffolds (which are not needed as the main chromosomes are there). I have quite a lot of *_stats.txt files, so I add here one as an example if that's okay for you, thanks a lot for the help!

/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/cat9/NC_018734.3_SIFTDB_stats.txt
SYN 7644011
STOP-GAINED 155771
NONSYN  2735616

nonsyn/syn ratio: 0.36

All Counts: 58.6 1139728/1944994
dbSNP damaging: -1% 0/0
Not in dbSNP damaging: 58.6% 1139728/1944994

Reference predictions % damaging: 0.2 2390/1193141
% predicted on: 99.7679602915255
    A   C   G   T   
A   -1  62.7    55.3    65.7    
C   60.6    -1  55.8    51.1    
G   50.9    55.7    -1  60.6    
T   65.8    54.4    62.6    -1  

#####################################
pauline-ng commented 9 months ago

Yup, this is good. I'm looking at the line "% predicted on", and I see 98%. It's high, like it's supposed to be.

For the small scaffolds, maybe there weren't any coding genes and so SIFT didn't print anything out. Maybe it was the small scaffolds that were printing out the errors you saw.

You're good to go!

gubrins commented 9 months ago

Excellent!! Thanks a lot for the help and the patience Pauline!!