Closed gubrins closed 1 year ago
The previous messages are there to help debug if a problem exists.
The main output is what's found in the GRCh38.83 folder.
Hi again, When trying with my own data, this is the logfile I get, however there are no files created in the cat9 folder:
Log file:
converting gene format to use-able input
done converting gene format
making single records file
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
Use of uninitialized value within @adj_ends in subtraction (-) at make-single-records-BIOPERL.pl line 202, <IN_TX> line 74946.
Use of uninitialized value $exon_end in numeric lt (<) at make-single-records-BIOPERL.pl line 218.
Use of uninitialized value $exon_end in subtraction (-) at make-single-records-BIOPERL.pl line 230.
done making single records template
making noncoding records file
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
done making noncoding records
make the fasta sequences
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
Use of uninitialized value within @adj_ends in subtraction (-) at generate-fasta-subst-files-BIOPERL.pl line 435, <IN_TX> line 74946.
Use of uninitialized value $exon_end in numeric lt (<) at generate-fasta-subst-files-BIOPERL.pl line 457.
Use of uninitialized value $exon_end in subtraction (-) at generate-fasta-subst-files-BIOPERL.pl line 471.
done making the fasta sequences
start siftsharp, getting the alignments
/mnt/DiscoA/sift4g/bin/sift4g -d /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/uniref90/uniref90.fasta -q /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/all_prot.fasta --subst /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/subst --out /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/SIFT_predictions --sub-results
** Checking query data and substitutions files **
* processing queries: 100.00/100.00% *
** Searching database for candidate sequences **
And this is my config file:
GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard
MITO_GENETIC_CODE_TABLE=2
MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial
PARENT_DIR=/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome
ORG=felis_catus
ORG_VERSION=cat9
#DBSNP_VCF_FILE=Homo_sapiens.vcf.gz
#Running SIFT 4G
SIFT4G_PATH=/mnt/DiscoA/sift4g/bin/sift4g
PROTEIN_DB=/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/uniref90/uniref90.fasta
# Sub-directories, don't need to change
GENE_DOWNLOAD_DEST=gene-annotation-src
CHR_DOWNLOAD_DEST=chr-src
LOGFILE=Log.txt
ZLOGFILE=Log2.txt
FASTA_DIR=fasta
SUBST_DIR=subst
ALIGN_DIR=SIFT_alignments
SIFT_SCORE_DIR=SIFT_predictions
SINGLE_REC_BY_CHR_DIR=singleRecords
SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores
DBSNP_DIR=dbSNP
# Doesn't need to change
FASTA_LOG=fasta.log
INVALID_LOG=invalid.log
PEPTIDE_LOG=peptide.log
ENS_PATTERN=ENS
SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord
Any ideas why it is not working? Thanks!
An internet search on this error
"Possible precedence issue with control flow operator"
suggests this is an issue with the bioperl version.
Did the sample files work, but it's your own file that's not working?
yes, the test data worked, as I got the message "All done!" For my data is not working... It is quite strange right?
Is it possible you have special characters in your files? We test everything on linux, so if you run the command dos2unix
on your input files, maybe that will get rid of special or hidden characters that are breaking the parser.
as both genome and annotation as compressed, I cannot test it. In your tutorial it says that should be compressed, should I unzip them?
The genome file should be uncompressed with just .fa That might be your problem. Just uncompress the genome file.
The gtf file should be compressed.
(Use the test smallhuman_config. as your template)
I am rerunning everything again and seems to be working, I still have the same errors but this time is going further:
converting gene format to use-able input
done converting gene format
making single records file
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
Use of uninitialized value within @adj_ends in subtraction (-) at make-single-records-BIOPERL.pl line 202, <IN_TX> line 67568.
Use of uninitialized value $exon_end in numeric lt (<) at make-single-records-BIOPERL.pl line 218.
Use of uninitialized value $exon_end in subtraction (-) at make-single-records-BIOPERL.pl line 230.
done making single records template
making noncoding records file
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
done making noncoding records
make the fasta sequences
Possible precedence issue with control flow operator at /home/goliath/miniconda3/lib/perl5/site_perl/5.22.0/Bio/DB/IndexedBase.pm line 791.
Use of uninitialized value within @adj_ends in subtraction (-) at generate-fasta-subst-files-BIOPERL.pl line 435, <IN_TX> line 67568.
Use of uninitialized value $exon_end in numeric lt (<) at generate-fasta-subst-files-BIOPERL.pl line 457.
Use of uninitialized value $exon_end in subtraction (-) at generate-fasta-subst-files-BIOPERL.pl line 471.
done making the fasta sequences
start siftsharp, getting the alignments
/mnt/DiscoA/sift4g/bin/sift4g -d /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/uniref90/uniref90.fasta -q /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/all_prot.fasta --subst /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/subst --out /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome/SIFT_predictions --sub-results
** Checking query data and substitutions files **
* processing queries: 100.00/100.00% *
** Searching database for candidate sequences **
* processing database part 322 (size ~0.25 GB): 100.00/100.00% *
** Aligning queries with candidate sequences **
* processing database part 1 (size ~1.00 GB): 50.00/100.00% *
Regarding the uncompressed genome file, in step 2b from Making a SIFT database from local genomic and gene annotation file (.gtf) it mentions the genome to be compressed. Should I be following that guideline or should I follow another one? Thanks!
genome should be uncompressed.
okey, I decompressed it and I ran it again. However, this is the output. It did finish with some errors:
done getting all the scores
populating databases
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365253.1.singleRecords_noncoding
Traceback (most recent call last):
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
get_regions (chrom_file, out_file)
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
pos = get_pos (first_line)
^^^^^^^^^^^^^^^^^^^^
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
return int (fields[0])
^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365285.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365292.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365356.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365383.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365393.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365409.1.singleRecords_noncoding
Traceback (most recent call last):
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
get_regions (chrom_file, out_file)
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
pos = get_pos (first_line)
^^^^^^^^^^^^^^^^^^^^
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
return int (fields[0])
^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365418.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365421.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365453.1.singleRecords_noncoding
Traceback (most recent call last):
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
get_regions (chrom_file, out_file)
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
pos = get_pos (first_line)
^^^^^^^^^^^^^^^^^^^^
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
return int (fields[0])
^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365497.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365551.1.singleRecords_noncoding
Traceback (most recent call last):
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
get_regions (chrom_file, out_file)
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
pos = get_pos (first_line)
^^^^^^^^^^^^^^^^^^^^
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
return int (fields[0])
^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365700.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365750.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019365959.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019366105.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019366544.1.singleRecords_noncoding
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019367653.1.singleRecords_noncoding
Traceback (most recent call last):
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
get_regions (chrom_file, out_file)
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
pos = get_pos (first_line)
^^^^^^^^^^^^^^^^^^^^
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
return int (fields[0])
^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019368997.1.singleRecords_noncoding
Traceback (most recent call last):
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 68, in <module>
get_regions (chrom_file, out_file)
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 31, in get_regions
pos = get_pos (first_line)
^^^^^^^^^^^^^^^^^^^^
File "/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/make_regions_file.py", line 8, in get_pos
return int (fields[0])
^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
Unable to read from /mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/singleRecords/NW_019369694.1.singleRecords_noncoding
checking the databases
All done!
It finishes with "All done!" though, what do you think? The outputs found in the folder are files .gz and .regions files with the chromosome name, is that okay?
Thanks again!
If you call python
, does it call python3?
The SIFT scripts are expecting python3
yes I have python3! Python 3.11.4
And to be clear, when you type python
, it calls python3 (and not python2?)
Yes!
Also, were you able to run the test files OK? (You checked that the databases were generated.)
This is what I got from the test files:
Wait -- if you said "The outputs found in the folder are files .gz and .regions files with the chromosome name, is that okay?"
If you have non-empty .gz and .regions files, you are good to go. If you like, you can paste contents from *_stats.txt file so I can check.
yes, there are several of them, I imagine it did not find some small scaffolds (which are not needed as the main chromosomes are there). I have quite a lot of *_stats.txt files, so I add here one as an example if that's okay for you, thanks a lot for the help!
/mnt/DiscoA/sift4g/scripts_to_build_SIFT_db/cat_genome2/cat9/NC_018734.3_SIFTDB_stats.txt
SYN 7644011
STOP-GAINED 155771
NONSYN 2735616
nonsyn/syn ratio: 0.36
All Counts: 58.6 1139728/1944994
dbSNP damaging: -1% 0/0
Not in dbSNP damaging: 58.6% 1139728/1944994
Reference predictions % damaging: 0.2 2390/1193141
% predicted on: 99.7679602915255
A C G T
A -1 62.7 55.3 65.7
C 60.6 -1 55.8 51.1
G 50.9 55.7 -1 60.6
T 65.8 54.4 62.6 -1
#####################################
Yup, this is good. I'm looking at the line "% predicted on", and I see 98%. It's high, like it's supposed to be.
For the small scaffolds, maybe there weren't any coding genes and so SIFT didn't print anything out. Maybe it was the small scaffolds that were printing out the errors you saw.
You're good to go!
Excellent!! Thanks a lot for the help and the patience Pauline!!
Hi,
I am running the test data to build a new database, this is the code I use:
perl make-SIFT-db-all.pl -config test_files/homo_sapiens-test.txt
And this is my config file:
Being this the output:
As you can see, it seems it finishes, but I am not sure about the previous messages about the flow operators. Also, what are exactly the main outputs of this? The ones found in the GRCh38.83 folder?
Thanks in advance!