pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

Error in building database for new species #78

Closed rebecca-sh closed 10 months ago

rebecca-sh commented 1 year ago

Hello,

I've had some issues generating a database for my species. I have managed to get sift4g to run and it appears to be working. For example, this is the output from one of my runs: superscaffold16.txt

However, when I tried to run the SIFT4G Annotator, I realised that no .regions files are generated during my runs, no files populate my SIFT_alignments directory, and there are no files in my singleRecords_with_scores directory at the end of the run. Here is what the PARENT_DIR for one of my runs looks like:

shawr@EI-HPC interactive Super-Scaffold_16_PARENTDIR]$ ls -lthr
total 1.1M
drwxrwx--- 2 shawr EI_ga011    0 Mar 25 11:19 SIFT_alignments
drwxrwx--- 2 shawr EI_ga011    0 Mar 25 11:19 dbSNP
-rwxrwx--- 1 shawr EI_ga011    0 Mar 25 11:20 invalid.log
-rwxrwx--- 1 shawr EI_ga011    0 Mar 25 11:20 Log2.txt
drwxrwx--- 2 shawr EI_ga011 6.0K Mar 25 11:20 subst
drwxrwx--- 2 shawr EI_ga011 4.9K Mar 25 11:20 fasta
-rwxrwx--- 1 shawr EI_ga011  49K Mar 25 11:20 peptide.log
-rwxrwx--- 1 shawr EI_ga011   71 Mar 25 11:20 fasta.log
-rwxrwx--- 1 shawr EI_ga011  51K Mar 25 11:20 all_prot.fasta
drwxrwx--- 2 shawr EI_ga011  12K Mar 25 12:09 SIFT_predictions
drwxrwx--- 2 shawr EI_ga011  291 Mar 25 12:09 singleRecords
drwxrwx--- 2 shawr EI_ga011  110 Mar 25 12:10 Super-Scaffold_16
drwxrwx--- 2 shawr EI_ga011    0 Mar 25 12:10 singleRecords_with_scores
drwxrwx--- 2 shawr EI_ga011   77 Mar 25 12:12 chr-src
drwxrwx--- 2 shawr EI_ga011  112 Mar 25 12:16 gene-annotation-src

I have also removed any protein sequences that had any unwanted characters - 'X', '*', '-'
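For readers who want to reproduce that filtering step, here is a minimal sketch in Python. It assumes the proteins are in plain FASTA format and that any sequence containing 'X', '*' or '-' should be dropped entirely. The function names and the example records are hypothetical and are not part of the SIFT4G scripts.

```python
# Characters mentioned in the post as unwanted in protein sequences.
UNWANTED = set("X*-")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_unwanted(records, unwanted=UNWANTED):
    """Keep only records whose sequence contains none of the unwanted characters."""
    return [(h, s) for h, s in records if not (set(s.upper()) & unwanted)]


# Made-up example records, for illustration only.
example = ">prot1\nMKTAYIAK\n>prot2\nMKX*AY\n>prot3\nMK-TA\n"
kept = drop_unwanted(parse_fasta(example))
print([h for h, _ in kept])
```

Dropping whole records, rather than stripping the offending characters, keeps the remaining sequences unmodified.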

Any help would be really appreciated! Many thanks,

Becky

pauline-ng commented 1 year ago

Hi,

Can you list the files in your Super-Scaffold_16 directory? (Can you ls and show the filenames)

Do your VCF file chromosome names match the filenames in the Super-Scaffold_16 directory?

rebecca-sh commented 1 year ago

Hello,

Thanks for your quick response! Here are all the files in the directory superscaffold16dir.txt

Yes all chromosome names match. Thanks for checking this!

pauline-ng commented 1 year ago

Can you attach or show me what's in

Super-Scaffold_16/CHECK_GENES.LOG ?

rebecca-sh commented 1 year ago

cat CHECK_GENES.LOG
Chr                Genes with SIFT Scores  Pos with SIFT scores  Pos with Confident Scores
Super-Scaffold_16  100 (93/93)             100 (323194/323194)   74(238145/323194)
ALL                100 (93/93)             100 (323194/323194)   74(238145/323194)

pauline-ng commented 1 year ago

Thanks, your database is built correctly.

Can you show me the first few lines of your VCF file ?

cat <vcf_file> | grep -v ^# | head -5

rebecca-sh commented 1 year ago

cat dm_superscaffold16.vcf | grep -v ^# | head -5
Super-Scaffold_16 187620 . C T 100 . MUSNIGT00018393
Super-Scaffold_16 29864325 . G A 100 . MUSNIGT00001300
Super-Scaffold_16 4963050 . C A 100 . MUSNIGT00018314
Super-Scaffold_16 32415183 . G A 100 . MUSNIGT00001351
Super-Scaffold_16 32390272 . A T 100 . MUSNIGT00001351

pauline-ng commented 1 year ago

Hi Rebecca,

The 8th column "INFO" in the VCF file is required. Please add that column.

https://www.internationalgenome.org/wiki/Analysis/vcf4.0
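As a rough way to check for this, the sketch below counts the tab-separated fields on each data line and reports lines with fewer than the eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function name and the example records are hypothetical. It only counts fields and does not validate their contents.

```python
# The eight fixed columns defined by the VCF format.
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def lines_missing_columns(vcf_text, required=len(VCF_FIXED_COLUMNS)):
    """Return (line_number, field_count) for data lines with too few tab-separated fields."""
    problems = []
    for number, line in enumerate(vcf_text.splitlines(), start=1):
        # Skip blank lines and header/meta lines starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < required:
            problems.append((number, len(fields)))
    return problems


# Made-up example: the last record stops after FILTER and has no INFO field.
example = (
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tC\tT\t100\t.\t.\n"
    "chr1\t200\t.\tG\tA\t100\t.\n"
)
print(lines_missing_columns(example))
```

A record with no annotations can carry a single "." in the INFO column, as in the first example record.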

rebecca-sh commented 1 year ago

Thanks for your feedback Pauline! I'm using a different ferret genome to try to annotate my variants, but it still doesn't seem to be working - I still get an error that the .regions file could not be found.

I have tried to annotate my variants with the already established ferret database by aligning the coordinates and this seems to work. I think it could be an issue with the install because, looking closer at some of my runs, this is the error I am getting:

/opt/scripts_to_build_SIFT_db/make_regions_file.py: line 1: import: command not found
/opt/scripts_to_build_SIFT_db/make_regions_file.py: line 2: import: command not found
/opt/scripts_to_build_SIFT_db/make_regions_file.py: line 3: import: command not found
/opt/scripts_to_build_SIFT_db/make_regions_file.py: line 4: import: command not found
/opt/scripts_to_build_SIFT_db/make_regions_file.py: line 6: syntax error near unexpected token `('
/opt/scripts_to_build_SIFT_db/make_regions_file.py: line 6: `def get_pos (line):'

/usr/bin/env: python3: No such file or directory
/usr/bin/env: python3: No such file or directory
/usr/bin/env: python3: No such file or directory
/usr/bin/env: python3: No such file or directory
/usr/bin/env: python3: No such file or directory
/usr/bin/env: python3: No such file or directory

I will check how the software was compiled.

Thanks again

pauline-ng commented 1 year ago

You need python3 installed. Once you have python3, rerun the generation of the database; you should then have .regions files in the folder.
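After rerunning the build, one quick way to confirm the fix is to list the regions files in the output folder. The sketch below assumes they carry a `.regions` extension and sit directly inside the chromosome folder. The function name and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def find_regions_files(db_dir):
    """Return the sorted names of all *.regions files directly inside db_dir."""
    return sorted(p.name for p in Path(db_dir).glob("*.regions"))


# Demo on a throwaway directory; the file names below are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Super-Scaffold_16.regions").touch()
    (Path(tmp) / "CHECK_GENES.LOG").touch()
    found = find_regions_files(tmp)
    print(found if found else "no .regions files found; the build step likely failed")
```

In practice you would pass the path of your own database folder instead of the temporary directory.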