pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

Database creation – can’t open input files for new organism #54

Closed Melanie-Wilkinson closed 2 years ago

Melanie-Wilkinson commented 2 years ago

Hi Pauline,

I have successfully created the trial human database but can’t create a mango database using the NCBI genome and GFF file (https://www.ncbi.nlm.nih.gov/genome/?term=txid29780[orgn]).

GTF file: I believe something is wrong with my GTF file, but I’m unsure how to fix it (error output below). Here are my GTF file and the GFF file that was used to create it: GCF_011075055.1_CATAS_Mindica_2.1_genomic.gff.gz, mangifera_indica.assembly_CATAS_Mindica_2.1_edited3.gtf.gz

This is what I have done:

  1. gene_biotype added: I have added gene_biotype "protein_coding" to column 9 for CDS, exon and transcript (wasn’t sure about this one so tried both).
  2. I don’t have start or stop codons in column 3 but I see you have “updated the SIFT code to no longer require start/stop codon coordinates in order to build a database.”
  3. I have checked read/write access with ls -l
  4. I have triple checked the folders are all in the right spot
  5. GFF vs GTF: In column 3, the GTF file only has exon, CDS and transcript; it’s missing the ‘gene’ feature that is in the GFF file (along with other categories, e.g. region, tRNA). Was there something wrong with the conversion from GFF to GTF?
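One way to compare the two annotation files is to tally the feature types in column 3 of each. The following is a minimal sketch, not part of the SIFT scripts; the function name `count_feature_types` is made up for illustration, and it assumes tab-separated GTF/GFF lines with `#` comment lines.

```python
import gzip
from collections import Counter


def count_feature_types(path):
    """Count the values in column 3 (feature type) of a GTF or GFF file.

    Hypothetical helper for illustration; handles plain or .gz files.
    """
    opener = gzip.open if path.endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip comment/header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return counts
```

Running it on both the GFF and the converted GTF and comparing the two counters would show which feature types (for example `gene`) were dropped in the conversion.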

Error code for: perl make-SIFT-db-all.pl -config mango_files/Alphonso_config.txt

entered mkdir /home/uqmwilk3/sift/scripts_to_build_SIFT_db/mango_files/mango_genome
/assembly_CATAS_Mindica_2.13/sift/scripts_to_build_SIFT_db/mango_files/mango_genome
converting gene format to use-able input
ls: cannot access /gene-annotation-src: No such file or directory
Unable to open for reading
done converting gene format
/*.gz: No such file or directorys_to_build_SIFT_db/mango_files/mango_genome
DNA files do not exist or did not unzip properly

##################################### Genome fasta file

##################################### No MT or plastid: I do not have a MT or plastid so do I just have to:

  1. dna_protein_subs.pl - remove all the code for both the “sub chr_is_plastid” and “sub chr_is_mito”?
  2. config file - remove these two lines: MITO_GENETIC_CODE_TABLE=2 MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial
pauline-ng commented 2 years ago

Hi Melanie,

I updated the SIFT scripts to be compatible with NCBI format. Please clone the latest repo and try running the mango genome again.

If you have no MT or plastid, then the MITO_GENETIC_CODE_TABLE* entries in the config file won't be used. You can delete those lines if you wish.

Thanks, Pauline

Melanie-Wilkinson commented 2 years ago

Thanks Pauline. I reran it with the updated sift scripts and no editing of the GTF file and got the errors below. No database was created. The first 2 lines of the error were repeated many times.

Warning: unable to close filehandle FILEPROT properly.
Warning: unable to close filehandle INVALID properly.
Warning: unable to close filehandle FILEPROT properly.
Warning: unable to close filehandle INVALID properly.
Warning: unable to close filehandle FILEPROT properly.
Warning: unable to close filehandle INVALID properly.
Warning: unable to close filehandle INVALID properly.
Warning: unable to close filehandle FILEPROT properly.
done making single records template
making noncoding records file
Subroutine Bio::DB::IndexedBase::_strip_crnl redefined at /sw7/RCC/BioPerl/1.007002/share/perl5/Bio/DB/IndexedBase.pm line 304.
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
cat: ./mango_files/mango_genome/fasta/*.fasta: No such file or directory
/home/uqmwilk3/sift2/scripts_to_build_SIFT_db/sift4g/bin/sift4g -d /home/uqmwilk3/sift2/SIFT_databases/uniref90.fasta -q ./mango_files/mango_genome/all_prot.fasta --subst ./mango_files/mango_genome/subst --out ./mango_files/mango_genome/SIFT_predictions --sub-results
Checking query data and substitutions files

EXITING! No valid queries to process.

pauline-ng commented 2 years ago

Please show me your config file and your directory structure.

You were able to run the test files OK?

Melanie-Wilkinson commented 2 years ago

I have run the test files with the new scripts and they worked fine and produced a database. I did move the SIFT4G_PATH and PROTEIN_DB files into a different folder for the mango run. I can rerun the test if you think moving these files might be the problem? (I double checked the paths were correct in the config file).

Here is my config file mango_config.txt

file structure:

/sift2/scripts_to_build_SIFT_db/mango_files> ls -l
total 1
-rw-r--r-- 1 uqmwilk3 qris-uq 820 Mar 28 17:20 mango_config.txt
drwxr-xr-x 12 uqmwilk3 qris-uq 4096 Mar 30 15:05 mango_genome

/.../scripts_to_build_SIFT_db/mango_files/mango_genome> ls -l
total 69
-rw-r--r-- 1 uqmwilk3 qris-uq     0 Mar 29 12:55 all_prot.fasta
drwxr-xr-x 2 uqmwilk3 qris-uq  4096 Mar 28 11:42 assembly_CATAS_Mindica_2.1
drwxr-xr-x 2 uqmwilk3 qris-uq  4096 Mar 29 12:55 chr-src
drwxr-xr-x 2 uqmwilk3 qris-uq  4096 Mar 28 11:42 dbSNP
drwxr-xr-x 2 uqmwilk3 qris-uq  4096 Mar 28 11:42 fasta
-rw-r--r-- 1 uqmwilk3 qris-uq     0 Mar 29 12:55 fasta.log
drwxr-xr-x 2 uqmwilk3 qris-uq  4096 Mar 28 11:42 gene-annotation-src
-rw-r--r-- 1 uqmwilk3 qris-uq     0 Mar 29 12:55 invalid.log
-rw-r--r-- 1 uqmwilk3 qris-uq     0 Mar 29 12:55 Log2.txt
drwxr-xr-x 2 uqmwilk3 qris-uq  4096 Mar 28 11:42 SIFT_alignments
drwxr-xr-x 2 uqmwilk3 qris-uq  4096 Mar 28 11:42 SIFT_predictions
drwxr-xr-x 2 uqmwilk3 qris-uq 65536 Mar 30 15:21 singleRecords
drwxr-xr-x 2 uqmwilk3 qris-uq  4096 Mar 28 11:42 singleRecords_with_scores
drwxr-xr-x 2 uqmwilk3 qris-uq  4096 Mar 28 11:42 subst

    /.../mango_files/mango_genome/chr-src> ls -l
    total 436752
    -rw-r--r-- 1 uqmwilk3 qris-uq  50331648 Mar 29 12:55 directory.index
    -rw-r--r-- 1 uqmwilk3 qris-uq 396894574 Mar 28 10:53 GCF_011075055.1_CATAS_Mindica_2.1_genomic.fna

    /.../mango_files/mango_genome/gene-annotation-src> ls -l
    total 295840
    -rw-r--r-- 1 uqmwilk3 qris-uq 267800908 Mar 28 10:54 GCF_011075055.1_CATAS_Mindica_2.1_genomic.gff
    -rw-r--r-- 1 uqmwilk3 qris-uq   6150491 Mar 28 11:01 GCF_011075055.1_CATAS_Mindica_2.1_genomic.gtf.gz
    -rw-r--r-- 1 uqmwilk3 qris-uq  17234542 Mar 28 17:21 noncoding.txt
    -rw-r--r-- 1 uqmwilk3 qris-uq  11729400 Mar 28 17:21 protein_coding_genes.txt
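The mango_genome listing above shows several zero-byte outputs (all_prot.fasta, fasta.log, invalid.log, Log2.txt), which is consistent with the "No valid queries to process" message. A small sketch for spotting such files after a run follows; `find_empty_files` is a hypothetical helper, not part of the SIFT scripts, and it only looks at the top level of the given directory.

```python
import os


def find_empty_files(directory):
    """Return the sorted names of zero-byte regular files in `directory`.

    Hypothetical helper for illustration; it does not recurse into
    subdirectories.
    """
    empty = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.stat().st_size == 0:
            empty.append(entry.name)
    return sorted(empty)
```

For example, `find_empty_files("./mango_files/mango_genome")` would list the empty files in the run directory shown above.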
pauline-ng commented 2 years ago
  1. Do you have write access to the directory ./mango_files/mango_genome ?

  2. Also, can you try changing PARENT_DIR in the mango_config.txt to the full path instead of the relative path?

Otherwise, your file structure and config file look correct. I tried running a smaller gene file (first 10K lines), and it worked for me.
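A quick way to confirm whether PARENT_DIR is a full path is sketched below. This is only an illustration: it assumes the config file consists of KEY=VALUE lines (as with MITO_GENETIC_CODE_TABLE=2), treats lines starting with `#` as comments (an assumption), and the helper names are hypothetical.

```python
import os


def read_config(path):
    """Parse KEY=VALUE lines into a dict, skipping blanks and '#' comments.

    Assumption: this mirrors the config layout only loosely; it is not
    the parser used by the SIFT scripts.
    """
    settings = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def parent_dir_is_absolute(config_path):
    """Return True if PARENT_DIR is set and is an absolute path."""
    parent_dir = read_config(config_path).get("PARENT_DIR", "")
    return os.path.isabs(parent_dir)
```

If `parent_dir_is_absolute("mango_files/mango_config.txt")` returns False, the value can be replaced with the output of `os.path.abspath` on the same path.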

This is my truncated file; it should generate a small database within 30 minutes. Please try moving your GCF_011075055.1_CATAS_Mindica_2.1_genomic.gtf.gz to another directory (temporarily) and see if this shorter file works for you: mangifera_indica.assembly_CATAS_Mindica_2.1_edited3.gtf.gz
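A truncated test file like this can be produced by copying only the first lines of the gzipped GTF. The sketch below is one possible way to do it, not necessarily how the file above was made; `truncate_gzip_text` is a hypothetical helper, and the 10,000-line default simply mirrors the "first 10K lines" mentioned earlier.

```python
import gzip
from itertools import islice


def truncate_gzip_text(src, dest, max_lines=10_000):
    """Copy the first `max_lines` lines of a gzipped text file to `dest`.

    Returns the number of lines written. Note that cutting at a fixed
    line count may split the records of the last transcript.
    """
    written = 0
    with gzip.open(src, "rt") as fin, gzip.open(dest, "wt") as fout:
        for line in islice(fin, max_lines):
            fout.write(line)
            written += 1
    return written
```

The output file should go into gene-annotation-src in place of the full annotation, so that only one GTF is present for the test run.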

pauline-ng commented 2 years ago

Here's my folder structure:

pauline@pauline-pc:/bigdrive/SIFT_databases/mango$ ls -lt chr-src
total 122756
-rw-r--r-- 1 pauline pauline      6461 Mar 30 21:11 directory.index.gz
-rw-rw-r-- 1 pauline pauline 125687063 Oct 20 10:28 GCF_011075055.1_CATAS_Mindica_2.1_genomic.fna.gz

ls -lt gene-annotation-src
total 24612
-rw-rw-r-- 1 pauline pauline  6187733 Mar 30 20:51 noncoding.txt
-rw-rw-r-- 1 pauline pauline  4242398 Mar 30 20:51 protein_coding_genes.txt
-rw-rw-r-- 1 pauline pauline  2182294 Mar 26 14:31 mangifera_indica.assembly_CATAS_Mindica_2.1_edited3.gtf.gz

The gtf file I downloaded from NCBI is different from yours but I don't think that matters, because I see a non-empty protein_coding_genes.txt in your folder, which means SIFT can correctly parse your gtf file.

Melanie-Wilkinson commented 2 years ago
  1. Yes I have write access to ./mango_files/mango_genome: drwxr-xr-x 3 uqmwilk3 qris-uq 4096 Mar 28 11:37 mango_files

  2. I will change to the full path in the next run.

GTF file: I should have clarified the GTF file requirements with you after your update of the script. My edited version (which is the one that I sent in my first post) has gene_biotype "protein_coding" added to column 9 for CDS, exon and transcript. The GTF I ran after your updates was unedited (original from NCBI). Should I be using the edited version with gene_biotype "protein_coding" added after your updates?

Thanks so much for your quick responses! I will run it with the smaller edited file.

pauline-ng commented 2 years ago

Hi Melanie,

Please use the NCBI file: mangifera_indica.assembly_CATAS_Mindica_2.1_edited3.gtf.gz that can be downloaded from the NCBI website.

I made changes in the code so that it would be NCBI-compatible, and not require edits on your part.

If my smaller file worked for you, download the full file from NCBI and run.

Thanks, Pauline

Melanie-Wilkinson commented 2 years ago

Hi Pauline,

Thanks for clarifying!

The error I had before was a disc space issue (sorry!). I just got your smaller file to create a database so I'm running the file from NCBI now. Thanks for your help!

Thanks, Mel
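Since the failure here turned out to be a disk space problem, a free-space check before a long run may help others who hit the same empty-output symptoms. The sketch below uses Python's standard `shutil.disk_usage`; the helper names and the threshold argument are hypothetical, and the thread does not state how much space a database build actually needs.

```python
import shutil


def free_gib(path):
    """Return the free space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_free_space(path, required_gib):
    """Return True if at least `required_gib` GiB are free at `path`.

    `required_gib` is a caller-chosen threshold (an assumption, not a
    documented SIFT requirement).
    """
    return free_gib(path) >= required_gib
```

For example, `has_free_space("./mango_files/mango_genome", 50)` would check for 50 GiB of free space in the run directory, where 50 is an arbitrary illustrative value.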