pcingola / SnpEff


how to build a .gtf for simple ATG-to-stop codon CDS references #500

Open avilella opened 9 months ago

avilella commented 9 months ago

I followed the instructions on how to build a .gtf file for a simple use case: a CDS without introns, running from the ATG to the stop codon.

When I create it in this manner:

cat Khanna_j4h_8_DMX8/genes.gtf | tabtk view
Khanna_j4h_8_DMX8       ensembl gene            1       1340    .       +       .       gene_id "Khanna_j4h_8_DMX8.1"; gene_type "protein_coding";
Khanna_j4h_8_DMX8       ensembl transcript      1       1337    .       +       .       parent "Khanna_j4h_8_DMX8.1"; gene_id "Khanna_j4h_8_DMX8.1"; transcript_id "Khanna_j4h_8_DMX8.2"; transcript_type "protein_coding";
Khanna_j4h_8_DMX8       ensembl exon            1       1337    .       +       .       transcript_id "Khanna_j4h_8_DMX8.2";
Khanna_j4h_8_DMX8       ensembl CDS             1       1337    .       +       0       transcript_id "Khanna_j4h_8_DMX8.2";

I get the following report from running the build command:

java -jar ~/snpEff/snpEff.jar build -gtf22 -v Khanna_j4h_8_DMX8

00:00:00 SnpEff version SnpEff 5.2 (build 2023-09-29 06:17), by Pablo Cingolani
00:00:00 Command: 'build'          
00:00:00 Building database for 'Khanna_j4h_8_DMX8'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'Khanna_j4h_8_DMX8'
00:00:00 Reading config file: /home/penguin/snpEff/data/snpEff.config
00:00:00 Reading config file: /home/penguin/snpEff/snpEff.config
00:00:00 done
00:00:00 Reading GTF22 data file  : '/home/penguin/snpEff/./data/Khanna_j4h_8_DMX8/genes.gtf'
00:00:00 Reading file '/home/penguin/snpEff/./data/Khanna_j4h_8_DMX8/genes.gtf'
00:00:00 
        Total: 3 markers added.
00:00:00 Create exons from CDS (if needed): 
00:00:00 Exons created for 0 transcripts.
00:00:00 Deleting redundant exons (if needed): 
00:00:00        Total transcripts with deleted exons: 0
00:00:00 Collapsing zero length introns (if needed):  
00:00:00        Total collapsed transcripts: 0
00:00:00        Reading sequences   :
00:00:00        FASTA file: '/home/penguin/snpEff/./data/genomes/Khanna_j4h_8_DMX8.fa' not found.
00:00:00        Reading FASTA file: '/home/penguin/snpEff/./data/Khanna_j4h_8_DMX8/sequences.fa'
00:00:00                Reading sequence 'Khanna_j4h_8_DMX8', length: 1340
00:00:00                Adding genomic sequences to genes: 
00:00:00        Done (1 sequences added).
00:00:00                Adding genomic sequences to exons: 
00:00:00        Done (1 sequences added, 0 ignored).
00:00:00        Total: 1 sequences added, 0 sequences ignored.
00:00:00 Finishing up genome
00:00:00 Adjusting transcripts: 
00:00:00 Adjusting genes: 
WARNING_GENE_COORDINATES: Gene 'Khanna_j4h_8_DMX8.1' (name:'Khanna_j4h_8_DMX8.1'), adjusting end coordinate from 1339 to 1336
00:00:00 Adjusting chromosomes lengths: 
00:00:00 Ranking exons: 
00:00:00 Create UTRs from CDS (if needed):
00:00:00 Correcting exons based on frame information.

00:00:00 
00:00:00 Remove empty chromosomes: 
00:00:00 Marking as 'coding' from CDS information: 
00:00:00 Done: 0 transcripts marked
00:00:00 
00:00:00 #-----------------------------------------------
# Genome name                : 'Khanna_j4h_8_DMX8'
# Genome version             : 'Khanna_j4h_8_DMX8'
# Genome ID                  : 'Khanna_j4h_8_DMX8[0]'
# Has protein coding info    : true
# Has Tr. Support Level info : true
# Genes                      : 1
# Protein coding genes       : 1
#-----------------------------------------------
# Transcripts                : 1
# Avg. transcripts per gene  : 1.00
# TSL transcripts            : 0
#-----------------------------------------------
# Checked transcripts        : 
#               AA sequences :      0 ( 0.00% )
#              DNA sequences :      0 ( 0.00% )
#-----------------------------------------------
# Protein coding transcripts : 1
#              Length errors :      1 ( 100.00% )
#  STOP codons in CDS errors :      0 ( 0.00% )
#         START codon errors :      0 ( 0.00% )
#        STOP codon warnings :      0 ( 0.00% )
#              UTR sequences :      0 ( 0.00% )
#               Total Errors :      1 ( 100.00% )
# WARNING                    : No protein coding transcript has UTR
#-----------------------------------------------
# Cds                        : 1
# Exons                      : 1
# Exons with sequence        : 1
# Exons without sequence     : 0
# Avg. exons per transcript  : 1.00
#-----------------------------------------------
# Number of chromosomes      : 1
# Chromosomes                : Format 'chromo_name size codon_table'
#               'Khanna_j4h_8_DMX8'     1340    Standard
#-----------------------------------------------

00:00:00 Caracterizing exons by splicing (stage 1) :  
00:00:00 Caracterizing exons by splicing (stage 2) :  
        00:00:00 done.          
00:00:00 [Optional] Rare amino acid annotations
WARNING_FILE_NOT_FOUND: Rare Amino Acid analysis: Cannot read protein sequence file '/home/penguin/snpEff/./data/Khanna_j4h_8_DMX8/protein.fa', nothing done.
ERROR: CDS check file '/home/penguin/snpEff/./data/Khanna_j4h_8_DMX8/cds.fa' not found.
ERROR: Protein check file '/home/penguin/snpEff/./data/Khanna_j4h_8_DMX8/protein.fa' not found.
ERROR: Database check failed.                                                                            
00:00:00 Logging                                                                                         
00:00:01 Checking for updates...                
00:00:02 Done.

Any ideas where the issue could be? Do the reference sequences need to have a minimal UTR?

pcingola commented 7 months ago

The error message says:

ERROR: CDS check file '/home/penguin/snpEff/./data/Khanna_j4h_8_DMX8/cds.fa' not found.
ERROR: Protein check file '/home/penguin/snpEff/./data/Khanna_j4h_8_DMX8/protein.fa' not found.
ERROR: Database check failed.       

This means that there were no reference CDS or protein FASTA files to check against. SnpEff will refuse to save a database that it has not checked.
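
For a single intronless, plus-strand CDS like the one in this issue, the two check files can be produced with a few lines of Python. This is a minimal sketch, not part of SnpEff: the helper names (`read_fasta`, `translate`, `cds_and_protein`, `write_check_files`) are hypothetical, the standard genetic code is assumed, and it assumes the check files use the transcript ID as the FASTA header. Deriving the "reference" from the same genome sequence only confirms that the GTF coordinates are self-consistent, so independently obtained CDS and protein sequences are preferable when available. The linked documentation also describes `-noCheckCds` and `-noCheckProtein` for skipping these checks.

```python
from pathlib import Path

# Standard genetic code (NCBI table 1), laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def read_fasta(text):
    """Parse FASTA text into {name: sequence}; name is the first word of the header."""
    parts, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif line and name is not None:
            parts[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in parts.items()}


def translate(cds):
    """Translate a plus-strand CDS; unknown codons become 'X', a trailing partial codon is dropped."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def cds_and_protein(genome, chrom, start, end):
    """Slice a 1-based, inclusive, plus-strand interval and translate it.

    The trailing stop ('*') is stripped from the protein; that is a choice
    made for this sketch, not a documented SnpEff requirement.
    """
    cds = genome[chrom][start - 1:end]
    return cds, translate(cds).rstrip("*")


def write_check_files(genome_fa, out_dir, chrom, start, end, transcript_id):
    """Write cds.fa and protein.fa into out_dir, using transcript_id as the header."""
    genome = read_fasta(Path(genome_fa).read_text())
    cds, protein = cds_and_protein(genome, chrom, start, end)
    out = Path(out_dir)
    (out / "cds.fa").write_text(f">{transcript_id}\n{cds}\n")
    (out / "protein.fa").write_text(f">{transcript_id}\n{protein}\n")
    return cds, protein
```

With the paths from the log, the call would look like `write_check_files("data/Khanna_j4h_8_DMX8/sequences.fa", "data/Khanna_j4h_8_DMX8", "Khanna_j4h_8_DMX8", 1, 1337, "Khanna_j4h_8_DMX8.2")`, with the start and end taken from the CDS line of the GTF.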

Here is the link to the documentation: https://pcingola.github.io/SnpEff/snpeff/build_db/#step-3-checking-the-database
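
Separately, the build report above lists `Length errors : 1 ( 100.00% )`. As far as can be told, SnpEff counts a length error when a transcript's CDS length is not a multiple of three. The CDS in the posted GTF spans 1 to 1337, which is 1337 bases, or 445 codons plus 2 leftover bases, so the end coordinate may be off by one. The sketch below checks this for any GTF; it assumes a tab-separated file (the listing above was reformatted with tabtk) and a `transcript_id` attribute on every CDS line. The function names are hypothetical.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def cds_lengths(gtf_text):
    """Sum the lengths of CDS features per transcript_id (GTF coordinates are 1-based, inclusive)."""
    totals = {}
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        start, end = int(fields[3]), int(fields[4])
        tid = match.group(1)
        totals[tid] = totals.get(tid, 0) + (end - start + 1)
    return totals


def not_multiple_of_three(gtf_text):
    """Return {transcript_id: leftover_bases} for CDS lengths that are not divisible by 3."""
    return {tid: n % 3 for tid, n in cds_lengths(gtf_text).items() if n % 3}
```

Run on the CDS line from this issue, `not_multiple_of_three` reports 2 leftover bases for `Khanna_j4h_8_DMX8.2`; an empty result would mean every CDS length is a whole number of codons.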