pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

requirements gtf files #4

Closed fraca closed 6 years ago

fraca commented 7 years ago

Hi Pauline, this is not an issue, more a question about formatting. I'm trying to build a new database and have created my own GTF file. Which rows are mandatory (CDS, exon, gene, start_codon, stop_codon, transcript ...)? The readme says that the 9th column (the attribute column) should contain gene_biotype "protein_coding"; and that the 2nd column of the GTF file should say 'protein_coding'. In Homo_sapiens.GRCh38.83_trimmed.gtf.gz, however, the 2nd column specifies the source database (ensembl, havana ...), and I think it is correct to leave it like that. In the 9th column, in which rows (CDS, exon, gene, start_codon, stop_codon, transcript ...) do I have to put gene_biotype "protein_coding"? Best,

Marco

pauline-ng commented 7 years ago

Hi Marco,

Thanks for trying to build a SIFT database.

  1. It should be OK to leave 2nd column as Ensembl, Havana, and not protein_coding.

  2. Please add gene_biotype "protein_coding" to the rows labelled exon, CDS, stop_codon, and start_codon. The script infers the 5' and 3' UTRs by taking the exonic regions that fall outside the span between the start_codon and the stop_codon.

Thank you, Pauline
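
For anyone preparing a custom GTF along these lines, the attribute edit described in the reply above can be sketched in Python. This is a minimal illustration and not part of the SIFT4G scripts: the names `tag_protein_coding` and `tag_lines` are hypothetical, and it assumes that every transcript in the file really is protein coding, so filter out other genes first if that is not the case. The 2nd column (source) is left unchanged.

```python
# Hypothetical helper, not part of SIFT4G_Create_Genomic_DB.
# Assumption: every gene in the input GTF is protein coding.

TARGET_FEATURES = {"exon", "CDS", "start_codon", "stop_codon"}
BIOTYPE_ATTR = 'gene_biotype "protein_coding";'


def tag_protein_coding(line: str) -> str:
    """Return the GTF line with gene_biotype appended to column 9 if needed."""
    if line.startswith("#") or not line.strip():
        return line  # comment/header or blank line: leave untouched
    fields = line.rstrip("\n").split("\t")
    # Only touch well-formed 9-column rows of the four feature types.
    if len(fields) != 9 or fields[2] not in TARGET_FEATURES:
        return line
    attributes = fields[8].strip()
    if "gene_biotype" in attributes:
        return line  # a biotype is already present: do not overwrite it
    if attributes and not attributes.endswith(";"):
        attributes += ";"
    fields[8] = f"{attributes} {BIOTYPE_ATTR}".strip()
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline


def tag_lines(lines):
    """Apply tag_protein_coding to each line of an iterable (e.g. a file)."""
    for line in lines:
        yield tag_protein_coding(line)
```

To use it on a file, open the input and output GTFs and call `dst.writelines(tag_lines(src))`. Rows of other feature types (gene, transcript) pass through unchanged.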

pauline-ng commented 4 years ago

Updated the SIFT code to no longer require start/stop codon coordinates in order to build a database.

However, non-coding RNAs (such as miRNA, lincRNA) will not be annotated. UTRs are still annotated.

Thanks, Pauline
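
As an illustration of how UTRs can be derived without start_codon and stop_codon rows, here is a short Python sketch that takes the exonic regions lying outside the coding span of one transcript. It is not the SIFT implementation: the function name `infer_utrs` is hypothetical, and defining the coding span by the outermost CDS coordinates is an assumption made for this example. Coordinates are 1-based and inclusive, as in GTF.

```python
# Hypothetical sketch, not the SIFT4G code.
# Assumption: the coding span is bounded by the outermost CDS coordinates.
# Caveat: in GTF files that exclude the stop codon from the CDS rows, the
# stop codon ends up inside the 3' UTR intervals computed here.

def infer_utrs(exons, cds, strand="+"):
    """Split the exonic regions outside the CDS span into 5' and 3' UTRs.

    exons, cds: lists of (start, end) tuples for a single transcript.
    Returns a dict with 'five_prime' and 'three_prime' interval lists.
    """
    if not cds:
        # No coding region: nothing can be called a UTR in this sketch.
        return {"five_prime": [], "three_prime": []}

    coding_start = min(start for start, _ in cds)
    coding_end = max(end for _, end in cds)

    left, right = [], []  # lower and higher genomic coordinates
    for start, end in sorted(exons):
        if start < coding_start:
            left.append((start, min(end, coding_start - 1)))
        if end > coding_end:
            right.append((max(start, coding_end + 1), end))

    # On the minus strand the 5' end is at the higher coordinates.
    if strand == "-":
        left, right = right, left
    return {"five_prime": left, "three_prime": right}


exons = [(100, 200), (300, 400), (500, 600)]
cds = [(150, 200), (300, 400), (500, 550)]
print(infer_utrs(exons, cds, "+"))
# {'five_prime': [(100, 149)], 'three_prime': [(551, 600)]}
```

For a minus-strand transcript with the same coordinates, the two lists are swapped.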