Closed fraca closed 6 years ago
Hi Marco,
Thanks for trying to build a SIFT database.
It should be OK to leave the 2nd column as the annotation source (Ensembl, Havana) rather than protein_coding.
Please add gene_biotype "protein_coding" to the rows labelled exon, CDS, stop_codon, and start_codon. The script will infer the 5' and 3' UTRs by taking the exonic regions that fall outside the span between the start_codon and the stop_codon.
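As a rough illustration of the step above, here is a minimal Python sketch that appends the attribute to the 9th column of exon, CDS, start_codon, and stop_codon rows and leaves the 2nd column alone. The function name `add_protein_coding_biotype` is hypothetical and not part of the SIFT scripts, and the sketch assumes every transcript in the file really is protein coding and that the file is a standard 9-column, tab-separated GTF.

```python
# Feature types (3rd GTF column) that should carry the attribute.
TARGET_FEATURES = {"exon", "CDS", "start_codon", "stop_codon"}
BIOTYPE_ATTR = 'gene_biotype "protein_coding";'


def add_protein_coding_biotype(line: str) -> str:
    """Return a GTF line with gene_biotype "protein_coding" appended if needed.

    Hypothetical helper: assumes the row belongs to a protein-coding gene.
    """
    # Pass comments and blank lines through unchanged.
    if line.startswith("#") or not line.strip():
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    # Skip malformed rows and feature types that do not need the attribute.
    if len(fields) != 9 or fields[2] not in TARGET_FEATURES:
        return line
    attrs = fields[8].strip()
    # Do not overwrite a biotype that is already present.
    if "gene_biotype" in attrs:
        return line
    if attrs and not attrs.endswith(";"):
        attrs += ";"
    fields[8] = f"{attrs} {BIOTYPE_ATTR}".strip()
    return "\t".join(fields) + newline
```

Applying the function to each line of the input GTF and writing the results to a new file would give a file in which only the four feature types are changed; gene and transcript rows pass through as they are.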
Thank you, Pauline
Updated the SIFT code to no longer require start/stop codon coordinates in order to build a database.
However, RNAs (such as miRNA, lincRNA) will not be annotated. UTRs are still annotated.
Thanks, Pauline
Hi Pauline,
This is not an issue, it is more a question about formatting. I'm trying to build a new database and I created my own GTF file. Which rows are mandatory (CDS, exon, gene, start_codon, stop_codon, transcript ...)?
In the readme you said that the 9th column (attribute column) says gene_biotype "protein_coding;" and the 2nd column in the GTF file says 'protein_coding'. The 2nd column in Homo_sapiens.GRCh38.83_trimmed.gtf.gz specifies the source database (Ensembl, Havana ...) and I think it is correct to leave it like that.
In the 9th column, in which rows (CDS, exon, gene, start_codon, stop_codon, transcript ...) do I have to put gene_biotype "protein_coding"?
Best,
Marco