How to obtain all annotation files and sequence files (tDNA) of a custom genome?

Hi,

Thank you for your interest in the software.

There are several files that you would need to generate and organize in a folder structure similar to that found here.

The `annotation` subfolder should contain:

TE.bed - BED6 file of TE annotation. We typically obtain this from UCSC RepeatMasker track.
- The name (column 4) is in the format: [class]:[family]:[subfamily]:[instance]. E.g. LTR:Gypsy:IDEFIX_LTR:IDEFIX_LTR_copy1.
exon.bed - BED6 file of gene exons. We typically obtain this from UCSC Refseq or RefGene tracks, but any gene annotation should be compatible.
- The name (column 4) is in the format [gene_id/gene_name]:[transcript_id]:exon_[exon_number]. E.g. CG11023:NM_175941.2:exon_1.
- We collapse identical exons from multiple transcripts into a non-redundant set of exonic positions, with the exon annotations joined in the name field and separated by commas (typically generated by bedtools groupBy). E.g.
```
chr2L   7528    8116    CG11023:NM_001169365.1:exon_0,CG11023:NM_001272857.1:exon_0,CG11023:NM_175941.2:exon_0  0       +
```
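As a rough illustration of the collapsing step, here is a minimal sketch that uses sort and awk as a stand-in for bedtools groupBy. The function name `collapse_bed6` is hypothetical, and it assumes a tab-separated BED6 input with one line per transcript exon; it is not necessarily how the original files were produced.

```shell
# Sketch: collapse BED6 records that share chrom/start/end/strand into one line,
# joining their names (column 4) with commas. Reads files given as arguments,
# or stdin when none are given.
collapse_bed6() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n -k6,6 -k4,4 "$@" |
  awk 'BEGIN { FS = OFS = "\t" }
  {
    key = $1 FS $2 FS $3 FS $6
    if (key != prev) {
      # New position: flush the previous group, then start a new one
      if (NR > 1) print c, s, e, names, 0, st
      c = $1; s = $2; e = $3; st = $6; names = $4; prev = key
    } else {
      names = names "," $4
    }
  }
  END { if (NR > 0) print c, s, e, names, 0, st }'
}

# Example usage (file names are placeholders):
#   collapse_bed6 exon.per_transcript.bed > exon.bed
```

The same function should work for the intron file, since it only looks at the coordinates, strand and name.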
hairpin.bed - BED6 file of miRNA hairpin annotation.
- We typically obtain this from the miRBase GFF, using the miRNA_primary_transcript entries
intron.bed - BED6 file of gene introns, obtained from the same source as the gene exons.
- We use a similar format in the name (column 4): [Gene ID]:[Transcript ID]:intron_[intron number]
- We also collapse identical introns from multiple transcripts into a non-redundant set of intronic positions, as with the exon.bed file.
miRNA.bed - BED6 file of mature miRNA.
- We typically obtain this from the miRBase GFF, using the miRNA entries
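For the two miRBase-derived files, a conversion along these lines might work. This is only a sketch: the function name `mirbase_gff_to_bed6` is hypothetical, and using the GFF `Name=` attribute as the BED name is an assumption, since the required name format for these files is not spelled out here. The chromosome names in the GFF also need to match those in genome.fa.

```shell
# Sketch: extract one feature type from a GFF3 file (read on stdin) as BED6.
# $1 is the feature type, e.g. miRNA_primary_transcript or miRNA.
mirbase_gff_to_bed6() {
  awk -v type="$1" 'BEGIN { FS = OFS = "\t" }
  /^#/ { next }                              # skip GFF header/comment lines
  $3 == type {
    name = "."
    n = split($9, attr, ";")
    for (i = 1; i <= n; i++)
      if (attr[i] ~ /^Name=/) name = substr(attr[i], 6)   # assumption: use Name= as BED name
    print $1, $4 - 1, $5, name, 0, $7        # GFF is 1-based inclusive; BED start is 0-based
  }'
}

# Example usage (the GFF file name is a placeholder):
#   mirbase_gff_to_bed6 miRNA_primary_transcript < mirbase.gff3 > hairpin.bed
#   mirbase_gff_to_bed6 miRNA                    < mirbase.gff3 > miRNA.bed
```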
piRNA_cluster.bed - BED6 file of piRNA cluster.
- We have used the piRNA db.
- If there are no known annotations, the files can be left blank (but must exist). As you can imagine, this would mean that those features would not be annotated.
structural_RNA.bed - BED6 file of structural RNA.
- We typically obtain this from UCSC RepeatMasker track, taking the rRNA, scRNA, snRNA, srpRNA and tRNA annotations.
- The name (column 4) is in the format: sncRNA:[sncRNA type]:[sncRNA name]:[sncRNA copy]
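To illustrate the RepeatMasker-derived name formats for TE.bed and structural_RNA.bed, here is a minimal sketch that converts a UCSC rmsk table dump into BED6. Several things are assumptions rather than part of the instructions above: the helper name `rmsk_to_bed6`, the 17-column tab-separated rmsk layout, the `[name]_copyN` numbering (in file order), and the choice to keep the structural RNA classes out of TE.bed. You would probably also want to drop other non-TE classes (such as simple repeats) from TE.bed.

```shell
# Sketch: turn a UCSC rmsk table dump (tab-separated, read on stdin) into BED6.
# Assumed rmsk columns: 6=genoName 7=genoStart 8=genoEnd 10=strand
#                       11=repName 12=repClass 13=repFamily
# Mode "sncRNA" keeps only the structural RNA classes; mode "TE" keeps the rest.
rmsk_to_bed6() {
  awk -v mode="$1" 'BEGIN {
    FS = OFS = "\t"
    n = split("rRNA scRNA snRNA srpRNA tRNA", cls, " ")
    for (i = 1; i <= n; i++) structural[cls[i]] = 1
  }
  {
    is_sn = ($12 in structural)
    if ((mode == "sncRNA") != is_sn) next      # row belongs to the other file
    copy = $11 "_copy" (++seen[$11])           # copy numbering is an assumption
    if (is_sn) name = "sncRNA:" $12 ":" $11 ":" copy
    else       name = $12 ":" $13 ":" $11 ":" copy
    print $6, $7, $8, name, 0, $10
  }'
}

# Example usage (file names are placeholders):
#   zcat rmsk.txt.gz | rmsk_to_bed6 TE     > TE.bed
#   zcat rmsk.txt.gz | rmsk_to_bed6 sncRNA > structural_RNA.bed
```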

The `sequence` subfolder should contain:

genome.fa - FASTA sequence of the genomic sequence
genome.fa.fai - FASTA index of genome.fa, generated by samtools faidx
rDNA.fa - FASTA of large and small ribosomal RNA subunit.
- We have used the SILVA database
rDNA.fa.fai - FASTA index of rDNA.fa, generated by samtools faidx

tDNA.fa - FASTA of tRNA sequences.

We have used the GtRNAdb, but the tRNA sequences could also be extracted from structural_RNA.bed as follows:

$ grep "tRNA" structural_RNA.bed | sed 's/sncRNA:tRNA://;s/:tRNA.*copy[0-9]*//' > tRNA.bed
$ bedtools getfasta -s -name -fi genome.fa -bed tRNA.bed -fo tDNA.fa
$ sed -i '/>/s/::/:/; />/s/(/:/; />/s/)//;' tDNA.fa

tDNA.fa.fai - FASTA index of tDNA.fa, generated by samtools faidx
bowtie_index subfolder
- genome.*.ebwt or genome.*.ebwtl - Bowtie index of genome FASTA, using genome as the prefix
- rDNA.*.ebwt - Bowtie index of rDNA FASTA, using rDNA as the prefix
- tDNA.*.ebwt - Bowtie index of tDNA FASTA, using tDNA as the prefix
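The index files listed above could be generated with something like the following sketch. It assumes that samtools and Bowtie (version 1, which provides `bowtie-build`) are on your PATH, and that the bowtie_index folder sits inside the sequence subfolder, as the listing suggests. The function name `build_indexes` is hypothetical.

```shell
# Sketch: build the FASTA (.fai) and Bowtie (.ebwt) indexes for the three FASTA files.
# $1 is the path to the sequence subfolder containing genome.fa, rDNA.fa and tDNA.fa.
build_indexes() {
  seq_dir=$1
  mkdir -p "$seq_dir/bowtie_index" || return 1
  for prefix in genome rDNA tDNA; do
    # Writes <prefix>.fa.fai next to the FASTA file
    samtools faidx "$seq_dir/$prefix.fa" || return 1
    # Writes bowtie_index/<prefix>.*.ebwt, using the file stem as the index prefix
    bowtie-build "$seq_dir/$prefix.fa" "$seq_dir/bowtie_index/$prefix" || return 1
  done
}

# Example usage (the path is a placeholder):
#   build_indexes /path/to/custom_genome/sequence
```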

All of this should be in a folder named after your custom genome build (e.g. for the human T2T build, we called the folder T2Tv2). TEsmall (as of version 2.0.5) should then be able to use it via the custom genome name, as long as the folder is located in the genomes subfolder of the folder indicated by --dbfolder.
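Before running TEsmall, it may help to check that nothing is missing from the custom genome folder. The sketch below is not part of TEsmall; the function name `check_genome_folder` is hypothetical, and it assumes that bowtie_index sits inside the sequence subfolder. It only checks that the files exist (empty annotation files pass), not that their contents are valid.

```shell
# Sketch: report files missing from a custom genome folder.
# $1 is the genome folder, e.g. <dbfolder>/genomes/<genome name>.
# Returns 0 if everything is present, 1 otherwise.
check_genome_folder() {
  dir=$1
  missing=0
  for f in TE exon hairpin intron miRNA piRNA_cluster structural_RNA; do
    [ -e "$dir/annotation/$f.bed" ] || { echo "missing: annotation/$f.bed"; missing=1; }
  done
  for p in genome rDNA tDNA; do
    for f in "$p.fa" "$p.fa.fai"; do
      [ -e "$dir/sequence/$f" ] || { echo "missing: sequence/$f"; missing=1; }
    done
    # Bowtie writes <prefix>.1.ebwt, or <prefix>.1.ebwtl for large genomes
    ls "$dir/sequence/bowtie_index/$p".1.ebwt* >/dev/null 2>&1 ||
      { echo "missing: sequence/bowtie_index/$p.*.ebwt"; missing=1; }
  done
  return "$missing"
}

# Example usage (the path is a placeholder):
#   check_genome_folder /path/to/dbfolder/genomes/T2Tv2 && echo "layout looks complete"
```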

I understand that this is a lot of information, and we could provide some help with your custom genome. However, we can't guarantee how easy or hard that will be, given the varying styles of annotation.

Please don't hesitate to reach out if you encounter major issues.

Thanks.
