Hi,
Thank you for your interest in the software.
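There are several files that you would need to generate and organize in a folder structure similar to that found here.

The `annotation` subfolder should contain:

- `TE.bed` - BED6 file of TE annotation. We typically obtain this from the UCSC RepeatMasker track.
  - The name (column 4) is in the format `[class]:[family]:[subfamily]:[instance]`, e.g. `LTR:Gypsy:IDEFIX_LTR:IDEFIX_LTR_copy1`.
- `exon.bed` - BED6 file of gene exons. We typically obtain this from the UCSC RefSeq or RefGene tracks, but any gene annotation should be compatible.
  - The name (column 4) is in the format `[gene_id/gene_name]:[transcript_id]:exon_[exon_number]`, e.g. `CG11023:NM_175941.2:exon_1`.
  - We collapse identical exons from multiple transcripts into a non-redundant set of exonic positions, with the multiple exon annotations collapsed into the name and separated by `,` (typically generated by `bedtools groupBy`).
- `hairpin.bed` - BED6 file of miRNA hairpin annotation.
  - We take the `miRNA_primary_transcript` entries.
- `intron.bed` - BED6 file of gene introns, obtained from the same source as the gene exons.
  - The name (column 4) is in the format `[Gene ID]:[Transcript ID]:intron_[intron number]`.
  - ... `exons.bed` file.
- `miRNA.bed` - BED6 file of mature miRNA.
  - We take the `miRNA` entries.
- `piRNA_cluster.bed` - BED6 file of piRNA clusters.
  - If there are no known annotations, the files can be left blank (but must exist). This would mean that those features will not be annotated.
- `structural_RNA.bed` - BED6 file of structural RNA.
  - We typically obtain this from the UCSC RepeatMasker track, taking the `rRNA`, `scRNA`, `snRNA`, `srpRNA` and `tRNA` annotations.
  - The name (column 4) is in the format `sncRNA:[sncRNA type]:[sncRNA name]:[sncRNA copy]`.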
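
To illustrate the exon collapsing described for `exon.bed`, here is a minimal Python sketch of the same idea that `bedtools groupBy` is used for: records sharing the same chromosome, start, end and strand are merged, and their names are joined with `,`. The function name `collapse_exons`, the coordinates and the second transcript ID are hypothetical and only serve the illustration; this is not the exact procedure used to build the shipped annotations.

```python
from collections import OrderedDict


def collapse_exons(records):
    """Collapse BED6 records that share chrom/start/end/strand.

    `records` is an iterable of (chrom, start, end, name, score, strand).
    Names of merged records are joined with ","; the score of the first
    record seen at each position is kept.
    """
    grouped = OrderedDict()
    for chrom, start, end, name, score, strand in records:
        key = (chrom, int(start), int(end), strand)
        if key not in grouped:
            grouped[key] = {"names": [], "score": score}
        # Avoid repeating the same annotation twice in the collapsed name.
        if name not in grouped[key]["names"]:
            grouped[key]["names"].append(name)

    collapsed = [
        (chrom, start, end, ",".join(info["names"]), info["score"], strand)
        for (chrom, start, end, strand), info in grouped.items()
    ]
    # Sort by chromosome and position, as most BED tools expect.
    return sorted(collapsed, key=lambda r: (r[0], r[1], r[2]))


# Hypothetical input: two transcripts sharing their first exon.
example = [
    ("chr2L", 7529, 8116, "CG11023:NM_175941.2:exon_1", "0", "+"),
    ("chr2L", 7529, 8116, "CG11023:NM_HYPOTHETICAL.1:exon_1", "0", "+"),
    ("chr2L", 8193, 9484, "CG11023:NM_175941.2:exon_2", "0", "+"),
]
for row in collapse_exons(example):
    print("\t".join(str(field) for field in row))
```

The shared first exon ends up as a single line whose name column lists both transcript annotations, separated by a comma.
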

The `sequence` subfolder should contain:

- `genome.fa` - FASTA file of the genome sequence
- `genome.fa.fai` - FASTA index of `genome.fa`, generated by `samtools faidx`
- `rDNA.fa` - FASTA of the large and small ribosomal RNA subunits
- `rDNA.fa.fai` - FASTA index of `rDNA.fa`, generated by `samtools faidx`
- `tDNA.fa` - FASTA of tRNA sequences
  - Generated from `structural_RNA.bed`.
- `tDNA.fa.fai` - FASTA index of `tDNA.fa`, generated by `samtools faidx`
- `bowtie_index` subfolder
  - `genome.*.ebwt` or `genome.*.ebwtl` - Bowtie index of the genome FASTA, using `genome` as the prefix
  - `rDNA.*.ebwt` - Bowtie index of the rDNA FASTA, using `rDNA` as the prefix
  - `tDNA.*.ebwt` - Bowtie index of the tDNA FASTA, using `tDNA` as the prefix
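
Since `tDNA.fa` is tied to `structural_RNA.bed`, one possible first step is to pull out the tRNA records from that file. The Python sketch below is only an illustration and not necessarily how the original files were made: it assumes the `[sncRNA type]` field of the name is literally `tRNA`, and the helper name `select_trna_records` and the example entries are hypothetical. The selected lines could then be passed to a sequence extraction tool such as `bedtools getfasta`.

```python
def select_trna_records(bed_lines):
    """Return the BED6 lines whose name field marks them as tRNA.

    Assumes names follow sncRNA:[sncRNA type]:[sncRNA name]:[sncRNA copy]
    and that the type field for tRNAs is exactly "tRNA" (an assumption).
    """
    selected = []
    for line in bed_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # not a BED6 record
        parts = fields[3].split(":")
        if len(parts) >= 2 and parts[0] == "sncRNA" and parts[1] == "tRNA":
            selected.append(line)
    return selected


# Hypothetical entries, only to show the expected name layout.
example_lines = [
    "chr2L\t100\t172\tsncRNA:tRNA:tRNA-Ala-AGC:tRNA-Ala-AGC_copy1\t0\t+\n",
    "chr2L\t500\t620\tsncRNA:rRNA:5S:5S_copy1\t0\t-\n",
]
for record in select_trna_records(example_lines):
    print(record)
```
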

All of this should be in a folder named after your custom genome build (e.g. for the human T2T build, we called the folder `T2Tv2`). TEsmall (as of version 2.0.5) should then be able to use it via the custom genome name, as long as the folder is located in the `genomes` subfolder of the folder indicated by `--dbfolder`.
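
Before running TEsmall on a custom build, it may help to check that every expected file is in place. The following Python sketch is an illustration under stated assumptions: the helper `find_missing` is hypothetical (not part of TEsmall), and it assumes that `bowtie_index` sits inside the `sequence` subfolder, as the listing order above suggests.

```python
from pathlib import Path

ANNOTATION_FILES = [
    "TE.bed", "exon.bed", "hairpin.bed", "intron.bed",
    "miRNA.bed", "piRNA_cluster.bed", "structural_RNA.bed",
]
SEQUENCE_FILES = [
    "genome.fa", "genome.fa.fai",
    "rDNA.fa", "rDNA.fa.fai",
    "tDNA.fa", "tDNA.fa.fai",
]
INDEX_PREFIXES = ["genome", "rDNA", "tDNA"]


def find_missing(dbfolder, genome):
    """List expected files missing under <dbfolder>/genomes/<genome>."""
    root = Path(dbfolder) / "genomes" / genome
    missing = []
    for name in ANNOTATION_FILES:
        if not (root / "annotation" / name).is_file():
            missing.append(f"annotation/{name}")
    for name in SEQUENCE_FILES:
        if not (root / "sequence" / name).is_file():
            missing.append(f"sequence/{name}")

    # Assumed location of the Bowtie indices (see the note above).
    index_dir = root / "sequence" / "bowtie_index"
    for prefix in INDEX_PREFIXES:
        hits = []
        if index_dir.is_dir():
            # Accept both small (.ebwt) and large (.ebwtl) index files.
            hits = list(index_dir.glob(f"{prefix}.*.ebwt"))
            hits += list(index_dir.glob(f"{prefix}.*.ebwtl"))
        if not hits:
            missing.append(f"sequence/bowtie_index/{prefix}.*.ebwt")
    return missing
```

For example, `find_missing("/path/to/dbfolder", "T2Tv2")` returns an empty list when the layout is complete, where `/path/to/dbfolder` stands for whatever is passed to `--dbfolder`.
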

I understand that this is a lot of information, and we can provide some help with your custom genome. However, we can't guarantee how easy or hard it will be, given the varying styles of annotation.
Please don't hesitate to reach out if you encounter major issues.
Thanks.