mhammell-laboratory / TEsmall

A pipeline for profiling TE-derived small RNAs
GNU General Public License v3.0

Use non-supported organisms #8

Closed amitjavilaventura closed 2 years ago

amitjavilaventura commented 2 years ago

Hello,

I am working with small RNAs in species that are not in the list of supported organisms. I was wondering whether it is possible to use genomes other than those specified on the download site (https://labshare.cshl.edu/shares/mhammelllab/www-data/TEsmall/).

If so, are all the GTF files (i.e., GTFs for hairpins, miRNAs...) required?

Thank you very much.

Best regards, Adrià.

olivertam commented 2 years ago

Hi Adria,

Thank you for your interest in the software. It is possible to use genomes other than those we generated, though it would require some setup. You will need the following annotation files (stored in the annotation folder):

  1. BED file for TE (named TE.bed), typically generated from RepeatMasker or another repetitive sequence finder
  2. BED file for mature miRNA (named miRNA.bed) and pre-miRNA (named hairpin.bed), typically generated from miRBase
  3. BED file for gene exons (named exon.bed) and introns (named intron.bed), typically generated from RefSeq or other genic annotations
  4. BED file for structural RNA (named structural.bed), such as tRNA, snoRNA, snRNA, typically generated from RepeatMasker
  5. BED file for piRNA clusters (named piRNA_cluster.bed)

These files can be empty, in which case nothing will be annotated to those categories, but they need to exist.
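As a minimal sketch of the point above, the following Python snippet creates empty placeholders for any annotation files that are not yet present. The file names and the `annotation` folder come from the list above; the function name `ensure_annotation_files` and the example folder name are hypothetical and only for illustration.

```python
from pathlib import Path

# Annotation file names as listed above.
ANNOTATION_FILES = [
    "TE.bed",
    "miRNA.bed",
    "hairpin.bed",
    "exon.bed",
    "intron.bed",
    "structural.bed",
    "piRNA_cluster.bed",
]


def ensure_annotation_files(genome_dir):
    """Create empty placeholder BED files for any that are missing.

    Existing files are left untouched. Returns the names that were created.
    """
    annotation_dir = Path(genome_dir) / "annotation"
    annotation_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for name in ANNOTATION_FILES:
        path = annotation_dir / name
        if not path.exists():
            path.touch()  # empty file: nothing is annotated to this category
            created.append(name)
    return created
```

Calling `ensure_annotation_files("myGenome")` (a hypothetical folder name) after copying in the BED files you do have would fill in the remaining ones as empty files.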

Other files that are required (stored in the sequence folder):

  1. FASTA sequence (named genome.fa), and .fai (generated with samtools) for the genomic sequence. Please ensure that the chromosome names match the nomenclature in the annotation (e.g. chr1 in both, not chr1 and 1)
  2. FASTA sequence (named rDNA.fa) and .fai for rDNA sequences, which you can get from SILVA, or extract from the genome if you have a full list of ribosomal DNA locations
  3. Indices for the two FASTA files built with bowtie 1, stored in the subfolder bowtie_index
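The chromosome naming requirement in item 1 is easy to get wrong, so here is a rough sketch of a check that compares the names in a `.fai` index against the first column of a BED file. It relies only on the standard layouts of both formats (sequence name in the first column); the function names are hypothetical and not part of TEsmall.

```python
def fai_chromosomes(fai_path):
    """Return the set of sequence names in a samtools .fai index."""
    names = set()
    with open(fai_path) as handle:
        for line in handle:
            if line.strip():
                # .fai is tab-delimited; the first column is the sequence name
                names.add(line.split("\t")[0].strip())
    return names


def bed_chromosomes(bed_path):
    """Return the set of chromosome names used in a BED file."""
    names = set()
    with open(bed_path) as handle:
        for line in handle:
            # Skip blank lines and optional BED header lines
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            names.add(line.split()[0])
    return names


def unmatched_chromosomes(fai_path, bed_path):
    """List chromosome names in the BED file that are absent from the index."""
    return sorted(bed_chromosomes(bed_path) - fai_chromosomes(fai_path))
```

An empty result for every BED file suggests the nomenclature is consistent; a result such as `["1", "2"]` would point to the `chr1` versus `1` mismatch described above.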

An example of the folder structure, using dm6, can be seen here.

Once generated, the folder (e.g. dm6 in our example) should either be placed in the folder where other references are stored (default: $HOME/TEsmall_db/), or provided at run-time using the --dbfolder parameter (e.g. `--dbfolder /path/to/dm6`).
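Before running, it may help to confirm that the folder is complete. The sketch below lists whatever is missing from a custom genome folder, based on the files described above. Two things are assumptions: that the `.fai` files use the default samtools naming (`genome.fa.fai`, `rDNA.fa.fai`), and that any non-empty `bowtie_index` subfolder counts as present, since the expected index file names are not stated here. The function name is hypothetical.

```python
from pathlib import Path

# Relative paths taken from the description above.
# Assumption: .fai files follow the default samtools naming (<fasta>.fai).
REQUIRED_FILES = [
    "annotation/TE.bed",
    "annotation/miRNA.bed",
    "annotation/hairpin.bed",
    "annotation/exon.bed",
    "annotation/intron.bed",
    "annotation/structural.bed",
    "annotation/piRNA_cluster.bed",
    "sequence/genome.fa",
    "sequence/genome.fa.fai",
    "sequence/rDNA.fa",
    "sequence/rDNA.fa.fai",
]


def missing_reference_files(genome_dir):
    """Return a list of required entries that are absent from genome_dir."""
    root = Path(genome_dir)
    missing = [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
    index_dir = root / "sequence" / "bowtie_index"
    # Assumption: only check that the index folder exists and is not empty.
    if not index_dir.is_dir() or not any(index_dir.iterdir()):
        missing.append("sequence/bowtie_index/ (no index files found)")
    return missing
```

An empty list from `missing_reference_files("/path/to/dm6")` means every file named above is in place; it does not validate the contents of those files.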

Please feel free to contact us if you encounter any issues, and we can try to help.

Thanks.