How to obtain rDNA of other custom genomes? & Singularity/Docker?

olivertam commented 1 year ago

Hi,

Thank you for your interest in the software. We obtain the coordinates of ribosomal DNA in the genome of interest from the output of RepeatMasker (or other repetitive sequence identifier). We then obtain the FASTA sequence from the coordinates (using tools like bedtools. If you do not have any ribosomal DNA coordinates, you can try taking a rRNA databse (e.g. SILVA, and either use their sequences as is, or align those sequences to your genome sequences (perhaps with some mismatch) to identify which ones are actually present in your genome of interest.

Please let me know if this does not address your question.

Thanks.

LIUXING-bio commented 1 year ago

Thank, In addition, I found that I cannot download this software. What is the problem? Or is there any installation package provided besides git clone?

olivertam commented 1 year ago

Hi,

Which software are you having trouble downloading? I am able to download TEsmall from github without issue.

Thanks.

LIUXING-bio commented 1 year ago

Hi, Is there a way to install software using singularity pull?

olivertam commented 1 year ago

Hi,

Unfortunately, I'm not an expert on Docker/Singularity, but I tried making a Docker container of TEsmall:

$ docker pull mhammelllab/tesmall

Let me know if it works, though I can't guarantee that I can fix the issues if they pop up.

Thanks.

LIUXING-bio commented 1 year ago

Sorry, it do not seem to work very well,.I would like to ask if it is possible to use the TEsmall software for singularity, just like TEtranscripts，this will encourage more people to use it, because there will be many problems during the installation and operation process. Thank you.

LIUXING-bio commented 1 year ago

python setup.py install running install /root/miniconda3/envs/TEsmall/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools. setuptools.SetuptoolsDeprecationWarning, /root/miniconda3/envs/TEsmall/lib/python3.7/site-packages/setuptools/command/easy_install.py:159: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools. EasyInstallDeprecationWarning, running bdist_egg running egg_info creating TEsmall.egg-info writing TEsmall.egg-info/PKG-INFO writing dependency_links to TEsmall.egg-info/dependency_links.txt writing entry points to TEsmall.egg-info/entry_points.txt writing requirements to TEsmall.egg-info/requires.txt writing top-level names to TEsmall.egg-info/top_level.txt writing manifest file 'TEsmall.egg-info/SOURCES.txt' reading manifest file 'TEsmall.egg-info/SOURCES.txt' reading manifest template 'MANIFEST.in' warning: no files found matching 'environment.txt'

olivertam commented 1 year ago

Hi,

Let me look into it further.

Thanks.

olivertam commented 1 year ago

Hi,

This is currently the best attempt:

$ singularity pull docker://mhammelllab/tesmall:latest
$ singularity exec tesmall_latest.sif TEsmall -h
usage: TEsmall [-h] [-a STR] [-m INT] [-M INT] [-g STR] [--maxaln INT]
               [--mismatch INT] [-o STR [STR ...]] [-p INT] [-f STR [STR ...]]
               [-l STR [STR ...]] [--dbfolder STR] [--verbose INT] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -a STR, --adapter STR
                        Sequence of an adapter that was ligated to the 3' end.
                        The adapter itself and anything that follows is
                        trimmed. (default: TGGAATTCTCGGGTGCCAAGG)
  -m INT, --minlen INT  Discard trimmed reads that are shorter than INT. Reads
                        that are too short even before adapter removal are
                        also discarded. (default: 16)
  -M INT, --maxlen INT  Discard trimmed reads that are longer than INT. Reads
                        that are too long even before adapter removal are also
                        discarded. (default: 36)
  -g STR, --genome STR  Version of reference genome. default: hg38)
  --maxaln INT          Suppress all alignments for a particular read if more
                        than INT reportable alignments exist for it. (default:
                        100)
  --mismatch INT        Report alignments with at most INT mismatches.
                        (default: 0)
  -o STR [STR ...], --order STR [STR ...]
                        Annotation priority. (default: structural_RNA miRNA
                        hairpin exon TE intron piRNA_cluster)
  -p INT, --parallel INT
                        Parallel execute by INT CPUs. (default: 1)
  -f STR [STR ...], --fastq STR [STR ...]
                        Input in FASTQ format. Compressed input is supported
                        and auto-detected from the filename extension (.gz).
  -l STR [STR ...], --label STR [STR ...]
                        Unique label for each sample.
  --dbfolder STR        Custom location of TEsmall database folder (containing
                        the "genomes" folder). Defaults to $HOME/TEsmall_db/
  --verbose INT         Set verbose level. 0: only show critical message, 1:
                        show additional warning message, 2: show process
                        information, 3: show debug messages. DEFAULT:2
  -v, --version         show program's version number and exit

Please feel free to try it, though I cannot guarantee that it will have no errors.

Thanks.

LIUXING-bio commented 1 year ago

Hi, I want to know if this is right ([E::idx_find_and_load] Could not retrieve index file for '*.bam')

(base) lab:TEsmall $ singularity exec ../../biosoft/tesmall_latest.sif TEsmall -f ../01siRNA/717-719/717/SRR4896717.fastq ../01siRNA/717-719/719/SRR4896719.fastq -l 717 719 --dbfolder TEsmall_db/ 2023-10-25 16:00:39,792 INFO Checking if reference genome and annotation files exist... 2023-10-25 16:00:39,792 INFO Genome and annotation files present 2023-10-25 16:00:39,792 INFO Trimming 3' adapters... Done 00:04:21 25,339,801 reads @ 10.3 µs/read; 5.81 M reads/minute 2023-10-25 16:05:01,534 INFO Trimming 5' adapters... Done 00:02:34 24,746,143 reads @ 6.2 µs/read; 9.63 M reads/minute 2023-10-25 16:07:35,891 INFO Removing rRNA-derived reads... 2023-10-25 16:19:41,667 INFO Aligning reads to reference sequences... [E::idx_find_and_load] Could not retrieve index file for '717.genome.bam' 2023-10-25 16:26:12,941 INFO Aligning CCA trimmed reads to tRNA sequences... 2023-10-25 16:26:28,934 INFO Finding 3' tRF mappers... 2023-10-25 16:28:59,852 INFO Assigning 3' tRFs to transposable elements... [E::idx_find_and_load] Could not retrieve index file for '717.trna_for_intersect.bam' 2023-10-25 16:29:48,917 INFO Assigning 3' tRFs to transposable elements... [E::idx_find_and_load] Could not retrieve index file for '717.3trf.bam' 2023-10-25 16:30:02,012 INFO Assigning reads to genomic features... [E::idx_find_and_load] Could not retrieve index file for '717.3trf_free.bam' [E::idx_find_and_load] Could not retrieve index file for '717.3trf_free.bam' [E::idx_find_and_load] Could not retrieve index file for '717.3trf_free.bam' [E::idx_find_and_load] Could not retrieve index file for '717.3trf_free.bam' [E::idx_find_and_load] Could not retrieve index file for '717.3trf_free.bam

olivertam commented 1 year ago

Hi,

Thank you for your feedback. I was going to suggest (from your previous comments) to use the --dbfolder parameter, but I noticed that you did. The [E::idx_find_and_load] Could not retrieve index file for '*.bam' warning is a known quirk of pysam (see #11 and #15 for other reports of this). It has no impact on the running of the tool, so if those are the only errors, then the tool should run fine.

Thanks for testing the singularity container.

LIUXING-bio commented 1 year ago

Thank you for your help. The purpose of running this software is to compare the differences in siRNA annotated to TE elements in two samples. Can this be obtained by modifying a certain parameter?

olivertam commented 1 year ago

You should be able to obtain the small RNAs annotated to TE elements from the count summary file. You can then try to run differential analysis to see if any TE elements show altered small RNA counts. Please be aware that since small RNAs could change en-mass in an experiment, it might break the assumptions used by differential analysis algorithms designed for RNAseq.

If you are specifically looking at siRNA, you might need to dig into the dataset more, as you might want to restrict the read length of interest (you should be able to see the read length distribution for each annotation type in [prefix].anno.rlen.info). You might then want to process the [prefix].anno file, which contains the annotation of each read (as well as the rlen), and filter out those reads that meet your criteria, then summarize them for each library, and then combine the two (or more) libraries together to a final count table. You can then try to do differential analysis.

Hope this is helpful.

Thanks.

LIUXING-bio commented 8 months ago

May I ask if you can help me obtain the bat's annotation file：https://hgdownload.soe.ucsc.edu/hubs/GCF/000/325/575/GCF_000325575.1/ Thanks

Closed #17 as completed.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

olivertam commented 8 months ago

Hi,

I was able to generate the following:

TE.bed
structural_RNA.bed
exons.bed
introns.bed

They can be downloaded from here.

You will need still need these files (they can be empty if no annotations exist):

hairpin.bed
miRNA.bed
piRNA_cluster.bed

Thanks.

LIUXING-bio commented 8 months ago

Thank you very much.

Hi,

I was able to generate the following:

TE.bed structural_RNA.bed exons.bed introns.bed

They can be downloaded from here.

You will need still need these files (they can be empty if no annotations exist):

hairpin.bed miRNA.bed piRNA_cluster.bed

Thanks.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

olivertam commented 8 months ago

Hi,

I just found a few error in the exon and intron BED files. Please replace those with the ones attached here: exon.bed.gz intron.bed.gz

Thanks.

LIUXING-bio commented 8 months ago

Sorry, I can not open the download link you provided.

Hi,

I just found a few error in the exon and intron BED files. Please replace those with the ones attached here: exon.bed.gz intron.bed.gz

Thanks.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

LIUXING-bio commented 8 months ago

If possible, please send it to me like exon and intron file, thanks.

Hi,

I just found a few error in the exon and intron BED files. Please replace those with the ones attached here: exon.bed.gz intron.bed.gz

Thanks.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

olivertam commented 8 months ago

Hi,

You should be able to download it now. Unfortunately, the files are too big to attach to GitHub

Thanks.

LIUXING-bio commented 8 months ago

Thank you, this software brings a lot of convenience, but changing the format seems troublesome. Are there any relevant scripts or commands provided to facilitate the use of more genomes in the future?

Hi,

You should be able to download it now.

Thanks.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

olivertam commented 8 months ago

Hi,

Due to the huge variability in the types of annotations available, the best I can do is to describe the formats of the various files.

The custom genome should have two subfolders:

annotation
sequence

In the annotation subfolder, these files should be present (empty if no annotation exists for the genome)

TE.bed - BED6 file, where the name (column 4) is in the format of [Class]:[Family]:[Element]:[Instance]. E.g. LTR:Gypsy:IDEFIX_LTR:IDEFIX_LTR_copy1. We typically get this from RepeatMasker output.
exon.bed - BED6 file, where the name (column 4) is in the format of [Gene ID]:[Transcript ID]:exon_[exon number]. E.g. CG11023:NM_175941.2:exon_1. We collapse exons from multiple transcripts if they are identify (using bedtools groupBy). E.g.
```
chr2L   7528    8116    CG11023:NM_001169365.1:exon_0,CG11023:NM_001272857.1:exon_0,CG11023:NM_175941.2:exon_0  0       +
```
hairpin.bed - BED6 file (typically) generated from mirBase GFF using the miRNA_primary_transcript entries
intron.bed - BED6 file similar to exon.bed, except the name format is [Gene ID]:[Transcript ID]:intron_[intron number]
miRNA.bed - BED6 file (typically) generated from mirBase GFF using the miRNA entries
piRNA_cluster.bed - BED6 file with the name format of cluster_[number]
structural_RNA.bed - BED6 file, where the name (column 4) is in the format of sncRNA:[type]:[specific type]:[copy]. E.g. sncRNA:rRNA:LSU-rRNA_Dme:LSU-rRNA_Dme_copy1. We typically get this from RepeatMasker output.

In the sequence subfolder, you need the following:

genome.fa - genomic sequence
genome.fa.fai - samtools faidx of the genomic FASTA
rDNA.fa - sequence of ribosomal DNA, either from SILVA or extracted from genomic sequences using information from the structural_RNA.bed file
rDNA.fa.fai - samtools faidx of the rDNA FASTA
tDNA.fa - sequence of tDNA, either obtained from GtRNAdb, extracted from genomic sequences using information from the structural_RNA.bed file. If doing the latter, please ensure that the sequence name is as follows: >[tRNA name]:[chromosome]:[start]-[end]:[strand]. This is one approach to generate this:
```
$ grep "tRNA" structural_RNA.bed | sed 's/sncRNA:tRNA://;s/:tRNA.(copy[0-9]*//' > tRNA.bed
$ fastaFromBed -s -name -fi genome.fa -bed tRNA.bed -fo tDNA.fa
$ sed -i '/>/s/::/:/; />/s/(/:/; />/s/)//;' tDNA.fa
```
tDNA.fa.fai - samtools faidx of the tDNA FASTA
bowtie_index subfolder - bowtie1 indices of all the FASTA (genomic, rDNA and tDNA)

Hope this is helpful.

Thanks.

olivertam commented 8 months ago

Hi,

I'm afraid the attachment didn't come through as you replied by email. Could you provide the error messages?

Thanks.

LIUXING-bio commented 8 months ago

2024-03-13 18:46:43,238 INFO Finding 3' tRF mappers...
Traceback (most recent call last):
  File "/opt/conda/envs/TEsmall/bin/TEsmall", line 33, in <module>
    sys.exit(load_entry_point('TEsmall==2.0.5', 'console_scripts', 'TEsmall')())
  File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/command_line.py", line 83, in main
    cca_anno, residual_bam = handle_cca(multi_bam, tbtidx, annot_dir)
  File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/trf_module.py", line 184, in handle_cca
    trnabam = fix_tRNA_mapped_coor(cca_bam)
  File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/trf_module.py", line 115, in fix_tRNA_mapped_coor
    coor = location[2].split('-')
IndexError: list index out of range

LIUXING-bio commented 8 months ago

genomes.zip

LIUXING-bio commented 8 months ago

I think the problem is the annotation and sequence of tRNA and rRNA. Can you correct it? Thank you

olivertam commented 8 months ago

Hi,

It looks like it's failing at the 3'-tRF part. Could you show me what files were generated by TEsmall, and what their sizes are?

$ ls -l

$ ls -l *.change_coor.sam

Thanks.

LIUXING-bio commented 8 months ago

29M Mar 13  21:02   488.3trf.bam
54M Mar 13  21:02   488.3trf_free.bam
100M    Mar 13  21:02   488.aligned.rinfo
61M Mar 13  21:02   488_cca.fa
782M    Mar 13  21:03   488.change_coor.sam
2.3K    Mar 13  20:53   488.cutadapt1.log
1.4K    Mar 13  20:54   488.cutadapt2.log
38K Mar 13  21:01   488.exceeded.fastq
91M Mar 13  21:01   488.genome.bam
389 Mar 13  21:01   488.genome.log
1.9M    Mar 13  21:02   488.header.txt
610M    Mar 13  21:00   488.rm_rRNA.fastq
125M    Mar 13  21:00   488.rRNA.bam
332 Mar 13  21:00   488.rRNA.log
1.3G    Mar 13  20:53   488.trimmed1.fastq
1.3G    Mar 13  20:54   488.trimmed2.fastq
30M Mar 13  21:02   488.tRNA.bam
1.9M    Mar 13  21:03   488.trna_for_intersect.sam
385 Mar 13  21:02   488.tRNA.log
25M Mar 13  21:02   488.unaligned.cca.fa
389M    Mar 13  21:01   488.unaligned.fastq

olivertam commented 8 months ago

Thank you for your patience.

Could you print the first 20 alignments of 488.change_corr.sam

$ awk '$1~!/^@/' 488.change_corr.sam | head -n 20

Thanks.

olivertam commented 8 months ago

Ah, you are right, it is an issue with the tDNA file. The header needs to be in the following format: >[tRNA name]:[chromosome]:[start]-[end]:[strand] This is one approach to generate this:

$ grep "tRNA" structural_RNA.bed | sed 's/sncRNA:tRNA://;s/:tRNA.*copy[0-9]*//' > tRNA.bed
$ fastaFromBed -s -name -fi genome.fa -bed tRNA.bed -fo tDNA.fa
$ sed -i '/>/s/::/:/; />/s/(/:/; />/s/)//;' tDNA.fa

You will need to regenerate the tDNA.fa.fai and the bowtie 1 indices, but they should be fast.

Please confirm that the sequence name now fits the format, and it should run properly.

Thanks for identifying the errors.

LIUXING-bio commented 8 months ago

Hi，Error reoccurred

2024-03-13 21:53:06,376 INFO Finding 3' tRF mappers...
Traceback (most recent call last):
  File "/opt/conda/envs/TEsmall/bin/TEsmall", line 33, in <module>
    sys.exit(load_entry_point('TEsmall==2.0.5', 'console_scripts', 'TEsmall')())
  File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/command_line.py", line 83, in main
    cca_anno, residual_bam = handle_cca(multi_bam, tbtidx, annot_dir)
  File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/trf_module.py", line 184, in handle_cca
    trnabam = fix_tRNA_mapped_coor(cca_bam)
  File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/trf_module.py", line 122, in fix_tRNA_mapped_coor
    newcoor = int(coor[0]) + int(read_line[3])  # seems like pybedtools takes 1 indexed bam and turns it to 0 index upon intersect check this
ValueError: invalid literal for int() with base 10: 'NW_006432945.1'

olivertam commented 8 months ago

Could you show me your tDNA.fa header name?

Thanks.

LIUXING-bio commented 8 months ago

>tRNA-Val-GTA:tRNA-Val-GTA_copy753:NW_006432945.1:52022-52097:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy754:NW_006432945.1:177358-177433:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy755:NW_006432945.1:1885284-1885358:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy756:NW_006432945.1:2065521-2065595:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy757:NW_006432945.1:2420160-2420235:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy758:NW_006432945.1:2684983-2685058:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy759:NW_006432945.1:3010771-3010846:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy760:NW_006432945.1:3334935-3335008:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy761:NW_006432945.1:3475017-3475091:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy762:NW_006432945.1:3554063-3554137:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy763:NW_006432945.1:3981280-3981354:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy764:NW_006432945.1:4279870-4279938:+
>tRNA-His-CAY_:tRNA-His-CAY__copy2:NW_006432945.1:4756088-4756124:+
>tRNA-Ile-ATT:tRNA-Ile-ATT_copy11:NW_006432945.1:5254281-5254320:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy765:NW_006432945.1:5454751-5454825:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy766:NW_006432945.1:6161634-6161709:+

olivertam commented 8 months ago

Hi,

There was a typo in the previous command:

$ grep "tRNA" structural_RNA.bed | sed 's/sncRNA:tRNA://;s/:tRNA.*copy[0-9]*//' > tRNA.bed

Please try this. Check that there are only four field delimited by semicolons : in the sequence name.

Thanks.

LIUXING-bio commented 8 months ago

Hi,how to set parameters if the adapter sequence of two input files is inconsistent?

Hi,

There was a typo in the previous command:

$ grep "tRNA" structural_RNA.bed | sed 's/sncRNA:tRNA://;s/:tRNA.copy[0-9]//'> tRNA.bed

Please try this. Check that there are only four field delimited by semicolons :.

Thanks.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

olivertam commented 8 months ago

Hi,

Do you mean this?

-a STR, --adapter STR
                        Sequence of an adapter that was ligated to the 3' end.
                        The adapter itself and anything that follows is
                        trimmed. (default: TGGAATTCTCGGGTGCCAAGG)

Thanks

LIUXING-bio commented 8 months ago

Hi，What I mean is, if the adapter sequence of the control group A and the experimental group B are not consistent, how should we handle it? -a the software seem to default that the adapter sequence between A and B must be consistent. Thanks

Hi,

Do you mean this?

-a STR, --adapter STR Sequence of an adapter that was ligated to the 3' end. The adapter itself and anything that follows is trimmed. (default: TGGAATTCTCGGGTGCCAAGG)

Thanks

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

olivertam commented 8 months ago

Hi,

You can run them separately, and join the corresponding output, count_summary.txt, together.

Thanks.

LIUXING-bio commented 5 months ago

Hi, May I ask if you can help me obtain the house mouse (C57B) annotation file：https://hgdownload.soe.ucsc.edu/hubs/GCA/921/999/865/GCA_921999865.2/ Thanks

olivertam commented 5 months ago

Hi,

I cannot find miRNA and piRNA specific for this genome build. You will need to rebuild the bowtie genome index with the C57BL6NJv3 genome FASTA, but you should be able to use the rDNA/tDNA indices without modification. The TE.bed and structural_RNA.bed files were generated from the repeatMasker output. The exon.bed and intron.bed files were generated from the xenoRefGene GTF file. The four files are tar-balled and could be found here.

Thanks.

mhammell-laboratory / TEsmall

How to obtain rDNA of other custom genomes? & Singularity/Docker? #17