Open LIUXING-bio opened 1 year ago
Thanks. In addition, I found that I cannot download this software. What is the problem? Is there an installation package available besides git clone?
Hi,
Which software are you having trouble downloading? I am able to download TEsmall from github without issue.
Thanks.
Hi, is there a way to install the software using singularity pull?
Hi,
Unfortunately, I'm not an expert on Docker/Singularity, but I tried making a Docker container of TEsmall:
$ docker pull mhammelllab/tesmall
Let me know if it works, though I can't guarantee that I can fix the issues if they pop up.
Thanks.
Sorry, it does not seem to work very well. I would like to ask whether TEsmall could be made available through Singularity, just like TEtranscripts. This would encourage more people to use it, because many problems come up during installation and operation. Thank you.
$ python setup.py install
running install
/root/miniconda3/envs/TEsmall/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
setuptools.SetuptoolsDeprecationWarning,
/root/miniconda3/envs/TEsmall/lib/python3.7/site-packages/setuptools/command/easy_install.py:159: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
EasyInstallDeprecationWarning,
running bdist_egg
running egg_info
creating TEsmall.egg-info
writing TEsmall.egg-info/PKG-INFO
writing dependency_links to TEsmall.egg-info/dependency_links.txt
writing entry points to TEsmall.egg-info/entry_points.txt
writing requirements to TEsmall.egg-info/requires.txt
writing top-level names to TEsmall.egg-info/top_level.txt
writing manifest file 'TEsmall.egg-info/SOURCES.txt'
reading manifest file 'TEsmall.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching 'environment.txt'
Hi,
Let me look into it further.
Thanks.
Hi,
This is currently the best attempt:
$ singularity pull docker://mhammelllab/tesmall:latest
$ singularity exec tesmall_latest.sif TEsmall -h
usage: TEsmall [-h] [-a STR] [-m INT] [-M INT] [-g STR] [--maxaln INT]
[--mismatch INT] [-o STR [STR ...]] [-p INT] [-f STR [STR ...]]
[-l STR [STR ...]] [--dbfolder STR] [--verbose INT] [-v]
optional arguments:
-h, --help show this help message and exit
-a STR, --adapter STR
Sequence of an adapter that was ligated to the 3' end.
The adapter itself and anything that follows is
trimmed. (default: TGGAATTCTCGGGTGCCAAGG)
-m INT, --minlen INT Discard trimmed reads that are shorter than INT. Reads
that are too short even before adapter removal are
also discarded. (default: 16)
-M INT, --maxlen INT Discard trimmed reads that are longer than INT. Reads
that are too long even before adapter removal are also
discarded. (default: 36)
-g STR, --genome STR Version of reference genome. default: hg38)
--maxaln INT Suppress all alignments for a particular read if more
than INT reportable alignments exist for it. (default:
100)
--mismatch INT Report alignments with at most INT mismatches.
(default: 0)
-o STR [STR ...], --order STR [STR ...]
Annotation priority. (default: structural_RNA miRNA
hairpin exon TE intron piRNA_cluster)
-p INT, --parallel INT
Parallel execute by INT CPUs. (default: 1)
-f STR [STR ...], --fastq STR [STR ...]
Input in FASTQ format. Compressed input is supported
and auto-detected from the filename extension (.gz).
-l STR [STR ...], --label STR [STR ...]
Unique label for each sample.
--dbfolder STR Custom location of TEsmall database folder (containing
the "genomes" folder). Defaults to $HOME/TEsmall_db/
--verbose INT Set verbose level. 0: only show critical message, 1:
show additional warning message, 2: show process
information, 3: show debug messages. DEFAULT:2
-v, --version show program's version number and exit
Please feel free to try it, though I cannot guarantee that it will have no errors.
Thanks.
Hi, I would like to know whether this message is expected ([E::idx_find_and_load] Could not retrieve index file for '*.bam'):
(base) lab:TEsmall $ singularity exec ../../biosoft/tesmall_latest.sif TEsmall -f ../01siRNA/717-719/717/SRR4896717.fastq ../01siRNA/717-719/719/SRR4896719.fastq -l 717 719 --dbfolder TEsmall_db/
2023-10-25 16:00:39,792 INFO Checking if reference genome and annotation files exist...
2023-10-25 16:00:39,792 INFO Genome and annotation files present
2023-10-25 16:00:39,792 INFO Trimming 3' adapters...
Done 00:04:21 25,339,801 reads @ 10.3 µs/read; 5.81 M reads/minute
2023-10-25 16:05:01,534 INFO Trimming 5' adapters...
Done 00:02:34 24,746,143 reads @ 6.2 µs/read; 9.63 M reads/minute
2023-10-25 16:07:35,891 INFO Removing rRNA-derived reads...
2023-10-25 16:19:41,667 INFO Aligning reads to reference sequences...
[E::idx_find_and_load] Could not retrieve index file for '717.genome.bam'
2023-10-25 16:26:12,941 INFO Aligning CCA trimmed reads to tRNA sequences...
2023-10-25 16:26:28,934 INFO Finding 3' tRF mappers...
2023-10-25 16:28:59,852 INFO Assigning 3' tRFs to transposable elements...
[E::idx_find_and_load] Could not retrieve index file for '717.trna_for_intersect.bam'
2023-10-25 16:29:48,917 INFO Assigning 3' tRFs to transposable elements...
[E::idx_find_and_load] Could not retrieve index file for '717.3trf.bam'
2023-10-25 16:30:02,012 INFO Assigning reads to genomic features...
[E::idx_find_and_load] Could not retrieve index file for '717.3trf_free.bam'
[E::idx_find_and_load] Could not retrieve index file for '717.3trf_free.bam'
[E::idx_find_and_load] Could not retrieve index file for '717.3trf_free.bam'
[E::idx_find_and_load] Could not retrieve index file for '717.3trf_free.bam'
[E::idx_find_and_load] Could not retrieve index file for '717.3trf_free.bam
Hi,
Thank you for your feedback. I was going to suggest (from your previous comments) to use the --dbfolder parameter, but I noticed that you did.
The "[E::idx_find_and_load] Could not retrieve index file for '*.bam'" warning is a known quirk of pysam (see #11 and #15 for other reports of this). It has no impact on the running of the tool, so if those are the only errors, then the tool should run fine.
Thanks for testing the singularity container.
Thank you for your help. The purpose of running this software is to compare the differences in siRNA annotated to TE elements in two samples. Can this be obtained by modifying a certain parameter?
You should be able to obtain the small RNAs annotated to TE elements from the count summary file. You can then try to run differential analysis to see if any TE elements show altered small RNA counts. Please be aware that since small RNAs could change en masse in an experiment, this might break the assumptions used by differential analysis algorithms designed for RNA-seq.
If you are specifically looking at siRNA, you might need to dig into the dataset more, as you might want to restrict the read length of interest (you should be able to see the read length distribution for each annotation type in [prefix].anno.rlen.info). You might then want to process the [prefix].anno file, which contains the annotation of each read (as well as the rlen), keep only the reads that meet your criteria, summarize them for each library, and then combine the two (or more) libraries into a final count table. You can then try to do differential analysis.
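The filtering step described above could be sketched roughly as follows. This is only an illustration: it assumes the [prefix].anno file is tab-delimited with a header row, and the column names used here (ftype, fid, rlen) as well as the 20-24 nt length window are assumptions, not something confirmed in this thread. Please check them against the header of your own file.

```python
import csv
import io
from collections import Counter


def count_te_reads(anno_handle, min_len=20, max_len=24,
                   type_col="ftype", id_col="fid", len_col="rlen"):
    """Count reads per TE whose length lies within [min_len, max_len].

    The column names and the length window are assumptions for this sketch;
    adjust them to match the header of your .anno file and your siRNA size range.
    """
    counts = Counter()
    reader = csv.DictReader(anno_handle, delimiter="\t")
    for row in reader:
        # Keep only reads annotated as TE
        if row[type_col] != "TE":
            continue
        # Keep only reads in the read-length window of interest
        if min_len <= int(row[len_col]) <= max_len:
            counts[row[id_col]] += 1
    return counts


# Made-up records, only to show the expected shape of the input
example = io.StringIO(
    "rid\tftype\tfid\trlen\n"
    "read1\tTE\tLTR:Gypsy:IDEFIX_LTR\t21\n"
    "read2\tTE\tLTR:Gypsy:IDEFIX_LTR\t30\n"
    "read3\tmiRNA\tmir-1\t22\n"
)
te_counts = count_te_reads(example)
```

Running this once per library and combining the resulting counters would give the per-TE count table to feed into a differential analysis.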
Hope this is helpful.
Thanks.
May I ask if you can help me obtain the bat's annotation file:https://hgdownload.soe.ucsc.edu/hubs/GCF/000/325/575/GCF_000325575.1/ Thanks
Closed #17 as completed.
Hi,
I was able to generate the following:
TE.bed structural_RNA.bed exons.bed introns.bed
They can be downloaded from here.
You will still need these files (they can be empty if no annotations exist):
hairpin.bed miRNA.bed piRNA_cluster.bed
Thanks.
Thank you very much.
Hi,
I just found a few errors in the exon and intron BED files. Please replace those with the ones attached here: exon.bed.gz intron.bed.gz
Thanks.
Sorry, I can not open the download link you provided.
If possible, please send it to me the same way as the exon and intron files. Thanks.
Hi,
You should be able to download it now. Unfortunately, the files are too big to attach to GitHub.
Thanks.
Thank you, this software brings a lot of convenience, but changing the format seems troublesome. Are there any relevant scripts or commands provided to facilitate the use of more genomes in the future?
Hi,
Due to the huge variability in the types of annotations available, the best I can do is to describe the formats of the various files.
The custom genome should have two subfolders: annotation and sequence.
In the annotation subfolder, these files should be present (empty if no annotation exists for the genome):
TE.bed - BED6 file, where the name (column 4) is in the format of [Class]:[Family]:[Element]:[Instance]. E.g. LTR:Gypsy:IDEFIX_LTR:IDEFIX_LTR_copy1. We typically get this from RepeatMasker output.
exon.bed - BED6 file, where the name (column 4) is in the format of [Gene ID]:[Transcript ID]:exon_[exon number]. E.g. CG11023:NM_175941.2:exon_1. We collapse exons from multiple transcripts if they are identical (using bedtools groupBy). E.g.
chr2L 7528 8116 CG11023:NM_001169365.1:exon_0,CG11023:NM_001272857.1:exon_0,CG11023:NM_175941.2:exon_0 0 +
hairpin.bed - BED6 file (typically) generated from mirBase GFF using the miRNA_primary_transcript entries.
intron.bed - BED6 file similar to exon.bed, except the name format is [Gene ID]:[Transcript ID]:intron_[intron number].
miRNA.bed - BED6 file (typically) generated from mirBase GFF using the miRNA entries.
piRNA_cluster.bed - BED6 file with the name format of cluster_[number].
structural_RNA.bed - BED6 file, where the name (column 4) is in the format of sncRNA:[type]:[specific type]:[copy]. E.g. sncRNA:rRNA:LSU-rRNA_Dme:LSU-rRNA_Dme_copy1. We typically get this from RepeatMasker output.
In the sequence subfolder, you need the following:
genome.fa - genomic sequence
genome.fa.fai - samtools faidx of the genomic FASTA
rDNA.fa - sequence of ribosomal DNA, either from SILVA or extracted from genomic sequences using information from the structural_RNA.bed file
rDNA.fa.fai - samtools faidx of the rDNA FASTA
tDNA.fa - sequence of tDNA, either obtained from GtRNAdb or extracted from genomic sequences using information from the structural_RNA.bed file. If doing the latter, please ensure that the sequence name is as follows: >[tRNA name]:[chromosome]:[start]-[end]:[strand]. This is one approach to generate this:
$ grep "tRNA" structural_RNA.bed | sed 's/sncRNA:tRNA://;s/:tRNA.*copy[0-9]*//' > tRNA.bed
$ fastaFromBed -s -name -fi genome.fa -bed tRNA.bed -fo tDNA.fa
$ sed -i '/>/s/::/:/; />/s/(/:/; />/s/)//;' tDNA.fa
tDNA.fa.fai - samtools faidx of the tDNA FASTA
bowtie_index subfolder - bowtie1 indices of all the FASTA files (genomic, rDNA and tDNA)
Hope this is helpful.
Thanks.
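As a quick sanity check before running TEsmall on a custom genome, the layout described above could be verified with a small script. This is a minimal sketch: the function name missing_genome_files is hypothetical, the file names are taken from the list above, and the bowtie_index folder is assumed to sit inside the sequence subfolder.

```python
from pathlib import Path

# File names as listed in the description of the custom genome folder
ANNOTATION_FILES = [
    "TE.bed", "exon.bed", "hairpin.bed", "intron.bed",
    "miRNA.bed", "piRNA_cluster.bed", "structural_RNA.bed",
]
SEQUENCE_FILES = [
    "genome.fa", "genome.fa.fai",
    "rDNA.fa", "rDNA.fa.fai",
    "tDNA.fa", "tDNA.fa.fai",
]


def missing_genome_files(genome_dir):
    """Return the relative paths that are expected but not found."""
    root = Path(genome_dir)
    missing = []
    for name in ANNOTATION_FILES:
        # Empty files are acceptable, so only existence is checked
        if not (root / "annotation" / name).is_file():
            missing.append(f"annotation/{name}")
    for name in SEQUENCE_FILES:
        if not (root / "sequence" / name).is_file():
            missing.append(f"sequence/{name}")
    # Assumption: the bowtie1 indices live in sequence/bowtie_index
    if not (root / "sequence" / "bowtie_index").is_dir():
        missing.append("sequence/bowtie_index/")
    return missing
```

For example, print(missing_genome_files("path/to/your/genome_folder")) should print an empty list when everything is in place (the path here is only a placeholder).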
Hi,
I'm afraid the attachment didn't come through as you replied by email. Could you provide the error messages?
Thanks.
2024-03-13 18:46:43,238 INFO Finding 3' tRF mappers...
Traceback (most recent call last):
File "/opt/conda/envs/TEsmall/bin/TEsmall", line 33, in <module>
sys.exit(load_entry_point('TEsmall==2.0.5', 'console_scripts', 'TEsmall')())
File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/command_line.py", line 83, in main
cca_anno, residual_bam = handle_cca(multi_bam, tbtidx, annot_dir)
File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/trf_module.py", line 184, in handle_cca
trnabam = fix_tRNA_mapped_coor(cca_bam)
File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/trf_module.py", line 115, in fix_tRNA_mapped_coor
coor = location[2].split('-')
IndexError: list index out of range
I think the problem is the annotation and sequence of tRNA and rRNA. Can you correct it? Thank you
Hi,
It looks like it's failing at the 3'-tRF part. Could you show me what files were generated by TEsmall, and what their sizes are?
$ ls -l
$ ls -l *.change_coor.sam
Thanks.
29M Mar 13 21:02 488.3trf.bam
54M Mar 13 21:02 488.3trf_free.bam
100M Mar 13 21:02 488.aligned.rinfo
61M Mar 13 21:02 488_cca.fa
782M Mar 13 21:03 488.change_coor.sam
2.3K Mar 13 20:53 488.cutadapt1.log
1.4K Mar 13 20:54 488.cutadapt2.log
38K Mar 13 21:01 488.exceeded.fastq
91M Mar 13 21:01 488.genome.bam
389 Mar 13 21:01 488.genome.log
1.9M Mar 13 21:02 488.header.txt
610M Mar 13 21:00 488.rm_rRNA.fastq
125M Mar 13 21:00 488.rRNA.bam
332 Mar 13 21:00 488.rRNA.log
1.3G Mar 13 20:53 488.trimmed1.fastq
1.3G Mar 13 20:54 488.trimmed2.fastq
30M Mar 13 21:02 488.tRNA.bam
1.9M Mar 13 21:03 488.trna_for_intersect.sam
385 Mar 13 21:02 488.tRNA.log
25M Mar 13 21:02 488.unaligned.cca.fa
389M Mar 13 21:01 488.unaligned.fastq
Thank you for your patience.
Could you print the first 20 alignments of 488.change_coor.sam?
$ awk '$1 !~ /^@/' 488.change_coor.sam | head -n 20
Thanks.
Ah, you are right, it is an issue with the tDNA file.
The header needs to be in the following format: >[tRNA name]:[chromosome]:[start]-[end]:[strand]
This is one approach to generate this:
$ grep "tRNA" structural_RNA.bed | sed 's/sncRNA:tRNA://;s/:tRNA.*copy[0-9]*//' > tRNA.bed
$ fastaFromBed -s -name -fi genome.fa -bed tRNA.bed -fo tDNA.fa
$ sed -i '/>/s/::/:/; />/s/(/:/; />/s/)//;' tDNA.fa
You will need to regenerate the tDNA.fa.fai and the bowtie 1 indices, but they should be fast.
Please confirm that the sequence name now fits the format, and it should run properly.
Thanks for identifying the errors.
Hi, the error occurred again:
2024-03-13 21:53:06,376 INFO Finding 3' tRF mappers...
Traceback (most recent call last):
File "/opt/conda/envs/TEsmall/bin/TEsmall", line 33, in <module>
sys.exit(load_entry_point('TEsmall==2.0.5', 'console_scripts', 'TEsmall')())
File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/command_line.py", line 83, in main
cca_anno, residual_bam = handle_cca(multi_bam, tbtidx, annot_dir)
File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/trf_module.py", line 184, in handle_cca
trnabam = fix_tRNA_mapped_coor(cca_bam)
File "/opt/conda/envs/TEsmall/lib/python3.7/site-packages/TEsmall-2.0.5-py3.7.egg/TEsmall/trf_module.py", line 122, in fix_tRNA_mapped_coor
newcoor = int(coor[0]) + int(read_line[3]) # seems like pybedtools takes 1 indexed bam and turns it to 0 index upon intersect check this
ValueError: invalid literal for int() with base 10: 'NW_006432945.1'
Could you show me your tDNA.fa header name?
Thanks.
>tRNA-Val-GTA:tRNA-Val-GTA_copy753:NW_006432945.1:52022-52097:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy754:NW_006432945.1:177358-177433:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy755:NW_006432945.1:1885284-1885358:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy756:NW_006432945.1:2065521-2065595:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy757:NW_006432945.1:2420160-2420235:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy758:NW_006432945.1:2684983-2685058:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy759:NW_006432945.1:3010771-3010846:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy760:NW_006432945.1:3334935-3335008:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy761:NW_006432945.1:3475017-3475091:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy762:NW_006432945.1:3554063-3554137:+
>tRNA-Val-GTA:tRNA-Val-GTA_copy763:NW_006432945.1:3981280-3981354:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy764:NW_006432945.1:4279870-4279938:+
>tRNA-His-CAY_:tRNA-His-CAY__copy2:NW_006432945.1:4756088-4756124:+
>tRNA-Ile-ATT:tRNA-Ile-ATT_copy11:NW_006432945.1:5254281-5254320:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy765:NW_006432945.1:5454751-5454825:-
>tRNA-Val-GTA:tRNA-Val-GTA_copy766:NW_006432945.1:6161634-6161709:+
Hi,
There was a typo in the previous command:
$ grep "tRNA" structural_RNA.bed | sed 's/sncRNA:tRNA://;s/:tRNA.*copy[0-9]*//' > tRNA.bed
Please try this.
Check that there are only four fields delimited by colons (:) in the sequence name.
Thanks.
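The header check could also be automated. Below is a minimal sketch, assuming the required format is exactly >[tRNA name]:[chromosome]:[start]-[end]:[strand] with no colons inside the tRNA name or the chromosome name; the function name bad_tdna_headers is hypothetical.

```python
import re

# >[tRNA name]:[chromosome]:[start]-[end]:[strand]
# Assumption: neither the tRNA name nor the chromosome name contains ':' or whitespace
HEADER_RE = re.compile(r">[^:\s]+:[^:\s]+:\d+-\d+:[+-]")


def bad_tdna_headers(lines):
    """Return the FASTA header lines that do not have the four expected fields."""
    bad = []
    for line in lines:
        header = line.rstrip("\n")
        if header.startswith(">") and not HEADER_RE.fullmatch(header):
            bad.append(header)
    return bad


# The second header has an extra field, like the ones reported above
example = [
    ">tRNA-Val-GTA:NW_006432945.1:52022-52097:+\n",
    ">tRNA-Val-GTA:tRNA-Val-GTA_copy753:NW_006432945.1:52022-52097:+\n",
]
problems = bad_tdna_headers(example)
```

To check a real file, something like with open("tDNA.fa") as fh: print(bad_tdna_headers(fh)) should print an empty list once the headers are fixed.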
Hi, how should I set the parameters if the two input files have different adapter sequences?
Hi,
Do you mean this?
-a STR, --adapter STR
Sequence of an adapter that was ligated to the 3' end.
The adapter itself and anything that follows is
trimmed. (default: TGGAATTCTCGGGTGCCAAGG)
Thanks
Hi, what I mean is: if the adapter sequences of control group A and experimental group B are different, how should we handle it? The -a option seems to assume that A and B share the same adapter sequence. Thanks
Hi,
You can run them separately, and join the corresponding output, count_summary.txt, together.
Thanks.
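Joining the outputs of separate runs could look roughly like this. It is a sketch under assumptions: each count_summary.txt is taken to be tab-delimited, with a header row, the feature name in the first column and one count column per sample. The function name merge_count_tables is hypothetical, and a feature missing from one run is filled in as 0.

```python
import csv
import io


def merge_count_tables(handles):
    """Merge per-run count tables into one header and a list of rows.

    Assumed layout (not confirmed in this thread): tab-delimited, header row,
    feature name in column 1, one count column per sample.
    """
    merged = {}    # feature -> {sample label: count string}
    samples = []   # sample labels in the order they were read
    for handle in handles:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        labels = header[1:]
        samples.extend(labels)
        for row in reader:
            if not row:
                continue
            for label, value in zip(labels, row[1:]):
                merged.setdefault(row[0], {})[label] = value
    rows = [
        # Features absent from a run are assumed to have zero counts
        [feature] + [merged[feature].get(label, "0") for label in samples]
        for feature in sorted(merged)
    ]
    return ["feature"] + samples, rows


# Made-up tables standing in for two separate TEsmall runs
run_a = io.StringIO("id\tA\nTE1\t5\nTE2\t3\n")
run_b = io.StringIO("id\tB\nTE2\t7\nTE3\t1\n")
header, rows = merge_count_tables([run_a, run_b])
```

In practice the two handles would be the count_summary.txt files opened from each run's output folder.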
Hi, May I ask if you can help me obtain the house mouse (C57B) annotation file:https://hgdownload.soe.ucsc.edu/hubs/GCA/921/999/865/GCA_921999865.2/ Thanks
Hi,
I cannot find miRNA and piRNA specific for this genome build.
You will need to rebuild the bowtie genome index with the C57BL6NJv3 genome FASTA, but you should be able to use the rDNA/tDNA indices without modification.
The TE.bed and structural_RNA.bed files were generated from the RepeatMasker output.
The exon.bed and intron.bed files were generated from the xenoRefGene GTF file.
The four files are tar-balled and can be found here.
Thanks.
Hi,
Thank you for your interest in the software. We obtain the coordinates of ribosomal DNA in the genome of interest from the output of RepeatMasker (or another repetitive sequence identifier). We then obtain the FASTA sequence from the coordinates (using tools like bedtools). If you do not have any ribosomal DNA coordinates, you can try taking an rRNA database (e.g. SILVA), and either use their sequences as is, or align those sequences to your genome sequences (perhaps with some mismatches) to identify which ones are actually present in your genome of interest.
Please let me know if this does not address your question.
Thanks.
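For the first route (coordinates from RepeatMasker), the rRNA entries could be pulled out of structural_RNA.bed with a few lines of Python. This is a minimal sketch, assuming the name column follows the sncRNA:[type]:[specific type]:[copy] format described earlier in this thread; the function name rrna_bed_lines is hypothetical.

```python
def rrna_bed_lines(bed_lines):
    """Yield the BED lines whose name (column 4) starts with sncRNA:rRNA:."""
    for line in bed_lines:
        record = line.rstrip("\n")
        fields = record.split("\t")
        # Skip malformed or short lines instead of failing on them
        if len(fields) >= 4 and fields[3].startswith("sncRNA:rRNA:"):
            yield record


# Made-up records, only to show the expected input format
example = [
    "chr2L\t100\t250\tsncRNA:rRNA:LSU-rRNA_Dme:LSU-rRNA_Dme_copy1\t0\t+\n",
    "chr2L\t900\t975\tsncRNA:tRNA:tRNA-Val-GTA:tRNA-Val-GTA_copy1\t0\t-\n",
]
rrna_records = list(rrna_bed_lines(example))
```

Writing the selected lines to a file such as rRNA.bed and passing it to fastaFromBed together with genome.fa (as in the tDNA commands earlier in this thread) would then give the rDNA FASTA.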