Closed zhangaicen closed 2 years ago
Hi,
Thank you for your interest in the software. If the MSU v7.0 database includes a file that contains the location and identity of transposable elements (or repeats in general), you can either use a perl script (available here) to convert it into a GTF that is compatible with TEtranscripts, or we can work with you to generate the GTF. Unfortunately, at the time of this reply, I am unable to access the URL that you provided.
Thanks.
This is the usage information for the perl script:
Usage: makeTEgtf.pl -c [chrom column] -s [start column] -e [stop/end column]
-o [strand column] -n [source] -t [TE name column]
(-f [TE family column] -C [TE class column] -1)
[INFILE]
Output is printed to STDOUT
Required parameters:
-c [chrom column] - Column containing chromosome name
-s [start column] - Column containing feature start position
-e [stop/end column] - Column containing feature stop/end position
-o [strand column] - Column containing strand information (+ or -/C)
-t [TE name column] - Column containing TE name
[INFILE] - File name to be processed into GTF
Optional parameters:
-n [source] - Source of the TE information
(e.g. mm9_rmsk for RepeatMasker track from
mm9 mouse genome)
Defaults to "user-provided" if not specified
-f [TE family column] - Column containing TE family name.
Defaults to TE name if not specified
-C [TE class column] - Column containing TE class name.
Defaults to TE family name if not specified
-S [score column] - Column containing the score of the TE prediction
(e.g. score from RepeatMasker)
-1 - Input coordinates uses 1-based indexing
This should be used if the input file uses
1-based coordinates. This should be invoked
if the genomic coordinates are obtained from
a GFF3, GTF, SAM or VCF file
Default: off if using BED, BAM or UCSC rmsk
input files
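To make the coordinate convention behind the -1 flag concrete, here is a minimal Python sketch of the kind of conversion described above. It is not the perl script itself: the function name `bed_row_to_gtf`, the six-column BED layout, and the exact attribute string are assumptions made for illustration only.

```python
def bed_row_to_gtf(fields, source="user-provided", one_based=False):
    """Convert one BED-like row to a GTF line (illustrative sketch only).

    Assumes columns: chrom, start, end, TE name, score, strand.
    """
    chrom, start, end, name, score, strand = fields[:6]
    start = int(start)
    if not one_based:
        # BED starts are 0-based; GTF starts are 1-based, so shift by one.
        start += 1
    # Family and class fall back to the TE name, mirroring the defaults
    # described in the usage text. The attribute layout is an assumption.
    attrs = (
        f'gene_id "{name}"; transcript_id "{name}"; '
        f'family_id "{name}"; class_id "{name}";'
    )
    return "\t".join(
        [chrom, source, "exon", str(start), end, score, strand, ".", attrs]
    )


# Hypothetical BED row, not taken from the MSU v7.0 annotation.
example = "Chr1\t999\t1500\tDNA/Helitron\t0\t+".split("\t")
gtf_line = bed_row_to_gtf(example)
print(gtf_line)
```

If the input were already 1-based (for example, taken from a GFF3), passing `one_based=True` would leave the start coordinate unchanged, which is the situation the -1 flag covers.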
Hi Oliver,
thanks for your quick response. I had the TE.bed file as below:
and using makeTEgtf.pl I've converted it into a GTF file:
So is it a suitable GTF file for further analysis?
best, Thanks.
Hi,
It looks quite reasonable to me. The only thing that might be annoying is that your annotations will read as follows:
DNA/Helitron:DNA/Helitron:DNA/Helitron
This is because the program concatenates the gene, family and class ID together. This will not break the program, but the names are just long.
Otherwise, I think it should be good to use.
Let me know if you encounter any more issues.
Thanks.
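The long names mentioned in the reply above can be illustrated with a small Python sketch. The helper `combined_label` is hypothetical and only mimics the described behaviour: the family defaults to the TE name, the class defaults to the family, and the three are joined with colons.

```python
def combined_label(te_name, family=None, te_class=None):
    """Join TE name, family and class with ':' (illustrative helper)."""
    family = family or te_name      # family defaults to the TE name
    te_class = te_class or family   # class defaults to the family name
    return ":".join([te_name, family, te_class])


label = combined_label("DNA/Helitron")
print(label)  # DNA/Helitron:DNA/Helitron:DNA/Helitron
```

Supplying separate family and class columns to the conversion (the -f and -C options) would give three different components instead of the same string three times.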
OK, thanks Oliver. I may bother you again if I run into any other problems in the future.
Thanks and best wishes.
Hi Oliver, I'm trying to quantify rice TE expression using BAM files from HISAT2 (run with --no-mixed). As a newcomer to TE analysis, I still have some questions:
1. To better understand the software, I tried both TEtranscripts and TElocal. The software can help to build the index, but TElocal reports the error "TE annotation file needs to be a TElocal index, which will end in .locInd". To save time on every run, is there a way to prepare the .ind and .locInd index files in advance?
2. In the reference "DNA hypomethylation in tetraploid rice potentiates stress-responsive gene expression for salt tolerance", the authors display TE expression between two samples processed by TEtranscripts using a boxplot. Which TEtranscripts output can represent TE expression in the same way as gene expression (using FPKM, CPM, ...)?
3. In my log of transcripts, I saw that very few reads were annotated as TE reads. Is that reasonable, or is something wrong with my workflow?
Thanks for your help.
Hi,
1. TElocal uses a different index format (.locInd) than TEtranscripts (.ind). We can try to build the indices for you, but will need the TE GTF that you generated.
2. The count table (.cntTable) can be put through differential analysis pipelines (e.g. DESeq2, which is called by TEtranscripts). You can then generate normalized counts and/or variance-stabilized counts, one of which I'm assuming was used for the plot. However, I would refer to their methods for the precise details.
Thanks.
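As a rough illustration of turning raw counts into something comparable across samples, the Python sketch below computes plain counts-per-million (CPM) from a count table. This is a simpler stand-in, not the DESeq2 normalization mentioned above, and the table layout (tab-separated, one header row, feature name in the first column) as well as the sample names and numbers are assumptions.

```python
import csv
import io


def counts_to_cpm(handle):
    """Return (sample_names, {feature: [cpm per sample]}) from a count table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [(r[0], [float(x) for x in r[1:]]) for r in reader if r]
    n_samples = len(header) - 1
    # Library size per sample = sum of all counts in that column.
    totals = [sum(vals[i] for _, vals in rows) for i in range(n_samples)]
    cpm = {
        name: [v / t * 1e6 if t else 0.0 for v, t in zip(vals, totals)]
        for name, vals in rows
    }
    return header[1:], cpm


# Made-up demo table; real tables contain genes and TEs together.
demo = io.StringIO(
    "gene/TE\tsampleA\tsampleB\n"
    "GeneX\t900\t400\n"
    "DNA/Helitron:DNA/Helitron:DNA/Helitron\t100\t600\n"
)
samples, cpm = counts_to_cpm(demo)
print(samples, cpm)
```

For a figure like the one in the cited paper, the normalization actually used should be taken from that paper's methods rather than from this sketch.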
Hi,
Hi,
[...] (gene_id), and so you end up with 159 distinct TE "subfamilies". It is not unusual to have zero counts for some subfamilies if they are not active, but this is also dependent on the completeness of the TE annotation, and whether you allowed for sufficient multimappers during alignments (we have typically found up to 100 multimappers to be useful for Drosophila and mammals).
Please let me know if you have other questions. Thanks.
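To check how many distinct TE names a GTF will yield, one option is to count the unique gene_id values in its attribute column. The Python sketch below does this for a few made-up lines; the function name and the demo records are hypothetical, and it assumes the usual `gene_id "..."` attribute syntax.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def distinct_gene_ids(gtf_lines):
    """Collect the set of gene_id values found in GTF lines."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        attrs = line.rstrip("\n").split("\t")[-1]
        match = GENE_ID.search(attrs)
        if match:
            ids.add(match.group(1))
    return ids


# Made-up records for illustration only.
demo_gtf = [
    'Chr1\tuser-provided\texon\t1000\t1500\t0\t+\t.\tgene_id "DNA/Helitron"; transcript_id "a";',
    'Chr1\tuser-provided\texon\t3000\t3400\t0\t-\t.\tgene_id "DNA/Helitron"; transcript_id "b";',
    'Chr2\tuser-provided\texon\t200\t900\t0\t+\t.\tgene_id "LTR/Gypsy"; transcript_id "c";',
]
ids = distinct_gene_ids(demo_gtf)
print(len(ids))
```

On a real file, `distinct_gene_ids(open("TE.gtf"))` would work the same way, where "TE.gtf" is a placeholder path.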
Hello Oliver, I'd like to use TEtranscripts to analyze TEs in rice (Oryza sativa japonica). However, I don't know how to get the TE GTF and index files from the given website, as most of the files there are for animals. My reference genome is the Nipponbare reference genome (MSU v7.0, [http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/]).
So could you tell me how I can get the TE GTF file?
Thanks very much for your time.