Closed shaymin84 closed 6 months ago
Hi,
Thank you for your interest in the software. The short answer to all your questions is that TEtranscripts operates on genomic alignment, rather than pseudoalignment to a transcript database. You will need to align to the genome for the tool to work.
# Example command on generating STAR index
$ STAR --runMode genomeGenerate \
--limitGenomeGenerateRAM 39000000000 \
--runThreadN 10 \
--genomeDir mm39_STAR_knownGenes \
--genomeFastaFiles mm39.fa \
--sjdbGTFfile mm39.knownGene.gtf \
--sjdbOverhang 150
$ STAR --genomeDir mm39_STAR_knownGenes \ --readFilesIn R1.fastq.gz R2.fastq.gz --readFilesCommand zcat \ --outFilterMultimapNmax 100 --winAnchorMultimapNmax 150 \ --runThreadN 10 --genomeLoad NoSharedMemory \ --outSAMattributes Standard --outFilterType BySJout \ --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within KeepPairs \ --limitBAMsortRAM 39000000000 \ --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 \ --outSAMstrandField intronMotif --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 --alignIntronMin 20 --sjdbScore 1 \ --alignIntronMax 1000000 --alignMatesGapMax 1000000
2. You definitely need to use the same genome build, as TEtranscripts compares the genomic position of your read to the location of the gene/TE. Different versions of the mouse genome are not directly comparable. There are mm39 [TE]() and [knownGene](https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/genes/mm39.knownGene.gtf.gz) GTF that could be used instead.
Hope this is helpful.
Hi
Thank you so much for the prompt response! I've done some research online and read through the discussion in closed threads here to learn more. I've only done alignment to transcriptome (not the whole genome) before, and only did it with bowtie2. Seems like it's more complicated than I thought. My main questions now are 2:
I've figured out how to download mm10 genome and index it. But from what I saw online, bowtie2 seems to not take into consideration the introns, so many reads would not be mapped. Does this make bowtie2 unusable for DNA genome alignment? Im trying using STAR as you suggest now. But when I ran the code you suggested to index mm39 for STAR, is showed an error std::bad_alloc. The cluster Im using should still have 4T of memory. Do you know why it shows the memory is running out while it clearly isn't?
in both TEcounts and TEtranscripts --GTF (for the coding genes) is a required argument. If I only care about transposon elements (protein coding genes had been carried out by Salmon), could I just put a random file in --GTF, ignore the results for all protein coding genes, and only see the TE results?
Hi,
Happy to answer any more questions and help with troubleshooting.
Hi
Thank you so much again for such a prompt response! That is very helpful.
When I tried to map with Bowtie2, turns out ~35% of reads are mapped once, ~30% mapped more than once. ~30% didn't map at all. Does this number look too low for you, thus not worth proceeding with TEtrancripts?
and yes, the problem was I didn't request for memory from the cluster. I just figured how to do this and it seems fine now.
Sorry again for these questions, Im quite new to this with very little training.
Hi,
The Bowtie2 results are on the low side, so I would probably map with STAR and see if it's similar or higher than Bowtie2. And I'm glad to hear that the indexing is now working. It might take a while, but should not take more than 12 hours.
Thanks.
Hi
the index is working and successfully done. I tried to run the second part of the code you suggested (Example command on running STAR for TEtranscripts) to do the alignment, but if failed with "fatal error".
The slurm file states "ReadAlignChunk_processChunks.cpp:204:processChunks EXITING because of FATAL ERROR in input reads: wrong read ID line format: the read ID lines should start with @ or > " "SOLUTION: verify and correct the input read files" "FATAL ERROR, exiting"
The input I gave were normal fa.gz files I got from sequencing, 2 files since it was pair-end sequencing. Do you know what parameters may be wrong?
----Edit---- I figured it may be due to the fasta file compression issue. I added --readFilesCommand zcat and it seems to work now
Hi,
Yes, you need the --readFilesCommand zcat
for fa.gz
files.
Hopefully it works now.
Thanks.
Hi,
Can TEtranscript be applied for locus-wise TE analysis? I modified the GTF file to make the loci distinct and used it with the tool, but it is taking very long time in calculating the index.
Hi,
Yes. It is in essence functioning as TEcount
in the TEtranscript software, except it does locus-level quantification.
You can run TElocal
for each sample in parallel on the cluster, and then join the output (e.g. with something like multijoin.pl
). You will have to make sure that the --project
parameter is used, or different runs will overwrite the output files.
Here's an example of how you could batch process the BAM files:
$ cat qsub_TElocal.sh
if [ -z "$1" ]; then
echo "Usage: $0 [BAM files ...]" >&2
exit 1
fi
for FILE in "$@"
do
PROJ=$(basename ${FILE} \.bam)
CMD="qsub -b y TElocal --stranded reverse --project ${PROJ} --sortByPos -b ${FILE} --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd"
echo "${CMD}"
${CMD}
done
$ qsub_TElocal.sh test1.bam test2.bam test3.bam
qsub -b y TElocal --stranded reverse --project test1 --sortByPos -b test1.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd
Your job 1 ("TElocal") has been submitted
qsub -b y TElocal --stranded reverse --project test2 --sortByPos -b test2.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd
Your job 2 ("TElocal") has been submitted
qsub -b y TElocal --stranded reverse --project test3 --sortByPos -b test3.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd
Your job 3 ("TElocal") has been submitted
Let me know if you have further questions.
Thanks.
Hi,
Yes. It is in essence functioning as
TEcount
in the TEtranscript software, except it does locus-level quantification. You can runTElocal
for each sample in parallel on the cluster, and then join the output (e.g. with something likemultijoin.pl
). You will have to make sure that the--project
parameter is used, or different runs will overwrite the output files. Here's an example of how you could batch process the BAM files:$ cat qsub_TElocal.sh if [ -z "$1" ]; then echo "Usage: $0 [BAM files ...]" >&2 exit 1 fi for FILE in "$@" do PROJ=$(basename ${FILE} \.bam) CMD="qsub -b y TElocal --stranded reverse --project ${PROJ} --sortByPos -b ${FILE} --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd" echo "${CMD}" ${CMD} done $ qsub_TElocal.sh test1.bam test2.bam test3.bam qsub -b y TElocal --stranded reverse --project test1 --sortByPos -b test1.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd Your job 1 ("TElocal") has been submitted qsub -b y TElocal --stranded reverse --project test2 --sortByPos -b test2.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd Your job 2 ("TElocal") has been submitted qsub -b y TElocal --stranded reverse --project test3 --sortByPos -b test3.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd Your job 3 ("TElocal") has been submitted
Let me know if you have further questions.
Thanks.
Thank you so much.
If I want to retrieve the genomic coordinates (chromosome, start, end) of each locus that appear in the genomic table, is there a TE GTF file where can I match the ids of the loci (which appear like this HAL1_dup14731:HAL1:L1:LINE in the count table ) that I got in the count table with the genomic coordinates in that GTF file?
I also found these files in the lab webpage and they include TE GTF files used with TEtranscripts. Are they they same? Is the index was made from these GTF files?
https://www.dropbox.com/sh/1ppg2e0fbc64bqw/AACUXf-TA1rnBIjvykMH2Lcia?dl=0
Hi,
You can find a table for the coordinates here.
To answer your question, yes, the same GTF files used in TEtranscripts are also used to generate the TElocal pre-built indices.
Thanks.
Thank you so much.
I ran the tool and everything is fine, but I get like message below . Is it a concern ?
[E::idx_find_and_load] Could not retrieve index file for '.1711053357.7101445bam'
Thanks
Hi,
This is a pysam warning message, and has no impact on our software.
Thanks
Hi, "I am encountering an issue with this software. Regardless of whether I include the --sortByPos parameter or not, I still face this problem. I have also tried to reorder the BAM file according to coordinates, but the error persists. Why is this happening?"
If the BAM file is sorted by coordinates, please specify --sortByPos and re-run! Error: 0 [Exception type: SystemExit, raised in TEtranscripts:320]
Thanks
Hi,
Thank you for your interest in the software. Could you provide the full log output?
Thanks.
Hi,
The following is the complete log output.
INFO @ Mon, 01 Apr 2024 14:40:07:
#ARGUMENTS LIST:
#name = TEtranscripts_out
#treatment files = ['KO_1_chr.bam']
#control files = ['WT_1_chr.bam']
#GTF file = gencode.v34.annotation.gtf
#TE file = hg38.RepeatMaser.sorted.gtf
#multi-mapper mode = multi
#stranded = no
#differential analysis using DESeq2
#normalization = DESeq2_default
#FDR cutoff = 5.00e-02
#fold-change cutoff = 1.00
#read count cutoff = 1
#number of iteration = 100
#Alignments grouped by read ID = True
INFO @ Mon, 01 Apr 2024 14:40:07: Processing GTF files ...
INFO @ Mon, 01 Apr 2024 14:40:07: Building gene index .......
100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
INFO @ Mon, 01 Apr 2024 14:54:50: Done building gene index ......
INFO @ Mon, 01 Apr 2024 14:55:16: Building TE index .......
INFO @ Mon, 01 Apr 2024 15:00:40: Done building TE index ......
INFO @ Mon, 01 Apr 2024 15:00:40:
Reading sample files ...
******NOT COMPLETE*******
If the BAM file is sorted by coordinates, please specify --sortByPos and re-run!
Error: 0
[Exception type: SystemExit, raised in TEtranscripts:320]
Thanks.
Hi,
Thank you for your response. Could I confirm that this is a paired-end library? The error appears to suggest that the program found an alignment for read 1, but not read 2. TEtranscripts expects all paired reads to be aligned in a one-to-one fashion (i.e. no orphan reads), and thus if there are unequal number of read 1 alignments vs read 2 alignments, it will throw this error. Which alignment program did you use for these BAM files?
Also, it appears that you're using GENCODE annotations, but using hg38 TE annotations. GENCODE has a different chromosome nomenclature for the scaffolds (non-canonical chromosomes), so if you're working with those as well, please be aware that the mismatch in chromosome names would mean that any GENCODE annotations on those scaffolds will not be annotated. Again, this might not be an issue if you only care about the canonical chromosomes (i.e. chr 1-22, X and Y)
Thanks.
Hi,
Thank you for your response. I confirm that this is data from a paired-end library.My BAM file is sorted by RNAME, and if the RNAME is identical, it is further sorted by POS. Additionally, regarding the point you raised about read1 and read2 not being equal, are you referring to the possibility that the number of reads from the treatment sample and the control sample are not equal?
Additionally, I am using the genome annotation from the UCSC database, and the TE annotations were extracted using the makeTEgtf.pl script. I believe the annotation file should be without issue.
Thanks.
Hi,
What I'm referring to is that there might be scenarios (depending on aligners) where the number of read 1 alignments are not equal to the number of read 2 alignments (e.g. read 1 maps twice, but read 2 maps in 3 locations). Typically, these are not output by STAR (the aligner we typically use for RNA-seq alignment), but others might output them. As a result, this can cause issues with TEtranscripts as it cannot find the corresponding alignment pair if they are unequal.
Could you provide the header for your BAM file? That might give us a hint if that could be an issue.
Also, are you sorting using samtools
or your own approach?
Thanks.
hi,
Here are the first 10 lines of the BAM file header content as listed:
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr1_GL383518v1_alt LN:182439
@SQ SN:chr1_GL383519v1_alt LN:110268
@SQ SN:chr1_GL383520v2_alt LN:366580
@SQ SN:chr1_KI270706v1_random LN:175055
@SQ SN:chr1_KI270707v1_random LN:32032
@SQ SN:chr1_KI270708v1_random LN:127682
@SQ SN:chr1_KI270709v1_random LN:66860
@SQ SN:chr1_KI270710v1_random LN:40176
I have used samtools to sort, and I have also attempted to convert the BAM file to a SAM file and sort it by chromosome, but both attempts resulted in the error mentioned above.
thanks
Hi,
Thank you for your response. Actually, I would like the last 10 lines of your header, as that would indicate the aligner used as well as the parameters. Also, please feel free to provide the first 10 lines of your alignments. Perhaps something there might be indicative.
Thanks.
Hi,
Thank you for your response.
The end of the header file information is as follows:
@SQ SN:chrUn_GL000216v2 LN:176608
@SQ SN:chrUn_GL000218v1 LN:161147
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrY_KI270740v1_random LN:37240
@PG ID:hisat2 PN:hisat2 VN:2.2.1 CL:"/public/bin/hisat2-align-s --wrapper basic-0 -q --phred33 --threads 8 --summary-file /KO_1/KO_1_pe_map_summary.txt -x /hg38/hisat2/hg38 --read-lengths 142,141,136,129,138,140,139,135,132,137,133,131,124,134,125,128,120,130,123,127,126,121,117,122,113,118,115,114,108,116,109,111,112,106,119,110,105,104,101,102,99,103,107,100,97,96,98,95,94,92,85,89,91,90,86,93,87,88,84,64,79,82,81,63,80,76,74,66,53,36,83,77,75,67,65,47,43,60,58,57,56,55,49,46,44,41,39,78,73,72,71,70,62,61,59,52,51,50,48,45,42,40,38,37 -1 /tmp/23420.inpipe1 -2 /tmp/23420.inpipe2"
@PG ID:samtools PN:samtools PP:hisat2 VN:1.14 CL:samtools view -bh -L hg38.whitelist.sort.bed -o KO_1_pe_uniq.mapq30.prop.whitelist.bam -@ 8
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.14 CL:samtools sort -@ 8 -o KO_1_pe_uniq.mapq30.prop.whitelist.srt.bam KO_1_pe_uniq.mapq30.prop.whitelist.bam
@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.17 CL:samtools view -H KO_1_pe.sort.bam
The first few lines of the alignment results are as follows:
LH00391:112:2253THLT4:2:1458:10816:29086 99 chr1 14418 60 142M = 14608 332 TTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACTGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCT IIIIIIIIIIIIIIIIIIIIIII9IIIIIIIII-IIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 ZS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:MD:Z:46A95 YS:i:0 YT:Z:CP NH:i:1
LH00391:112:2253THLT4:2:2362:43598:4502 163 chr1 14419 60 142M = 14481 204 TTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACTGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIII AS:i:-6 ZS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:45A96 YS:i:-1 YT:Z:CP NH:i:1
LH00391:112:2253THLT4:2:1389:40959:25906 163 chr1 14439 60 142M = 14601 304 TTCTCTGGAAGCCTCTTAAGAACACTGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCACACAGT IIIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 ZS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:MD:Z:25A116 YS:i:0 YT:Z:CP NH:i:1
LH00391:112:2253THLT4:2:2238:27996:19686 163 chr1 14441 60 142M = 14478 179 CTCTGGAAGCCTCTTAAGAACACTGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCACACAGTGC II9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIII AS:i:-6 ZS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:MD:Z:23A118 YS:i:0 YT:Z:CP NH:i:1
LH00391:112:2253THLT4:2:2231:11641:9363 99 chr1 14467 60 9S133M = 160864 135286 TTCTTAAGAGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCACACAGTGCTGGTTCCGTCACCCCCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII99IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-18 ZS:i:-18 XN:i:0 XM:i:0 XO:i:0 XG:i:NM:i:0 MD:Z:133 YS:i:-20 YT:Z:CP NH:i:1
Thank you very much.
Hi,
Thank you for your patience. It appears that you're using HiSat2 for alignment, which is fine. However, that program does try to align and output discordant pairs of alignment. We would recommend re-running HiSat2 with these two parameters to bypass the error that you're experiencing:
--no-mixed
By default, when hisat2 cannot find a concordant or discordant alignment for a pair, it then tries to find alignments for the individual mates. This option disables that behavior.
--no-discordant
By default, hisat2 looks for discordant alignments if it cannot find any concordant alignments. A discordant alignment is an alignment where both mates align uniquely, but that does not satisfy the paired-end constraints (--fr/--rf/--ff), -I, -X. This option disables that behavior.
This would prevent HiSat2 from trying to find alignments for discordant and individual mates, which would result in read-pairs where one of the read is not aligned.
Thanks.
Hi,
"Thank you very much for your response. I will try to re-compare and see if the issue still arises."
Thanks.
Hi,
"I have now obtained a file named TEtranscripts_out.cntTable, which contains the expression levels of each gene. I need to convert it into a TPM file. However, I also need the gene lengths for TE annotation, but each gene in TEgtf has multiple transcripts. How should I determine the length of a gene?"
Thanks.
Hi,
One possible approach is to sum up the length of all potential transcripts for a particular TE subfamily (gene_id
), and use that as the "gene length".
We typically don't use TPM in our analyses, as it's not optimal for doing differential expression, but the above approach is one option.
Thanks.
Hi,
Thank you very much for your response. However, each gene_id (TE subfamily) is present on multiple chromosomes, and there is a distinction between the positive and negative strands. Can I disregard the chromosomes and the distinction of strands, and simply sum up the lengths of each gene_id (TE subfamily) to represent the total length of that gene_id (TE subfamily)?
Thanks.
Hi,
The short answer is yes. Since you're trying to "normalize" based on the possible transcripts being generated from any of the copies of the TE, you would sum up the length of all copies.
Thanks.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days
Hi
I have been trying for 3 days but still coulnd't get it work on my end. Sorry I have very little knowledge and training in bioinformatics so please bear with me with the basics. I had RNA-seq data (fastq files) that I analyzed differential expression with Salmon before. Since Salmon doesn't require alignment, I didn't have bam files for those (only fastq files), so I had to first align my fastq files using bowtie2 in order to get bam files for TEtranscripts.
To get the bam files for TEtranscripts, do I need to map the fastq file to the complete genome? or is it ok if I map it to the known canonical genes only? My colleague told me it's fine to map to the known canonical genes one because bam file will also store the unmapped reads (which could be used in TEtranscripts).
I first tried to download mm39.fa file from UCSC, and mapped my fastq file with it. Then I ran TEtranscripts with mm10_rmsk_TE.gtf and mm10.knownGene.gtf downloaded from online. However the result looks completely off, most genes don't have any read mapped to it, and the deseq result is completely different from what I got from Salmon. I realize the difference version of genome (mm39 vs mm10) may be an issue. Is this likely the case?
Then, I tried to mapped my fastq file with mm10 known canonical genes ([mm10_knownCanonical_GENCODE_VM20_highconf_final_index]), got the new bam files and ran TEtranscripts. This time, every single read becomes 0. This is why I suspect the bam file generated from only canonical genes are not compatible.