mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
229 stars 30 forks source link

basic questions about TEtranscripts #184

Closed shaymin84 closed 6 months ago

shaymin84 commented 8 months ago

Hi

I have been trying for 3 days but still coulnd't get it work on my end. Sorry I have very little knowledge and training in bioinformatics so please bear with me with the basics. I had RNA-seq data (fastq files) that I analyzed differential expression with Salmon before. Since Salmon doesn't require alignment, I didn't have bam files for those (only fastq files), so I had to first align my fastq files using bowtie2 in order to get bam files for TEtranscripts.

  1. To get the bam files for TEtranscripts, do I need to map the fastq file to the complete genome? or is it ok if I map it to the known canonical genes only? My colleague told me it's fine to map to the known canonical genes one because bam file will also store the unmapped reads (which could be used in TEtranscripts).

  2. I first tried to download mm39.fa file from UCSC, and mapped my fastq file with it. Then I ran TEtranscripts with mm10_rmsk_TE.gtf and mm10.knownGene.gtf downloaded from online. However the result looks completely off, most genes don't have any read mapped to it, and the deseq result is completely different from what I got from Salmon. I realize the difference version of genome (mm39 vs mm10) may be an issue. Is this likely the case?

  3. Then, I tried to mapped my fastq file with mm10 known canonical genes ([mm10_knownCanonical_GENCODE_VM20_highconf_final_index]), got the new bam files and ran TEtranscripts. This time, every single read becomes 0. This is why I suspect the bam file generated from only canonical genes are not compatible.

olivertam commented 8 months ago

Hi,

Thank you for your interest in the software. The short answer to all your questions is that TEtranscripts operates on genomic alignment, rather than pseudoalignment to a transcript database. You will need to align to the genome for the tool to work.

  1. You would need to align to the genome rather than genes, as TEtranscripts does not quantify by pseudoalignment. We would recommend using STAR, though it does require additional setup like building the index, and uses quite a bit of memory (at least 30-40Gb for mouse). You can try to use bowtie2, but it is much slower.
    
    # Example command on generating STAR index
    $ STAR --runMode genomeGenerate \
       --limitGenomeGenerateRAM 39000000000 \
       --runThreadN 10 \
       --genomeDir mm39_STAR_knownGenes  \
       --genomeFastaFiles mm39.fa \
       --sjdbGTFfile mm39.knownGene.gtf \
       --sjdbOverhang 150

Example command on running STAR for TEtranscripts

$ STAR --genomeDir mm39_STAR_knownGenes \ --readFilesIn R1.fastq.gz R2.fastq.gz --readFilesCommand zcat \ --outFilterMultimapNmax 100 --winAnchorMultimapNmax 150 \ --runThreadN 10 --genomeLoad NoSharedMemory \ --outSAMattributes Standard --outFilterType BySJout \ --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within KeepPairs \ --limitBAMsortRAM 39000000000 \ --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 \ --outSAMstrandField intronMotif --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 --alignIntronMin 20 --sjdbScore 1 \ --alignIntronMax 1000000 --alignMatesGapMax 1000000



2. You definitely need to use the same genome build, as TEtranscripts compares the genomic position of your read to the location of the gene/TE. Different versions of the mouse genome are not directly comparable. There are mm39 [TE]() and [knownGene](https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/genes/mm39.knownGene.gtf.gz) GTF that could be used instead.

Hope this is helpful.
shaymin84 commented 8 months ago

Hi

Thank you so much for the prompt response! I've done some research online and read through the discussion in closed threads here to learn more. I've only done alignment to transcriptome (not the whole genome) before, and only did it with bowtie2. Seems like it's more complicated than I thought. My main questions now are 2:

  1. I've figured out how to download mm10 genome and index it. But from what I saw online, bowtie2 seems to not take into consideration the introns, so many reads would not be mapped. Does this make bowtie2 unusable for DNA genome alignment? Im trying using STAR as you suggest now. But when I ran the code you suggested to index mm39 for STAR, is showed an error std::bad_alloc. The cluster Im using should still have 4T of memory. Do you know why it shows the memory is running out while it clearly isn't?

  2. in both TEcounts and TEtranscripts --GTF (for the coding genes) is a required argument. If I only care about transposon elements (protein coding genes had been carried out by Salmon), could I just put a random file in --GTF, ignore the results for all protein coding genes, and only see the TE results?

olivertam commented 8 months ago

Hi,

  1. Bowtie2 can deal with some gapped reads, but it is certainly not designed for heavily spliced reads. That's why we are recommending STAR, which is designed for this.
  2. Could you provide the full log output of the STAR index run? It will provide more information. Did you request the required amount of memory from the cluster (if you're submitting this to the cluster)?
  3. It is important to use both annotations, as quantification of the TE does depend on whether it overlaps a gene or not. For example, if you have a read that maps to a genomic region that is both a gene exon and an Alu sequence, depending on whether it is the only location in the genome where it was found (what we call unique mappers), or also mapping to other genomic locations (what we call multi mappers), it will preferentially consider that as gene-derived (if former) and TE-derived (if latter). This is based on the idea that genes are more "unique" in the genome, while TE (especially younger TE) are more likely to appear in multiple regions in the genome with highly similar sequences. Thus, you will probably see slight differences in gene counts between the two, but most protein-coding genes (e.g. GAPDH and ACTB) should be similar.

Happy to answer any more questions and help with troubleshooting.

shaymin84 commented 8 months ago

Hi

Thank you so much again for such a prompt response! That is very helpful.

When I tried to map with Bowtie2, turns out ~35% of reads are mapped once, ~30% mapped more than once. ~30% didn't map at all. Does this number look too low for you, thus not worth proceeding with TEtrancripts?

and yes, the problem was I didn't request for memory from the cluster. I just figured how to do this and it seems fine now.

Sorry again for these questions, Im quite new to this with very little training.

olivertam commented 8 months ago

Hi,

The Bowtie2 results are on the low side, so I would probably map with STAR and see if it's similar or higher than Bowtie2. And I'm glad to hear that the indexing is now working. It might take a while, but should not take more than 12 hours.

Thanks.

shaymin84 commented 8 months ago

Hi

the index is working and successfully done. I tried to run the second part of the code you suggested (Example command on running STAR for TEtranscripts) to do the alignment, but if failed with "fatal error".

The slurm file states "ReadAlignChunk_processChunks.cpp:204:processChunks EXITING because of FATAL ERROR in input reads: wrong read ID line format: the read ID lines should start with @ or > " "SOLUTION: verify and correct the input read files" "FATAL ERROR, exiting"

The input I gave were normal fa.gz files I got from sequencing, 2 files since it was pair-end sequencing. Do you know what parameters may be wrong?

----Edit---- I figured it may be due to the fasta file compression issue. I added --readFilesCommand zcat and it seems to work now

olivertam commented 8 months ago

Hi,

Yes, you need the --readFilesCommand zcat for fa.gz files. Hopefully it works now.

Thanks.

smarhon commented 8 months ago

Hi,

Can TEtranscript be applied for locus-wise TE analysis? I modified the GTF file to make the loci distinct and used it with the tool, but it is taking very long time in calculating the index.

olivertam commented 8 months ago

Hi,

We recommend TElocal, which is probably what you are looking for. There are pre-built indices here, as you correctly determined that building the index from scratch takes a long time.

Thanks.

smarhon commented 8 months ago

Hi,

We recommend TElocal, which is probably what you are looking for. There are pre-built indices here, as you correctly determined that building the index from scratch takes a long time.

Thanks.

Thank you for your prompt response.

Do I do the analysis for each bam file/sample separately?

olivertam commented 8 months ago

Hi,

Yes. It is in essence functioning as TEcount in the TEtranscript software, except it does locus-level quantification. You can run TElocal for each sample in parallel on the cluster, and then join the output (e.g. with something like multijoin.pl). You will have to make sure that the --project parameter is used, or different runs will overwrite the output files. Here's an example of how you could batch process the BAM files:

$ cat qsub_TElocal.sh
if [ -z "$1" ]; then
    echo "Usage: $0 [BAM files ...]" >&2
    exit 1
fi

for FILE in "$@"
do
    PROJ=$(basename ${FILE} \.bam)
    CMD="qsub -b y TElocal --stranded reverse --project ${PROJ} --sortByPos -b ${FILE} --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd"
    echo "${CMD}"
    ${CMD}
done

$ qsub_TElocal.sh test1.bam test2.bam test3.bam
qsub -b y TElocal --stranded reverse --project test1 --sortByPos -b test1.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd
Your job 1 ("TElocal") has been submitted
qsub -b y TElocal --stranded reverse --project test2 --sortByPos -b test2.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd
Your job 2 ("TElocal") has been submitted
qsub -b y TElocal --stranded reverse --project test3 --sortByPos -b test3.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd
Your job 3 ("TElocal") has been submitted

Let me know if you have further questions.

Thanks.

smarhon commented 8 months ago

Hi,

Yes. It is in essence functioning as TEcount in the TEtranscript software, except it does locus-level quantification. You can run TElocal for each sample in parallel on the cluster, and then join the output (e.g. with something like multijoin.pl). You will have to make sure that the --project parameter is used, or different runs will overwrite the output files. Here's an example of how you could batch process the BAM files:

$ cat qsub_TElocal.sh
if [ -z "$1" ]; then
    echo "Usage: $0 [BAM files ...]" >&2
    exit 1
fi

for FILE in "$@"
do
    PROJ=$(basename ${FILE} \.bam)
    CMD="qsub -b y TElocal --stranded reverse --project ${PROJ} --sortByPos -b ${FILE} --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd"
    echo "${CMD}"
    ${CMD}
done

$ qsub_TElocal.sh test1.bam test2.bam test3.bam
qsub -b y TElocal --stranded reverse --project test1 --sortByPos -b test1.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd
Your job 1 ("TElocal") has been submitted
qsub -b y TElocal --stranded reverse --project test2 --sortByPos -b test2.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd
Your job 2 ("TElocal") has been submitted
qsub -b y TElocal --stranded reverse --project test3 --sortByPos -b test3.bam --GTF mm10_knownGene.gtf --TE mm10_rmsk_TE.gtf.locInd
Your job 3 ("TElocal") has been submitted

Let me know if you have further questions.

Thanks.

Thank you so much.

If I want to retrieve the genomic coordinates (chromosome, start, end) of each locus that appear in the genomic table, is there a TE GTF file where can I match the ids of the loci (which appear like this HAL1_dup14731:HAL1:L1:LINE in the count table ) that I got in the count table with the genomic coordinates in that GTF file?

I also found these files in the lab webpage and they include TE GTF files used with TEtranscripts. Are they they same? Is the index was made from these GTF files?

https://www.dropbox.com/sh/1ppg2e0fbc64bqw/AACUXf-TA1rnBIjvykMH2Lcia?dl=0

olivertam commented 8 months ago

Hi,

You can find a table for the coordinates here.

To answer your question, yes, the same GTF files used in TEtranscripts are also used to generate the TElocal pre-built indices.

Thanks.

smarhon commented 8 months ago

Hi, We recommend TElocal, which is probably what you are looking for. There are pre-built indices here, as you correctly determined that building the index from scratch takes a long time. Thanks.

Thank you for your prompt response.

Do I do the analysis for each bam file/sample separately?

smarhon commented 8 months ago

Hi,

You can find a table for the coordinates here.

To answer your question, yes, the same GTF files used in TEtranscripts are also used to generate the TElocal pre-built indices.

Thanks.

Thank you so much.

I ran the tool and everything is fine, but I get like message below . Is it a concern ?

[E::idx_find_and_load] Could not retrieve index file for '.1711053357.7101445bam'

Thanks

olivertam commented 8 months ago

Hi,

This is a pysam warning message, and has no impact on our software.

Thanks

zhangsuaiqi commented 7 months ago

Hi, "I am encountering an issue with this software. Regardless of whether I include the --sortByPos parameter or not, I still face this problem. I have also tried to reorder the BAM file according to coordinates, but the error persists. Why is this happening?"

If the BAM file is sorted by coordinates, please specify --sortByPos and re-run! Error: 0 [Exception type: SystemExit, raised in TEtranscripts:320]

Thanks

olivertam commented 7 months ago

Hi,

Thank you for your interest in the software. Could you provide the full log output?

Thanks.

zhangsuaiqi commented 7 months ago

Hi,

The following is the complete log output.

INFO  @ Mon, 01 Apr 2024 14:40:07:
#ARGUMENTS LIST:
#name = TEtranscripts_out
#treatment files = ['KO_1_chr.bam']
#control files = ['WT_1_chr.bam']
#GTF file = gencode.v34.annotation.gtf
#TE file = hg38.RepeatMaser.sorted.gtf
#multi-mapper mode = multi
#stranded = no
#differential analysis using DESeq2
#normalization = DESeq2_default
#FDR cutoff = 5.00e-02
#fold-change cutoff =  1.00
#read count cutoff = 1
#number of iteration = 100
#Alignments grouped by read ID = True

INFO  @ Mon, 01 Apr 2024 14:40:07: Processing GTF files ...

INFO  @ Mon, 01 Apr 2024 14:40:07: Building gene index .......

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
INFO  @ Mon, 01 Apr 2024 14:54:50: Done building gene index ......

INFO  @ Mon, 01 Apr 2024 14:55:16: Building TE index .......

INFO  @ Mon, 01 Apr 2024 15:00:40: Done building TE index ......

INFO  @ Mon, 01 Apr 2024 15:00:40:
Reading sample files ...

******NOT COMPLETE*******
If the BAM file is sorted by coordinates,                                 please specify --sortByPos and re-run!
Error: 0
[Exception type: SystemExit, raised in TEtranscripts:320]

Thanks.

olivertam commented 7 months ago

Hi,

Thank you for your response. Could I confirm that this is a paired-end library? The error appears to suggest that the program found an alignment for read 1, but not read 2. TEtranscripts expects all paired reads to be aligned in a one-to-one fashion (i.e. no orphan reads), and thus if there are unequal number of read 1 alignments vs read 2 alignments, it will throw this error. Which alignment program did you use for these BAM files?

Also, it appears that you're using GENCODE annotations, but using hg38 TE annotations. GENCODE has a different chromosome nomenclature for the scaffolds (non-canonical chromosomes), so if you're working with those as well, please be aware that the mismatch in chromosome names would mean that any GENCODE annotations on those scaffolds will not be annotated. Again, this might not be an issue if you only care about the canonical chromosomes (i.e. chr 1-22, X and Y)

Thanks.

zhangsuaiqi commented 7 months ago

Hi,

Thank you for your response. I confirm that this is data from a paired-end library.My BAM file is sorted by RNAME, and if the RNAME is identical, it is further sorted by POS. Additionally, regarding the point you raised about read1 and read2 not being equal, are you referring to the possibility that the number of reads from the treatment sample and the control sample are not equal?

Additionally, I am using the genome annotation from the UCSC database, and the TE annotations were extracted using the makeTEgtf.pl script. I believe the annotation file should be without issue.

Thanks.

olivertam commented 7 months ago

Hi,

What I'm referring to is that there might be scenarios (depending on aligners) where the number of read 1 alignments are not equal to the number of read 2 alignments (e.g. read 1 maps twice, but read 2 maps in 3 locations). Typically, these are not output by STAR (the aligner we typically use for RNA-seq alignment), but others might output them. As a result, this can cause issues with TEtranscripts as it cannot find the corresponding alignment pair if they are unequal.

Could you provide the header for your BAM file? That might give us a hint if that could be an issue.

Also, are you sorting using samtools or your own approach?

Thanks.

zhangsuaiqi commented 7 months ago

hi,

Here are the first 10 lines of the BAM file header content as listed:

@HD VN:1.0  SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr1_GL383518v1_alt  LN:182439
@SQ SN:chr1_GL383519v1_alt  LN:110268
@SQ SN:chr1_GL383520v2_alt  LN:366580
@SQ SN:chr1_KI270706v1_random   LN:175055
@SQ SN:chr1_KI270707v1_random   LN:32032
@SQ SN:chr1_KI270708v1_random   LN:127682
@SQ SN:chr1_KI270709v1_random   LN:66860
@SQ SN:chr1_KI270710v1_random   LN:40176

I have used samtools to sort, and I have also attempted to convert the BAM file to a SAM file and sort it by chromosome, but both attempts resulted in the error mentioned above.

thanks

olivertam commented 7 months ago

Hi,

Thank you for your response. Actually, I would like the last 10 lines of your header, as that would indicate the aligner used as well as the parameters. Also, please feel free to provide the first 10 lines of your alignments. Perhaps something there might be indicative.

Thanks.

zhangsuaiqi commented 7 months ago

Hi,

Thank you for your response.

The end of the header file information is as follows:

@SQ SN:chrUn_GL000216v2 LN:176608
@SQ SN:chrUn_GL000218v1 LN:161147
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrY_KI270740v1_random   LN:37240
@PG ID:hisat2   PN:hisat2   VN:2.2.1    CL:"/public/bin/hisat2-align-s --wrapper basic-0 -q --phred33 --threads 8 --summary-file /KO_1/KO_1_pe_map_summary.txt -x /hg38/hisat2/hg38 --read-lengths 142,141,136,129,138,140,139,135,132,137,133,131,124,134,125,128,120,130,123,127,126,121,117,122,113,118,115,114,108,116,109,111,112,106,119,110,105,104,101,102,99,103,107,100,97,96,98,95,94,92,85,89,91,90,86,93,87,88,84,64,79,82,81,63,80,76,74,66,53,36,83,77,75,67,65,47,43,60,58,57,56,55,49,46,44,41,39,78,73,72,71,70,62,61,59,52,51,50,48,45,42,40,38,37 -1 /tmp/23420.inpipe1 -2 /tmp/23420.inpipe2"
@PG ID:samtools PN:samtools PP:hisat2   VN:1.14 CL:samtools view -bh -L hg38.whitelist.sort.bed -o KO_1_pe_uniq.mapq30.prop.whitelist.bam -@ 8
@PG ID:samtools.1   PN:samtools PP:samtools VN:1.14 CL:samtools sort -@ 8 -o KO_1_pe_uniq.mapq30.prop.whitelist.srt.bam KO_1_pe_uniq.mapq30.prop.whitelist.bam
@PG ID:samtools.2   PN:samtools PP:samtools.1   VN:1.17 CL:samtools view -H KO_1_pe.sort.bam

The first few lines of the alignment results are as follows:

LH00391:112:2253THLT4:2:1458:10816:29086    99  chr1    14418   60  142M    =   14608   332 TTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACTGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCT  IIIIIIIIIIIIIIIIIIIIIII9IIIIIIIII-IIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-6 ZS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:MD:Z:46A95 YS:i:0  YT:Z:CP NH:i:1
LH00391:112:2253THLT4:2:2362:43598:4502 163 chr1    14419   60  142M    =   14481   204 TTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACTGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTT  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIII  AS:i:-6 ZS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:45A96  YS:i:-1 YT:Z:CP NH:i:1
LH00391:112:2253THLT4:2:1389:40959:25906    163 chr1    14439   60  142M    =   14601   304 TTCTCTGGAAGCCTCTTAAGAACACTGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCACACAGT  IIIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-6 ZS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:MD:Z:25A116    YS:i:0  YT:Z:CP NH:i:1
LH00391:112:2253THLT4:2:2238:27996:19686    163 chr1    14441   60  142M    =   14478   179 CTCTGGAAGCCTCTTAAGAACACTGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCACACAGTGC  II9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIII  AS:i:-6 ZS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:MD:Z:23A118    YS:i:0  YT:Z:CP NH:i:1
LH00391:112:2253THLT4:2:2231:11641:9363 99  chr1    14467   60  9S133M  =   160864  135286  TTCTTAAGAGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCACACAGTGCTGGTTCCGTCACCCCCT  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII99IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-18    ZS:i:-18    XN:i:0  XM:i:0  XO:i:0  XG:i:NM:i:0 MD:Z:133    YS:i:-20    YT:Z:CP NH:i:1

Thank you very much.

olivertam commented 7 months ago

Hi,

Thank you for your patience. It appears that you're using HiSat2 for alignment, which is fine. However, that program does try to align and output discordant pairs of alignment. We would recommend re-running HiSat2 with these two parameters to bypass the error that you're experiencing:

--no-mixed
    By default, when hisat2 cannot find a concordant or discordant alignment for a pair, it then tries to find alignments for the individual mates. This option disables that behavior.

--no-discordant
    By default, hisat2 looks for discordant alignments if it cannot find any concordant alignments. A discordant alignment is an alignment where both mates align uniquely, but that does not satisfy the paired-end constraints (--fr/--rf/--ff), -I, -X. This option disables that behavior.

This would prevent HiSat2 from trying to find alignments for discordant and individual mates, which would result in read-pairs where one of the read is not aligned.

Thanks.

zhangsuaiqi commented 7 months ago

Hi,

"Thank you very much for your response. I will try to re-compare and see if the issue still arises."

Thanks.

zhangsuaiqi commented 7 months ago

Hi,

"I have now obtained a file named TEtranscripts_out.cntTable, which contains the expression levels of each gene. I need to convert it into a TPM file. However, I also need the gene lengths for TE annotation, but each gene in TEgtf has multiple transcripts. How should I determine the length of a gene?"

Thanks.

olivertam commented 7 months ago

Hi,

One possible approach is to sum up the length of all potential transcripts for a particular TE subfamily (gene_id), and use that as the "gene length". We typically don't use TPM in our analyses, as it's not optimal for doing differential expression, but the above approach is one option.

Thanks.

zhangsuaiqi commented 7 months ago

Hi,

Thank you very much for your response. However, each gene_id (TE subfamily) is present on multiple chromosomes, and there is a distinction between the positive and negative strands. Can I disregard the chromosomes and the distinction of strands, and simply sum up the lengths of each gene_id (TE subfamily) to represent the total length of that gene_id (TE subfamily)?

Thanks.

olivertam commented 7 months ago

Hi,

The short answer is yes. Since you're trying to "normalize" based on the possible transcripts being generated from any of the copies of the TE, you would sum up the length of all copies.

Thanks.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days