mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

Lots of unannotated reads #22

Closed by MonkeySylvia 6 years ago

MonkeySylvia commented 6 years ago

Hi, I'm counting gene/TE expression in zebrafish. I used hisat2 to map my reads to the genome (danRer10) and then ran TEtranscripts on the resulting BAM files. I used the zebrafish GTF file provided on this website as my reference GTF. However, lots of reads are classified as 'unannotated reads'. Should I be concerned about this? Thanks! Sylvia

My code:

nohup TEtranscripts --project first_try_tetranscripts \
    --GTF ../../reference/danRer10_uscs.gtf \
    --TE danRer10_repeatmasker_table_v30309.bed \
    -c 126_B_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam \
       126_C_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam \
       126_D_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam \
       126_E_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam \
       126_F_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam \
    -t 127_B_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam \
       127_C_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam \
       127_D_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam \
       127_E_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam \
       127_F_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam \
    --sortByPos > first_try_tetranscripts.log.txt &

And one of the outputs:

In library 126_D_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam:
Total annotated reads = 552211.594767
Total non-uniquely mapped reads = 462588
Total unannotated reads = 6343265
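To put a number on "lots of unannotated reads", the three totals can be pulled out of the log text and turned into a proportion. This is only a minimal sketch: the function names are made up for illustration, and the denominator (annotated plus unannotated) is an assumption about how the totals relate, not something the TEtranscripts log states.

```python
import re

# Matches the three per-library totals as they appear in the log excerpt above.
TOTAL_RE = re.compile(
    r"Total (annotated|non-uniquely mapped|unannotated) reads = ([0-9.]+)"
)


def parse_totals(log_text):
    """Return a dict mapping each category name to its read total (float)."""
    return {name: float(value) for name, value in TOTAL_RE.findall(log_text)}


def unannotated_fraction(totals):
    """Unannotated / (annotated + unannotated).

    Assumption: annotated and unannotated reads together make up the
    denominator; the log itself does not say how the totals relate.
    """
    annotated = totals["annotated"]
    unannotated = totals["unannotated"]
    return unannotated / (annotated + unannotated)


# Summary for library 126_D_1, copied from the post above.
log_text = (
    "Total annotated reads = 552211.594767 "
    "Total non-uniquely mapped reads = 462588 "
    "Total unannotated reads = 6343265"
)

totals = parse_totals(log_text)
print(f"unannotated fraction: {unannotated_fraction(totals):.1%}")
```

Under that assumption, roughly nine out of ten counted reads in this library end up unannotated.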

olivertam commented 6 years ago

Hi Sylvia, The proportion is higher than what we usually see for other samples. What are the Gene and TE counts for those libraries? I noticed that you are using a TE BED file (or at least a file with the BED extension) rather than a GTF file. Not sure if that makes any difference. Also, is this a stranded library? That is, does read 1 correspond to the direction of the mRNA transcript, or to its reverse complement, or does the library have no strand bias? Thanks

MonkeySylvia commented 6 years ago

Hi Oliver, My library is stranded, like a standard Illumina library. The gene count for that sample is 134302, and the TE count is 414029. And sorry, I think I pasted the wrong command for my TE file; I actually used the one you provided:

nohup TEtranscripts --project blas_vs_hat --GTF ../../reference/danRer10_uscs.gtf --TE danRer10_rmsk_TE.gtf

I'm still not sure whether I should be concerned about this.

olivertam commented 6 years ago

Hi Sylvia, Might I recommend running one treatment and one control library using the --stranded reverse and/or --stranded no options? Without knowing the exact RNA-seq protocol used, it might be a good idea to check whether the library is reverse-stranded (especially if you're using the TruSeq kit). That way, you can compare the result to the run you already did, and see if it significantly increases the Gene counts. Thanks.
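One way to set up that test is to generate one command per strandedness setting for a single control/treatment pair. The sketch below only assembles the command strings and does not run anything. It uses the flags that appear in this thread (--project, --GTF, --TE, -c, -t, --stranded, --sortByPos); the helper name, the project names, and the file names are hypothetical placeholders.

```python
import shlex


def build_strand_test_commands(control_bam, treatment_bam, gene_gtf, te_gtf,
                               modes=("reverse", "no")):
    """Return one TEtranscripts command line per --stranded setting.

    The two default modes are the ones suggested in this thread.
    """
    commands = []
    for mode in modes:
        argv = [
            "TEtranscripts",
            "--project", f"strand_test_{mode}",  # hypothetical project name
            "--GTF", gene_gtf,
            "--TE", te_gtf,
            "-c", control_bam,
            "-t", treatment_bam,
            "--stranded", mode,
            "--sortByPos",
        ]
        # shlex.join quotes any argument that needs it (Python 3.8+).
        commands.append(shlex.join(argv))
    return commands


# Placeholder file names, not the ones used in this thread.
for command in build_strand_test_commands(
    "control.bam", "treatment.bam", "genes.gtf", "te.gtf"
):
    print(command)
```

Each printed line can then be run by hand and the resulting annotated/unannotated totals compared against the original run.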

MonkeySylvia commented 6 years ago

Hi Oliver, I tried --stranded reverse and got much better annotations! Here is my result:

In library 126_D_1.fastq.clean.fastq.clean.paired.qc.fastq.k100.sam.k100.sorted.bam:
Total annotated reads = 5914556.27231
Total non-uniquely mapped reads = 462588
Total unannotated reads = 873972

Thank you!!
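For a rough sense of how much the strandedness setting changed things, the two sets of totals reported in this thread for library 126_D_1 can be compared directly. This is a small illustrative calculation; the variable and function names are made up for the example.

```python
# Totals for library 126_D_1, copied from the two log excerpts in this thread.
first_run = {"annotated": 552211.594767, "unannotated": 6343265}    # no --stranded option given
reverse_run = {"annotated": 5914556.27231, "unannotated": 873972}   # reverse-stranded run


def fold_change(old, new):
    """Ratio of the new value to the old value."""
    return new / old


annotated_gain = fold_change(first_run["annotated"], reverse_run["annotated"])
unannotated_drop = first_run["unannotated"] - reverse_run["unannotated"]

print(f"annotated reads increased {annotated_gain:.1f}-fold")
print(f"unannotated reads decreased by {unannotated_drop}")
```

The annotated total rises by about an order of magnitude, which is consistent with the library being reverse-stranded.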