mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
218 stars 29 forks source link

TE.gtf fails to generate index --> no TE diff expression output #12

Closed bdweave closed 6 years ago

bdweave commented 7 years ago

Hello,

I have an issue that results in a full run of the TEtranscripts package, but only outputs differential gene expression.

My workflow has consisted of using STAR aligner, saving 100 multi mappers, and using these bam files to compare two conditions of hESC RNA-expression. I have downloaded and used the custom human_TE.gtf file located in your linked files - although I renamed the file for convenience to hg_rmsk_TE_Hammell.gtf. The human_gene.gtf is simply a UCSC downloaded gene file.

Here are the settings used when running the command: ~/.local/bin/TEtranscripts \ --format BAM \ --mode multi \ -t ./DuxA/gfp_pos/BAMs/DuxA_1.bam ./DuxA/gfp_pos/BAMs/DuxA_2.bam ./DuxA/gfp_pos/BAMs/DuxA_3.bam ./DuxA/gfp_pos/BAMs/DuxA_4.bam \ ./DuxA/gfp_pos/BAMs/DuxA_5.bam ./DuxA/gfp_pos/BAMs/DuxA_6.bam \ -c ./mCherry_control/gfp_pos/BAMs/mCherry_2_s1.bam ./mCherry_control/gfp_pos/BAMs/mCherry_3.bam ./mCherry_control/gfp_pos/BAMs/mCherry_4.bam \ ./mCherry_control/gfp_pos/BAMs/mCherry_5.bam ./mCherry_control/gfp_pos/BAMs/mCherry_6.bam \ --GTF /tomato/dev/data/Human/GRCh38.p7/Homo_sapiens.GRCh38.87.gtf \ --TE /tomato/dev/job/BradW/PRD_RNAseq_analysis/Hammell_ref_materials/hg38_rmsk_TE_Hammell.gtf --project DUXA.TEtranscripts

Output from this run:

INFO @ Tue, 04 Apr 2017 09:46:44: ARGUMENTS LIST: name = DUXA.TEtranscripts treatment files = ['./DuxA/gfp_pos/BAMs/DuxA_1.bam', './DuxA/gfp_pos/BAMs/DuxA_2.bam', './DuxA/gfp_pos/BAMs/DuxA_3.bam', './DuxA/gfp_pos/BAMs/DuxA_4.bam', './DuxA/gfp_pos/BAMs/DuxA_5.bam', './DuxA/gfp_pos/BAMs/DuxA_6.bam'] control files = ['./mCherry_control/gfp_pos/BAMs/mCherry_2_s1.bam', './mCherry_control/gfp_pos/BAMs/mCherry_3.bam', './mCherry_control/gfp_pos/BAMs/mCherry_4.bam', './mCherry_control/gfp_pos/BAMs/mCherry_5.bam', './mCherry_control/gfp_pos/BAMs/mCherry_6.bam'] GTF file = /tomato/dev/data/Human/GRCh38.p7/Homo_sapiens.GRCh38.87.gtf TE file = /tomato/dev/job/BradW/PRD_RNAseq_analysis/Hammell_ref_materials/hg38_rmsk_TE_Hammell.gtf multi-mapper mode = multi stranded = yes normalization = DESeq_default (rpm: Reads Per Million mapped; quant: Quantile normalization) FDR cutoff = 5.00e-02 fold-change cutoff = 1.00 read count cutoff = 1 number of iteration = 10 Alignments grouped by read ID = True

INFO @ Tue, 04 Apr 2017 09:46:44: Processing GTF files ...

INFO @ Tue, 04 Apr 2017 09:46:44: Building gene index .......

100000 GTF lines processed. 200000 GTF lines processed. 300000 GTF lines processed. 400000 GTF lines processed. 500000 GTF lines processed. 600000 GTF lines processed. 700000 GTF lines processed. 800000 GTF lines processed. 900000 GTF lines processed. 1000000 GTF lines processed. 1100000 GTF lines processed. INFO @ Tue, 04 Apr 2017 09:57:18: Done building gene index ......

INFO @ Tue, 04 Apr 2017 09:57:46: Building TE index ....... INFO @ Tue, 04 Apr 2017 10:05:04: Done building TE index ...... INFO @ Tue, 04 Apr 2017 10:05:04: Reading sample files ...

uniq te counts = 0.0 .......start iterative optimization .......... TE counts total 0.0 Gene counts total 411254.705556

In library ./DuxA/gfp_pos/BAMs/DuxA_1.bam: Total annotated reads = 411254.705556 Total non-uniquely mapped reads = 640 Total unannotated reads = 413971

uniq te counts = 0.0 .......start iterative optimization .......... TE counts total 0.0 Gene counts total 364145.324242


and so on...

Ultimately, my output files from this run consist, solely of rows populated with ENSG0000****, which have corresponding human gene* symbols, leading me to believe that the TE index was never created and never compared with respect to my BAM files.

Are there any ways to ensure that the TE index is created and used for comparison?

Thanks, Brad

olivertam commented 7 years ago

Hi. Looking at the log, it appears that there were no reads assigned to TE, which is very strange. I don't know if this is due to the TE index not being created (though it appears that it should be), or if the intersection between the BAM and TE index was not working. What is also unusual is that there were very few non-uniquely mapped reads in the DuxA_1.bam file according to TEtranscripts (<1000 for a library of 400,000 reads), which would also suggest a low amount of repetitive sequences detected (I think that number is calculated independently of whether they annotate as TE). What is in the header of the BAM files (in particular, the @PG line)? Were the BAM files sorted by coordinate, by name, or unsorted? Were these paired end reads? Sorry for the many questions, and thanks for your interest in the tool.

bdweave commented 7 years ago

The @PG line is as follows: @PG ID:STAR PN:STAR VN:STAR_2.5.2b CL:/uufs/chpc.utah.edu/common/home/hcibcore/atlatl/app/STAR/2.5.2b/STAR --runMode alignReads --runThreadN 32 --genomeDir /uufs/chpc.utah.edu/common/home/hcibcore/atlatl/data/Human/GRCh38.p7/star50 --readFilesIn ERR1223910.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outWigType bedGraph --outWigStrand Unstranded --outFilterMultimapNmax 100 --clip3pAdapterSeq AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --winAnchorMultimapNmax 100 --twopassMode Basic

The BAM files are sorted by coordinate, and these are single-end 50bp reads.

Thanks for your help in troubleshooting.

As far as a low amount of repetitive sequences in my data - this is true relative to the genic counts - but I do have enough TE counts for DESeq2 analysis and statistics without using your package, yet still using your custom GTF, and instead using featureCounts to include fractional counting of multi mappers.

Thanks, Brad

olivertam commented 7 years ago

Hi Brad, I don't know if this is the cause, but I noticed a possible issue. In your previous comment, you mentioned:

The BAM files are sorted by coordinate, and these are single-end 50bp reads.

However, in the log (note last line of copied text):

INFO @ Tue, 04 Apr 2017 09:46:44:
ARGUMENTS LIST:
...
number of iteration = 10
Alignments grouped by read ID = True

This means that the program thinks the BAM file is sorted by readID (or unsorted), whereas they are actually sorted by coordinate. This could certainly result in unusual behavior (perhaps similar to what you have noticed).

There are two things you can try:

  1. Add the --sortByPos parameter to your TEtranscripts command line. However, if you are using pysam version 0.9 or higher and/or samtools version 1.3 or higher, the current version of TEtranscripts cannot handle the new command line used by samtools sort
  2. Sort the BAM files by read name using samtools sort, and then run TEtranscripts as before. samtools sort -n -o [output BAM] [input BAM] if using samtools version 1.3 or higher

Hopefully this might help resolve the issue. Thanks again.