mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

multi-threading #17

Closed retrogenomics closed 6 years ago

retrogenomics commented 6 years ago

Hi, I'm wondering if there is an easy way to use multi-threading to run TEtranscript. I haven't found any option for that. The counting process is very slow for large RNA-seq files and this would speed up the entire process. Thanks in advance for your help

olivertam commented 6 years ago

Thank you for your feedback. We are in the process of getting the counting subroutine separated out as a tool so that it can be run on each BAM files separately (somewhat parallelizing the process). As for making TEtranscripts a truly multi-threaded process, we are still working on that.

retrogenomics commented 6 years ago

That would be a fantastic practical improvement ! I look forward to testing it.

vasilislenis commented 6 years ago

Hi Oliver,

I find this approach extremely useful since I am working on a cluster with time limit and currently a 74 samples job is crushing. So, I was wondering, since the counting subroutine is still in progress, is there an existing "hack" such as a flag to enable and resubmit the job from the point that it crushed? Of course, I am open to any other alternative.

Thank you very much in advance, Vasilis.

olivertam commented 6 years ago

Hi Vasilis,

One "hack" is to run multiple TEtranscripts, doing a handful of samples at a time. The idea is to generate the counts table, and then combine them at the end to do the differential analysis.

For example, if you have ten files, instead of running a TEtranscripts run with all ten, you can run 5 TEtranscripts, each handling two files (one as a "treatment", and one as a "control"). While the differential analysis is not meaningful, you should be able to get the counts for each sample in the .cntTable file, which you can cut and paste as required to generate your final count table. You can then do differential analysis on that large file.

Hope this is helpful.

Cheers, Oliver

vasilislenis commented 6 years ago

Hi Oliver,

Thank you very much for your reply! Yes, definitely it is helpful! I had something similar in my mind but I forgot to take a look at the .cntTable file.

Many thanks! Vasilis.

wanisajad commented 2 months ago

I am revisiting the multi-threading issue (#17). A workaround was suggested for running TEtranscripts by processing samples in batches. I'm curious to know if there have been any developments or updates since then that support true multi-threading, or should users continue with the batch processing method? Thanks Sajad

olivertam commented 2 months ago

Hi,

We have decided that running TEcount (included with TEtranscripts) on each sample in parallel is more flexible, as it allows user to post-processing the count tables, such as joining samples in different combinations or using differential analysis pipeline other than DESeq2, according to their specific needs.

Thanks.

wanisajad commented 2 months ago

Hi Oliver, Thanks for your response. You mean like each handling two files (one as a "treatment", and one as a "control"). Thanks Sajad

olivertam commented 2 months ago

Hi,

I meant running TEcount on each file separately. E.g.

TEcount --sortByPos --format BAM --mode multi -b sample1.bam --GTF gene_annot.gtf --TE te_annot.gtf --project sample1

You will then get a single count table (sample1.cntTable in this example) for each run. You can then join the count tables together (e.g. with multijoin.pl) to make your final count table for downstream analysis.

Thanks.

wanisajad commented 2 months ago

Thanks Oliver, That is helpfull Sajad

wanisajad commented 3 weeks ago

Hi, Thanks for the advice above. I was able to run the pipelines successfully I have some questions. I would appreciate your response on these.

I have been using STAR with settings optimized for multi-mapping for analysis with TEtranscripts/TEcount, focusing on transposable elements. to my understanding both "multi-mapped reads" and "unmapped: other" reads might originate from repetitive elements, I wanted to clarify how Etranscripts/TEcount processes these reads:

I have noted differences in mapping statistics between my RNA-seq ( read length is 275) and ChIP-seq ( read length is 2100 ) data concerning multi mapped and unmapped reads. Specifically, RNA-seq shows a substantial percentage of reads mapped to multiple loci (approximately 32-36%) and a negligible percentage of reads classified as "unmapped: other." Conversely, ChIP-seq data display lower percentages of reads mapped to multiple loci (about 2-6%) and higher percentages of "unmapped: other" (approximately 7-20%), with H3K9me3 samples showing the highest values.

  1. Could you clarify which category of reads—those mapped to multiple loci or those unmapped: other— or both are likely derived from highly repetitive genomic regions? ( I have posted this question for STAR developer also)
  2. Does TEtranscripts use reads labeled as "% of reads mapped to multiple loci" or those unmapped: other— or both in STAR outputs for its analyses?
  3. Are reads marked as "unmapped: other" in STAR outputs utilized in any part of TEtranscripts analysis, or are they completely disregarded?

Thanks Sajad

On Thu, Apr 18, 2024 at 1:16 PM Oliver Tam @.***> wrote:

Hi,

I meant running TEcount on each file separately. E.g.

TEcount --sortByPos --format BAM --mode multi -b sample1.bam --GTF gene_annot.gtf --TE te_annot.gtf --project sample1

You will then get a single count table (sample1.cntTable in this example) for each run. You can then join the count tables together (e.g. with multijoin.pl https://github.com/agordon/bin_scripts/blob/master/scripts/multijoin.pl) to make your final count table for downstream analysis.

Thanks.

— Reply to this email directly, view it on GitHub https://github.com/mhammell-laboratory/TEtranscripts/issues/17#issuecomment-2064622519, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKJOAZEVVDMNHHDICZOL26DY575XXAVCNFSM4ETBBZJKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBWGQ3DEMRVGE4Q . You are receiving this because you commented.Message ID: @.***>

olivertam commented 3 weeks ago

Hi,

Are you using STAR for both analyses? We typically don't use STAR for ChIP-seq, but I can't see why it might not work.

I don't think the unmapped: other is always related to TE, as they are considered unmapped reads. These could be contamination, non-reference DNA (which could include some TE insertion if they are very different in sequence to those found in the genome, but are could also be things like Epstein-Barr virus) or adapters (if they were not removed prior to mapping). It is therefore not surprising if there are more unmapped: other in the ChIP-seq dataset if they are contributed by non-reference DNA.

Also, increasing the read length would certainly lower multi-mapping, as it's easier to find an unambiguous alignment if given sufficient sequence length to work with.

To directly answer your question:

  1. mapped to multiple loci refers to reads that map to multiple genomic location (up to the limit provided by the --outFilterMultimapNmax parameter). Those that align more than that limit go into the mapped to too many loci category (though they are treated as unmapped for the purposes of alignment). Thus, it would be the mapped to multiple loci that are more likely derived from highly repetitive sequences in the reference genome
  2. The unmapped: other category is not used by TEtranscripts, either because it is not outputted in the BAM file (unless --outSAMunmapped is provided), or it does not have an alignment that could be used for quantification. In contrast, the mapped to multiple loci alignments are used by TEtranscripts
  3. As described above, the unmapped: other are ignored

Thanks.

wanisajad commented 3 weeks ago

Hi, Thank you for your quick response. For RNA-seq, I have been consistently using STAR, and for ChIP-seq, I am comparing outputs using BWA, Bowtie2, and STAR. My parameters for retaining multi mappers --outFilterMultimapNmax 5000 --outMultimapperOrder Random --winAnchorMultimapNmax 5000 --alignIntronMax 1 These settings result in a distinct difference in the percentage of reads mapped to multiple loci between RNA-seq and ChIP-seq datasets. For instance, in RNA-seq, 32.50% of reads mapped to multiple loci and 42.35% in Uniquely mapped reads. As you suggested it is 32.50% of reads mapped to multiple loci which are used by TEtranscript. While as in ChIP-seq, only 5.23% of reads fall into mapped to multiple loci category and 72.12% in Uniquely mapped reads category. Given this context, what aligner do you recommend and what specific settings for ChIP-seq that would optimize the retention and analysis of multi-mapped reads? Thanks Sajad

olivertam commented 3 weeks ago

Hi,

There are biological differences between what you are measuring in ChIP-seq and RNA-seq experiments. Depending on what you're ChIP-ping, you might be enriching for more "unique" regions of the genome (e.g. H3K4me3 at promoters, which might not be as repetitive). Thus, you can't simply compare the "mappability" of two datasets from different experimental approaches. However, you could probably compare and contrast between STAR, Bowtie2 and BWA on how they map/treat multimappers of your ChIP-seq libraries.

If you're setting the same mapping thresholds when aligning ChIP-seq data, then you are allowing the same "retention" of multi-mapped reads. However, there will be inherent differences in mappability of RNA-seq and ChIP-seq that are largely not attributable to the algorithms.

Thanks

wanisajad commented 3 weeks ago

Hi,

Thanks for your explanation.

I have got H3K27ac and H3K9me3 ChIP Seq data. As H3K9me3 is associated with heterochromatin I expect more % of reads mapped to multiple loci in it compared to H3K27ac.

Here is the mapping % of one sample STAR output under multi mapping condition. My concern is the large % of reads unmapped: other in H3K9me3

H3K27ac

H3K9me3

Uniquely mapped reads % |

75.64%

58.87%

% of reads mapped to multiple loci |

4.19%

6.31%

% of reads unmapped: too many mismatches |

9.07%

9.28%

% of reads unmapped: too short |

3.20%

4.65%

% of reads unmapped: other |

7.88%

20.89%

Thanks

On Thu, Jun 6, 2024 at 7:29 PM Oliver Tam @.***> wrote:

Hi,

There are biological differences between what you are measuring in ChIP-seq and RNA-seq experiments. Depending on what you're ChIP-ping, you might be enriching for more "unique" regions of the genome (e.g. H3K4me3 at promoters, which might not be as repetitive). Thus, you can't simply compare the "mappability" of two datasets from different experimental approaches. However, you could probably compare and contrast between STAR, Bowtie2 and BWA on how they map/treat multimappers of your ChIP-seq libraries.

If you're setting the same mapping thresholds when aligning ChIP-seq data, then you are allowing the same "retention" of multi-mapped reads. However, there will be inherent differences in mappability of RNA-seq and ChIP-seq that are largely not attributable to the algorithms.

Thanks

— Reply to this email directly, view it on GitHub https://github.com/mhammell-laboratory/TEtranscripts/issues/17#issuecomment-2153559471, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKJOAZBZWSJ2SQ3V37VKCADZGDWGTAVCNFSM4ETBBZJKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMJVGM2TKOJUG4YQ . You are receiving this because you commented.Message ID: @.***>

olivertam commented 3 weeks ago

Hi,

Thanks for sharing this. Are you aligning just to the canonical chromosomes (i.e. chr1-22, X, Y), or are you also aligning to the scaffolds? We typically prefer to align to the canonical chromosomes and scaffolds (what GENCODE called "primary assembly"). If you're not aligning to the scaffolds, that could be why some of your reads are "unmapped".

If you want to trouble-shoot this further, I would recommend outputting the H3K9me3 unmapped reads (--outReadsUnmapped Fastx) and determine if they are human in origin (e.g. via BLAST/BLAT). You can then compare and contrast that with your H3K27ac unmapped reads to see if there's a population that is enriched. This might give you more insights into what the difference might (e.g. the H3K9me3 ChIP was low-yield, so you mainly seqeunced spike-ins).

Thanks