mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

Alignment Parameters #157

Closed bshim181 closed 7 months ago

bshim181 commented 8 months ago

Hello,

I am attempting TE quantification from paired-end RNA-seq data (150 bp). I was wondering if there is a recommended or default setup for STAR alignment. What would you recommend starting with before moving on to TE quantification using TEtranscripts?

olivertam commented 8 months ago

Hi,

Thank you for your interest in the software. I am reproducing this section from our README:

STAR utilizes two parameters for optimal identification of multi-mappers: --outFilterMultimapNmax and --winAnchorMultimapNmax. The author of STAR recommends that --winAnchorMultimapNmax be set at twice the value used in --outFilterMultimapNmax, but no less than 50. In our study, we used the same number for both parameters (100), and found negligible differences in identifying multi-mappers. Upon further discussion with the author of STAR, we recommend setting the same value for --winAnchorMultimapNmax and --outFilterMultimapNmax, though we highly suggest users test multiple values of --winAnchorMultimapNmax to identify the optimal value for their experiment.
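As a rough illustration of testing several values of --winAnchorMultimapNmax, here is a minimal dry-run sketch. It only prints one STAR command per candidate value and does not run STAR. The index path, FASTQ names, candidate values (100, 150, 200) and output prefixes are placeholders chosen for this example, not values from the README, and the helper names `multimap_args` and `star_cmd` are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: print one STAR command per candidate --winAnchorMultimapNmax.
# GENOME_DIR, R1 and R2 are placeholder paths (assumptions for illustration).
GENOME_DIR="${GENOME_DIR:-star_index}"
R1="${R1:-sample_R1.fastq.gz}"
R2="${R2:-sample_R2.fastq.gz}"

# Emit the two multi-mapper flags.
#   $1 = value for --outFilterMultimapNmax
#   $2 = value for --winAnchorMultimapNmax
multimap_args() {
    printf -- '--outFilterMultimapNmax %s --winAnchorMultimapNmax %s' "$1" "$2"
}

# Print (not execute) a STAR command for one pair of values.
# Each run gets its own output prefix so results are not overwritten.
star_cmd() {
    echo "STAR --genomeDir $GENOME_DIR --readFilesIn $R1 $R2" \
         "--readFilesCommand zcat --outSAMtype BAM SortedByCoordinate" \
         "$(multimap_args "$1" "$2") --outFileNamePrefix anchor_$2_"
}

# Hold --outFilterMultimapNmax at 100 and vary --winAnchorMultimapNmax.
for anchor in 100 150 200; do
    star_cmd 100 "$anchor"
done
```

Once the printed commands look right, they can be piped to `sh` or pasted into a job script, and the resulting alignments compared between settings.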

In addition, we strongly recommend against using the --outSAMmultNmax parameter (i.e. leave it at its default), as this would limit the number of alignments reported in the SAM file, thus removing the benefits of multi-mapping.

We have found that the STAR parameters used for ENCODE RNA-seq mapping work for us (note the addition of --outFilterMultimapNmax and --winAnchorMultimapNmax, which we use to allow TE quantification):

STAR --genomeDir [STAR index] --readFilesIn [R1 FASTQ] [R2 FASTQ] \
    --readFilesCommand zcat --runThreadN 10 --genomeLoad NoSharedMemory \
    --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
    --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 \
    --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 \
    --outSAMheaderHD @HD VN:1.4 SO:coordinate --outSAMunmapped Within \
    --outFilterType BySJout --outSAMattributes NH HI AS NM MD \
    --outSAMtype BAM SortedByCoordinate --sjdbScore 1 --limitBAMsortRAM 30000000000 \
    --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100

If your genome is vastly different (e.g. much larger, or with more repetitive sequences), we recommend a saturation analysis to determine the best multi-mapping parameters (see #151 for more information).
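One possible way to compare runs for such an analysis is sketched below. It is an assumption on my part, not necessarily the procedure described in #151. The sketch pulls the "% of reads mapped to too many loci" figure from each run's STAR Log.final.out, so you can see whether raising the multi-mapper limits still rescues reads. The helper name `too_many_pct` is hypothetical, and the label text is matched as I recall it from STAR's summary log, so check it against your own files.

```shell
#!/bin/sh
# Sketch: report the percentage of reads discarded as "mapped to too many loci"
# for each STAR Log.final.out file given on the command line.

# Print the numeric percentage (without the % sign) from one Log.final.out.
# Lines in that file have the form "<label> |<TAB><value>".
too_many_pct() {
    awk -F'|' '/% of reads mapped to too many loci/ {
        gsub(/[ \t%]/, "", $2)   # strip padding and the percent sign
        print $2
    }' "$1"
}

# One tab-separated line per log: file name, then percentage.
for log in "$@"; do
    printf '%s\t%s\n' "$log" "$(too_many_pct "$log")"
done
```

Saved as a script, this could be called with the summary logs from runs at different settings. If the percentage stops dropping as the limits increase, higher values are unlikely to help.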

Thanks.
