mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

If the BAM file is sorted by coordinates,please specify --sortByPos and re-run! #161

Closed SPRINNNGY closed 6 months ago

SPRINNNGY commented 7 months ago

Hi, I'm having the same problem, always reporting an error about using the sortByPos parameter. I have made sure that my bam file is sorted and also using samtools sort -n -@ 32 SRR17551591_sorted_rmdup_paired.bam | samtools fixmate -m - - | samtools sort -o SRR17551591_sorted_ rmdup_paired_fixed.bam -@ 32 command to fix the problem that the read1 and read2 counts are not the same. However, I ended up using TEtranscripts and still reported an error about using the sortByPos parameter. Also, my bam file is compared using bwa, do I have to use STAR? bwa mem -t 32 /public/home/aci97gjgm5/INDEX/bwaindex/hg19.fa fastqc/SRR17551590_1.fastq fastqc/SRR17551590_2.fastq -o bwa/SRR17551590.sam The flagstat report for one of the bams after I fixed the bam is shown below: 9391344 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 1957338 + 0 supplementary 0 + 0 duplicates 9391344 + 0 mapped (100.00% : N/A) 6996408 + 0 paired in sequencing 3498204 + 0 read1 3498204 + 0 read2 6996408 + 0 properly paired (100.00% : N/A) 6996408 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

olivertam commented 7 months ago

Hi,

Thank you for your interest in the software. We do recommend STAR, as it is designed more for RNA-seq (and can handle multimappers better), whereas BWA tends to report only one of the possible alignment. Please note that although you have "fixed" the pairs, there are still more mapped reads (9391344) than properly paired reads (6996408). This suggests that there are still alignments that are not considered "properly paired", which will throw out these errors (again, I need to read up on how flagstat make these counts).

Thanks.

SPRINNNGY commented 7 months ago

Hi, I used STAR and it did run. But after sixteen hours of running, it reports an error:

INFO  @ Fri, 10 Nov 2023 03:10:20: Finished processing sample files 
INFO  @ Fri, 10 Nov 2023 03:10:20: Generating counts table 
INFO  @ Fri, 10 Nov 2023 03:10:20: Performing differential analysis ...

Error in running differential analysis! 
Error: [Errno 2] No such file or directory 
[Exception type: OSError, raised in subprocess.py:1047] 
olivertam commented 7 months ago

Hi,

What was your command line? Could you provide the list of output files from that run?

Thanks.

SPRINNNGY commented 7 months ago

TEtranscripts -t CIRI/STAR/SRR17551591Aligned.sortedByCoord.out.bam CIRI/STAR/SRR17551593Aligned.sortedByCoord.out.bam CIRI/STAR/SRR17551595Aligned.sortedByCoord.out.bam -c CIRI/STAR/SRR17551590Aligned.sortedByCoord.out.bam CIRI/STAR/SRR17551592Aligned.sortedByCoord.out.bam CIRI/STAR/SRR17551594Aligned.sortedByCoord.out.bam --TE hg19_rmsk_TE.gtf --format BAM --mode multi --GTF /public/home/aci97gjgm5/GC/hg19.refGene.gtf --project TEtranscripts_out --outdir TEtranscripts_out --sortByPos

olivertam commented 7 months ago

Hi,

Were there any output from that command? E.g. TEtranscripts_out.cntTable or TEtranscripts_DESeq.R?

Thanks.

SPRINNNGY commented 7 months ago

1699792416042 TEtranscripts_out_DESeq2.R: data <- read.table("TEtranscripts_out.cntTable",header=T,row.names=1) groups <- factor(c(rep("TGroup",3),rep("CGroup",3))) min_read <- 1 data <- data[apply(data,1,function(x){max(x)}) > min_read,] sampleInfo <- data.frame(groups,row.names=colnames(data)) suppressPackageStartupMessages(library(DESeq2)) dds <- DESeqDataSetFromMatrix(countData = data, colData = sampleInfo, design = ~ groups) dds$groups = relevel(dds$groups,ref="CGroup") dds <- DESeq(dds) res <- results(dds) write.table(res, file="TEtranscripts_out_gene_TE_analysis.txt", sep="\t",quote=F) resSig <- res[(!is.na(res$padj) & (res$padj < 0.050000) & (abs(res$log2FoldChange)> 0.000000)), ] write.table(resSig, file="TEtranscripts_out_sigdiff_gene_TE.txt",sep="\t", quote=F)

TEtranscripts_out.cntTable: 1699792533349

olivertam commented 7 months ago

Hi,

Could you try running the following? $ Rscript TEtranscripts_out_DESeq2.R Thanks.

Cheers, Oliver

SPRINNNGY commented 7 months ago

I tried and I found out that it's because I don't have R loaded in my environment yet and I don't have deseq2 installed. May I ask if I run this TEtranscripts_out_DESeq2.R script after installing deseq2, will the whole TEtranscripts process be over? Do I still need to rerun my TEtranscripts commands? Thanks.

olivertam commented 7 months ago

No worries. Yes, you can just run the R script and it will finish the last portion (differential analysis). You do not need to re-run TEtranscripts.

Thanks.

SPRINNNGY commented 7 months ago

Hi, after I run the R script, there is no Alu flanking the circular RNA in my sample, only a very few non-Alu TEs and a large number of genes, so strange.Here's my two txt.Thanks. TEtranscripts_out_gene_TE_analysis.txt TEtranscripts_out_sigdiff_gene_TE.txt

olivertam commented 7 months ago

Hi,

I'm not sure what your question is. Are you working with RNAseq data?

Thanks.

SPRINNNGY commented 7 months ago

yeah,here's my datahttps://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA796293&o=acc_s%3Aa I used the first six paired data.

olivertam commented 7 months ago

Hi,

I'm not sure that TEtranscripts is suited to circular RNA analysis, as it isn't trying to annotate portions/flanking regions of a read, but assigning the read to the most likely annotation. Thus, it is probably identifying reads to be largely gene-exon derived, and thus treating them as genic.

Thanks.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days