mhammell-laboratory / TElocal

A package for quantifying transposable elements at a locus level for RNAseq datasets.
GNU General Public License v3.0
21 stars 8 forks source link

error at line 285 #8

Closed jofrank1988 closed 3 years ago

jofrank1988 commented 3 years ago

I am running into the following error with RNAseq data downloaded from SRA. fastq files were generated by downloading with fastqdump. The reads were processed with fastp using default parameters then mapped to the hg38 assembly with HISAT2 and converted to sorted bam files. I've successfully run TElocal on datasets generated in house.

When I run TElocal on this dataset I get the following error:

[INFO  @ Wed, 05 Aug 2020 15:01:45: 
# ARGUMENTS LIST:
# name = SRR11294324_TElocal_out
# BAM file = SRR11294324.bam
# GTF file = /home/jaf266/project/genomes/human/hg38.ncbiRefSeq.gtf 
# TE file = /home/jaf266/project/genomes/human/hg38_rmsk_TElocus.ind 
# multi-mapper mode = uniq 
# stranded = no 
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Wed, 05 Aug 2020 15:01:45: Processing annotation files ...

INFO  @ Wed, 05 Aug 2020 15:01:45: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
1400000 GTF lines processed.
1500000 GTF lines processed.
1600000 GTF lines processed.
1700000 GTF lines processed.
1800000 GTF lines processed.
1900000 GTF lines processed.
INFO  @ Wed, 05 Aug 2020 15:09:13: Done building gene index ...... 

INFO  @ Wed, 05 Aug 2020 15:09:13: File extension indicates a pre-built TE index, attempting to load ...... 

INFO  @ Wed, 05 Aug 2020 15:10:43: TE index loaded ...... 

INFO  @ Wed, 05 Aug 2020 15:10:43: 
Reading sample file ... 

******NOT COMPLETE*******
If the BAM file is sorted by coordinates, please specify --sortByPos and re-run!
Error: 0
[Exception type: SystemExit, raised in TElocal:285]]

I also looked at the input fastq files and it seems like the reads are appropriately paired but found that some read information seems to be missing: @SRR11294324.1.1 1 length=101 TTGACCTGGTCTCCTGGTACATCTTCCCAGTTTTTGGTGACTCCCTTCAGTTTCTCTGAGAGCTCCAGGTTACACTCCTTCTCTGCTTCCACCAGAGCTGC +SRR11294324.1.1 1 length=101 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

@SRR11294324.1.2 1 length=99
ATGAGCTCTTGGAGAATAAGATGCAGCTTCTGCAGGAGGAATCCAGGCTAGCAAAGAATGAAGCTGCGCGGATGGCAGCTCTGGTGGAAGCAGAGAAGG
+SRR11294324.1.2 1 length=99
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

I am new to RNAseq analyses so I am not really sure how to troubleshoot this problem. Any help would be appreciated.

olivertam commented 3 years ago

Hi,

Thank you for your interest in the software. Your error looks like an issue with the BAM file containing alignments where one end is mapped, but the other is not (we call these "mixed" alignment pairs). This is certainly observed in paired-end data, and can cause some issues with downstream analysis. There are a few things that you can do to address this:

  1. Use STAR, which do not output these "mixed" alignments by default
  2. Use the --no-mixed parameter of HISAT2 to exclude these "mixed" alignments.

We are in the process of improving the quantification of paired-end libraries to better handle these "mixed" alignments, but as they break the assumptions from paired-end mapping, they are not trivial to address.

Thanks again.

jofrank1988 commented 3 years ago

Thank you for your help. I attempted to run TE local after remapping with HISAT2 and the --no-mixed parameter but still ran into the same issue.

hisat2 --no-mixed --rna-strandness FR --fr -p 8 -x /home/jaf266/project/genomes/human/hg38 -1 out_SRR11294324_1.fastq -2 out_SRR11294324_2.fastq -S SRR11294324.sam; \ samtools view -b SRR11294324.sam | samtools sort - -o SRR11294324.bam; \ samtools index SRR11294324.bam SRR11294324.bam.bai

Reran TE local and got the following error.

INFO  @ Thu, 06 Aug 2020 10:43:47: 
# ARGUMENTS LIST:
# name = SRR11294324_TElocal_out
# BAM file = SRR11294324.bam
# GTF file = /home/jaf266/project/genomes/human/hg38.ncbiRefSeq.gtf 
# TE file = /home/jaf266/project/genomes/human/hg38_rmsk_TElocus.ind 
# multi-mapper mode = uniq 
# stranded = no 
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Thu, 06 Aug 2020 10:43:47: Processing annotation files ...

INFO  @ Thu, 06 Aug 2020 10:43:47: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
1400000 GTF lines processed.
1500000 GTF lines processed.
1600000 GTF lines processed.
1700000 GTF lines processed.
1800000 GTF lines processed.
1900000 GTF lines processed.
INFO  @ Thu, 06 Aug 2020 10:51:11: Done building gene index ...... 

INFO  @ Thu, 06 Aug 2020 10:51:11: File extension indicates a pre-built TE index, attempting to load ...... 

INFO  @ Thu, 06 Aug 2020 10:52:49: TE index loaded ...... 

INFO  @ Thu, 06 Aug 2020 10:52:49: 
Reading sample file ... 

******NOT COMPLETE*******
If the BAM file is sorted by coordinates, please specify --sortByPos and re-run!
Error: 0
[Exception type: SystemExit, raised in TElocal:285]

I will try with STAR next to see if that helps. Any other suggestions would be appreciated.

olivertam commented 3 years ago

Hi,

Would you mind sending me the headers of your BAM file? You should be able to get that using samtools -H SRR11294324.bam

I also just read through the HISAT2 parameters more carefully, and you might need the --no-discordant parameter as well. Just a thought.

Also, could you confirm that the --rna-strandedness option for that library is FR? I don't think it affects TElocal, but it implies that this is a forward-stranded library, where read 1 corresponds to direction of transcription. Note that Illumina TruSeq stranded libraries are reverse-stranded (read 2 corresponds to direction of transcription), and would use --rna-strandedness RF.

It would also mean that you should run TElocal with --stranded forward as the parameter, rather than --stranded no by default.

Thanks

jofrank1988 commented 3 years ago

I've attached the file generated using your suggested command

[jaf266@ruddle1 alquassim2020]$ samtools view -H SRR11294324.bam > SRR11294324_bam_head

SRR11294324_bam_head.txt

I went back to the paper based on your comment about library strandedness, and it looks like this data was generated using the Illumina TruSeq RNA Exome protocol.

Thank you for your patience and help.

olivertam commented 3 years ago

Thanks for the headers I would definitely suggest trying either STAR or adding the --no-discordant parameters to HISAT2 to see if you can get over that error. If not, we might try to do some testing on our end with that SRA accession.

jofrank1988 commented 3 years ago

Thank you!

I will add the --no-discordant parameter and run HISAT because I already have the code set up.

I also wanted to ask: Since this is a a Truseq library and I have to set --rna-strandedness RF, does this mean the --fr parameter needs to be changed to --rf?

olivertam commented 3 years ago

Hi,

This can get very confusing (even for me). The --fr refers to the genomic alignment of reads 1 and 2 relative to each other. The following is from HISAT2's documentation:

if --fr is specified and there is a candidate paired-end alignment where mate 1 appears upstream of the reverse complement of mate 2 ... that alignment is valid. Also, if mate 2 appears upstream of the reverse complement of mate 1 ... that too is valid. ... Default: --fr (appropriate for Illumina's Paired-end Sequencing Assay).

So in this case, you want --fr because you're expecting that the "upstream" alignment (read 1 or 2) to be going forward (on the + strand), while the "downstream" alignment to be going reverse (on the - strand).

This is different from the --rna-strandedness, which is now looking at sense/antisense relative to the transcript. In --rna-strandedness RF, it is now saying that read 1 is reverse (so antisense to the transcript), while read 2 is forward (sense to the transcript). This is determined not just by the strand (+/-) of the read alignment, but also on the strand of the annotation.

A pretty good explanation of this could be found here. Note that in this post, "firststrand" corresponds to --rna-strandedness RF while "secondstrand" corresponds to --rna-strandedness FR.

TL;DR: use --rna-strandedness RF, but keep --fr

Hope this is helpful. Please let me know if I should clarify further.

jofrank1988 commented 3 years ago

Thank you for all of your help. After following all of your suggestions, it turns out that simply using STAR worked out best.