Closed jofrank1988 closed 3 years ago
Hi,
Thank you for your interest in the software. Your error looks like an issue with the BAM file containing alignments where one end is mapped, but the other is not (we call these "mixed" alignment pairs). This is certainly observed in paired-end data, and can cause some issues with downstream analysis. There are a few things that you can do to address this:
STAR
, which do not output these "mixed" alignments by default--no-mixed
parameter of HISAT2 to exclude these "mixed" alignments.We are in the process of improving the quantification of paired-end libraries to better handle these "mixed" alignments, but as they break the assumptions from paired-end mapping, they are not trivial to address.
Thanks again.
Thank you for your help. I attempted to run TE local after remapping with HISAT2 and the --no-mixed
parameter but still ran into the same issue.
hisat2 --no-mixed --rna-strandness FR --fr -p 8 -x /home/jaf266/project/genomes/human/hg38 -1 out_SRR11294324_1.fastq -2 out_SRR11294324_2.fastq -S SRR11294324.sam; \ samtools view -b SRR11294324.sam | samtools sort - -o SRR11294324.bam; \ samtools index SRR11294324.bam SRR11294324.bam.bai
Reran TE local and got the following error.
INFO @ Thu, 06 Aug 2020 10:43:47:
# ARGUMENTS LIST:
# name = SRR11294324_TElocal_out
# BAM file = SRR11294324.bam
# GTF file = /home/jaf266/project/genomes/human/hg38.ncbiRefSeq.gtf
# TE file = /home/jaf266/project/genomes/human/hg38_rmsk_TElocus.ind
# multi-mapper mode = uniq
# stranded = no
# number of iteration = 100
# Alignments grouped by read ID = False
INFO @ Thu, 06 Aug 2020 10:43:47: Processing annotation files ...
INFO @ Thu, 06 Aug 2020 10:43:47: Building gene index .......
100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
1400000 GTF lines processed.
1500000 GTF lines processed.
1600000 GTF lines processed.
1700000 GTF lines processed.
1800000 GTF lines processed.
1900000 GTF lines processed.
INFO @ Thu, 06 Aug 2020 10:51:11: Done building gene index ......
INFO @ Thu, 06 Aug 2020 10:51:11: File extension indicates a pre-built TE index, attempting to load ......
INFO @ Thu, 06 Aug 2020 10:52:49: TE index loaded ......
INFO @ Thu, 06 Aug 2020 10:52:49:
Reading sample file ...
******NOT COMPLETE*******
If the BAM file is sorted by coordinates, please specify --sortByPos and re-run!
Error: 0
[Exception type: SystemExit, raised in TElocal:285]
I will try with STAR next to see if that helps. Any other suggestions would be appreciated.
Hi,
Would you mind sending me the headers of your BAM file? You should be able to get that using
samtools -H SRR11294324.bam
I also just read through the HISAT2 parameters more carefully, and you might need the --no-discordant
parameter as well. Just a thought.
Also, could you confirm that the --rna-strandedness
option for that library is FR
? I don't think it affects TElocal, but it implies that this is a forward-stranded library, where read 1 corresponds to direction of transcription. Note that Illumina TruSeq stranded libraries are reverse-stranded (read 2 corresponds to direction of transcription), and would use --rna-strandedness RF
.
It would also mean that you should run TElocal
with --stranded forward
as the parameter, rather than --stranded no
by default.
Thanks
I've attached the file generated using your suggested command
[jaf266@ruddle1 alquassim2020]$ samtools view -H SRR11294324.bam > SRR11294324_bam_head
I went back to the paper based on your comment about library strandedness, and it looks like this data was generated using the Illumina TruSeq RNA Exome protocol.
Thank you for your patience and help.
Thanks for the headers
I would definitely suggest trying either STAR
or adding the --no-discordant
parameters to HISAT2
to see if you can get over that error. If not, we might try to do some testing on our end with that SRA accession.
Thank you!
I will add the --no-discordant
parameter and run HISAT because I already have the code set up.
I also wanted to ask: Since this is a a Truseq library and I have to set --rna-strandedness RF
, does this mean the --fr
parameter needs to be changed to --rf
?
Hi,
This can get very confusing (even for me). The --fr
refers to the genomic alignment of reads 1 and 2 relative to each other. The following is from HISAT2
's documentation:
if --fr is specified and there is a candidate paired-end alignment where mate 1 appears upstream of the reverse complement of mate 2 ... that alignment is valid. Also, if mate 2 appears upstream of the reverse complement of mate 1 ... that too is valid. ... Default: --fr (appropriate for Illumina's Paired-end Sequencing Assay).
So in this case, you want --fr
because you're expecting that the "upstream" alignment (read 1 or 2) to be going forward (on the +
strand), while the "downstream" alignment to be going reverse (on the -
strand).
This is different from the --rna-strandedness
, which is now looking at sense/antisense relative to the transcript. In --rna-strandedness RF
, it is now saying that read 1 is reverse (so antisense to the transcript), while read 2 is forward (sense to the transcript). This is determined not just by the strand (+/-
) of the read alignment, but also on the strand of the annotation.
A pretty good explanation of this could be found here.
Note that in this post, "firststrand" corresponds to --rna-strandedness RF
while "secondstrand" corresponds to --rna-strandedness FR
.
TL;DR: use --rna-strandedness RF
, but keep --fr
Hope this is helpful. Please let me know if I should clarify further.
Thank you for all of your help. After following all of your suggestions, it turns out that simply using STAR worked out best.
I am running into the following error with RNAseq data downloaded from SRA. fastq files were generated by downloading with fastqdump. The reads were processed with fastp using default parameters then mapped to the hg38 assembly with HISAT2 and converted to sorted bam files. I've successfully run TElocal on datasets generated in house.
When I run TElocal on this dataset I get the following error:
I also looked at the input fastq files and it seems like the reads are appropriately paired but found that some read information seems to be missing:
@SRR11294324.1.1 1 length=101 TTGACCTGGTCTCCTGGTACATCTTCCCAGTTTTTGGTGACTCCCTTCAGTTTCTCTGAGAGCTCCAGGTTACACTCCTTCTCTGCTTCCACCAGAGCTGC +SRR11294324.1.1 1 length=101 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
I am new to RNAseq analyses so I am not really sure how to troubleshoot this problem. Any help would be appreciated.