Closed hengjwj closed 1 year ago
Hi
I would want a little for information to debug this issue.
Can you do the following?
f5c eventalign --reads $dir_fastqbam/$fqCombined --bam $dir_fastqbam/$sortedBam --genome $ref --summary $dir_f5c/$in_eventalign_summary --threads $numcore --scale-events --samples --signal-index --iop 8 --min-mapq 0 --secondary=yes --rna --print-read-names
Then for the example you provided for a read mapped to the positive strand, can you find the read ID and get the mapping information, for instance samtools view $dir_fastqbam/$in_bam | grep <readid> | cut -f 1-5
?
I wonder if this example is of a secondary alignment mapped to the negative strand?
Also, you can also try executing f5c with --seconday=no
and see if these anomalies are still there.
Hey @hasindu2008,
Thanks for getting back so quickly on this. I can't run it right now as I don't have my laptop with me for the next few days.
Nonetheless, I mapped with minimap2 using --secondary=no as follows:
minimap2 -ax splice -uf --secondary=no -k14 $ref $fqCombined | samtools view -@ $numcore -S -u | samtools sort -@ $numcore -o $sortedBam -
I also used samtools to filter for the reads that exclusively have the flag 16 or 0 in the lines above that begin with :
samtools view -F 16 -F 4 -F 2048 $dir_fastqbam/$in_bam samtools view -f 16 -F 4 -F 2048 $dir_fastqbam/$in_bam
Checking column 2 of the output of those last two lines confirm this so I think it's unlikely to be secondary reads. If it helps, I'll run
f5c eventalign --reads $dir_fastqbam/$fqCombined --bam $dir_fastqbam/$sortedBam --genome $ref --summary $dir_f5c/$in_eventalign_summary --threads $numcore --scale-events --samples --signal-index --iop 8 --min-mapq 0 --secondary=yes --rna --print-read-names
Have you seen anything similar before in your experience?
Joel
Hi Joel,
I went through some of my RNA data and could not yet find a case like yours. Would you be able to share a tiny dataset with a few such anomalous reads?
You may use samtools fqidx $dir_fastqbam/$fqCombined <readid1> <reeadid2> ... > a.fastq
to extract the basecalled reads and you can use slow5tools get merged.blow5 <readid1> <reeadid2> ... -o a.blow5
to get the raw signals.
Note you will have to convert the FAST5 to BLOW5 using slow5tools for this. See https://hasindu2008.github.io/slow5tools/workflows.html.
Then if you send the link to the reference, I can do the alignment and inspect carefully what is going on.
@hengjwj Any luck with extracting a little example dataset?
Hi @hasindu2008,
Thanks for looking into this and sorry for the delay (I was travelling!).
I've extracted nine reads here: https://entuedu-my.sharepoint.com/:f:/g/personal/s190075_e_ntu_edu_sg/ErWZaj68LitMtoMRzgBzV7MBZQmB6hhVidpNT2ea16NEAA?e=G8sF53
For the reference, I use the one from UCSC: https://hgdownload.soe.ucsc.edu/goldenPath/xenLae2/bigZips/xenLae2.fa.gz
Joel
OK. I executed the following.
minimap2 -ax splice -uf --secondary=no -k14 xenLae2.fa hasindu.fastq | samtools sort - -o reads.bam && samtools index reads.bam
f5c index hasindu.fastq --slow5 hasindu.blow5
./f5c eventalign -r hasindu.fastq -b reads.bam -g xenLae2.fa --slow5 hasindu.blow5 -o a.tsv --print-read-name --signal-index
Her is the tsv file I get. a.tsv.gz
I quickly went through like 5 alignments and to me it seems to be correct. Can you extract the tsv.gz file I attached above and see if you can find an example which is not right?
I'm confused with the order of the event_index, signal_index, and what determines whether the model_kmer is the same as reference_kmer. Ultimately, I'm trying to assign a strand to each read (is there a reason why the strand column is legacy?).
From my BAM file, I see that the reads are mapped to the following strands:
acbebdd8-bc24-4672-a703-9cad4510dc71 + 5a7560b9-f2f3-4490-964a-dc7316018640 + bb5946e5-76d4-4cff-87fd-200b1dae6fd2 + 3a97ba23-0fc4-4921-9231-9eed250cf223 + 61bb5797-ea51-4eb0-91b4-2846e65668c2 - 3699b5ee-cca8-4bc8-a686-a949759bee49 - e340e80f-5ca6-44a0-b775-ecfc140c3071 - 73433de2-6264-49e0-9508-0f758870e884 + 3b3e12db-b72c-4020-8408-1db38908142e -
================================================================================== From the documentation, I expect the following (please correct me if I'm mistaken or there's a better way of thinking about it!):
The event_index goes in increasing order from 5'->3'. For a read mapped to the (+) strand, event_index goes in the same direction as position and increases as position increases. Conversely, for a read mapped to the (-) strand, event_index goes in an opposite direction from position.
The signal_index goes in increasing order from 3'->5'. For a read mapped to the (+) strand, signal_index goes in the opposite direction as position, decreasing as position increases. Conversely, for a read mapped to the (-) strand, signal_index goes in the same direction as position.
For a read mapped to the (+) strand, reference_kmer and model_kmer should match (with the exception of NNNNN). For a read mapped to the (-) strand, reference_kmer and model_kmer should not match (they should be reverse complements of each other)
==================================================================================
For read "61bb5797-ea51-4eb0-91b4-2846e65668c2", which maps to (-), there seems to be two loci (chr1S, chr3S) that this read maps to. For the mapping to chr1S, event_index and signal_index follow what I described above but not the kmers. The chr3S mapping follows all three points in what I described above.
For read "acbebdd8-bc24-4672-a703-9cad4510dc71", which maps to (+), there seems to be two loci (chr1L, chr3S) that this read maps to also. For the mapping to chr1L, all three points are not followed. However, for the chr3S mapping, all three points are followed.
I don't know why these reads (and probably the others) map to more than one location but should one mapping be taken to be the "correct" one and the "wrong" one discarded?
Joel
Hi @hengjwj
This thing can be indeed confusing with +, -, RNA directions and all. I can still get confused sometimes.
This strand column was used in Nanopolish when R7 chemistry had these 2D reads (as far as I am aware), where both the DNA strands in a fragment got sequenced one after the other all the time. This strand column had been used in nanopolish for that kind of reads. Now with R9, this is no longer applicable.
Your explanation about the indexes and kmers is totally right unless I missed something when quickly reading.
Now let us go into your examples:
Read 61bb5797-ea51-4eb0-91b4-2846e65668c2:
In the BAM file, there are three alignments:
#readid flag chr pos mapq
61bb5797-ea51-4eb0-91b4-2846e65668c2 2048 chr1S 48668672 60
61bb5797-ea51-4eb0-91b4-2846e65668c2 16 chr3S 111893614 60
61bb5797-ea51-4eb0-91b4-2846e65668c2 2048 chr4S 27037178 1
The mapping to chr3S (- strand) is the primary alignment. The mappings to chr1S and chr4S are supplementary alignments that map to + strand.
In f5c event alignment, you will see entries for only chr3S and chr1S. The chr4S has a very low mapq (<20) and thus f5c ignores such low-quality alignments unless you specifically use the -min-mapq option.
The chr3S is a - alignment, whereas the chr1S is a + alignment. However, it seems like you have assumed that both of these are - alignments.
Hint: You may use https://broadinstitute.github.io/picard/explain-flags.htm to decode the flags or convert sam to paf using samtools view reads.bam -h | paftools.js sam2paf -
.
chr1s (+): event idx increase, signal_idx decrease, equivalent kmers
chr1S 48668671 TTTGC 61bb5797-ea51-4eb0-91b4-2846e65668c2 t 1 67.53 1.706 0.00332 NNNNN 0.00 0.00 inf 93210 93220
chr1S 48668703 TTCCA 61bb5797-ea51-4eb0-91b4-2846e65668c2 t 2 67.10 1.580 0.01494 TTCCA 69.69 3.37 -0.64 93165 93210
chr1S 48668703 TTCCA 61bb5797-ea51-4eb0-91b4-2846e65668c2 t 3 64.20 1.286 0.00232 TTCCA 69.69 3.37 -1.35 93158 93165
c
chr3s (-): event_idx decrease, signal_index increase, nonmatch kmers
chr3S 111893613 CTGTT 61bb5797-ea51-4eb0-91b4-2846e65668c2 t 2156 83.44 0.897 0.00498 AACAG 85.63 4.66 -0.39 50503 50518
chr3S 111893613 CTGTT 61bb5797-ea51-4eb0-91b4-2846e65668c2 t 2155 82.42 1.311 0.01195 AACAG 85.63 4.66 -0.57 50518 50554
chr3S 111893613 CTGTT 61bb5797-ea51-4eb0-91b4-2846e65668c2 t 2154 76.73 1.107 0.00564 AACAG 85.63 4.66 -1.58 50554 50571
Read acbebdd8-bc24-4672-a703-9cad4510dc71:
acbebdd8-bc24-4672-a703-9cad4510dc71 2064 chr1L 24884018 24
acbebdd8-bc24-4672-a703-9cad4510dc71 0 chr3S 35261772 60
acbebdd8-bc24-4672-a703-9cad4510dc71 2048 chrUn_NW_016708406v1 374 60
Primary alignment is the mapping to chr3S (+ strand). Mapping to chr1L is a supplementary alignment to the - strand, where as chrUn_NW_016708406v1 is a supplementary to the + strand. However, it seems you have assumed all of these are + alignments.
In f5c event alignment, all these are present as all of them have >20 mapq. chrUn_NW_016708406v1 and chr3S are + and the chr1L is -.
chr3S (+):
chr3S 35261772 ATAGG acbebdd8-bc24-4672-a703-9cad4510dc71 t 1734 72.26 2.059 0.00332 ATAGG 77.34 4.71 -0.97 38586 38596
chr3S 35261772 ATAGG acbebdd8-bc24-4672-a703-9cad4510dc71 t 1735 77.31 1.508 0.00531 ATAGG 77.34 4.71 -0.01 38570 38586
chr3S 35261772 ATAGG acbebdd8-bc24-4672-a703-9cad4510dc71 t 1736 74.16 2.563 0.00266 ATAGG 77.34 4.71 -0.61 38562 38570
chrUn_NW_016708406v1(+):
chrUn_NW_016708406v1 373 CACCA acbebdd8-bc24-4672-a703-9cad4510dc71 t 4 103.35 4.013 0.00299 NNNNN 0.00 0.00 inf 70655 70664
chrUn_NW_016708406v1 376 CACTC acbebdd8-bc24-4672-a703-9cad4510dc71 t 5 79.25 4.340 0.00930 CACTC 76.88 3.40 0.62 70627 70655
chrUn_NW_016708406v1 376 CACTC acbebdd8-bc24-4672-a703-9cad4510dc71 t 6 80.70 2.865 0.00465 CACTC 76.88 3.40 1.01 70613 70627
c
chr1L (-):
chr1L 24895415 TTGCT acbebdd8-bc24-4672-a703-9cad4510dc71 t 2544 129.60 7.972 0.00564 NNNNN 0.00 0.00 inf 23349 23366
chr1L 24895417 GCTCC acbebdd8-bc24-4672-a703-9cad4510dc71 t 2543 115.54 12.248 0.00232 GGAGC 111.29 7.09 0.54 23366 23373
chr1L 24895417 GCTCC acbebdd8-bc24-4672-a703-9cad4510dc71 t 2542 124.38 4.735 0.01062 GGAGC 111.29 7.09 1.66 23373 23405
To me, all these alignments are just as intended.
About the best alignment, it is hard to say, depends on the problem you are trying to solve. In these 2 examples reads, different parts of the reads have mapped to different contigs (they are called chimeric reads).
After converting to PAF we see that:
61bb5797-ea51-4eb0-91b4-2846e65668c2 1571 794 1491 + chr1S 179946965 48668671 48669367
61bb5797-ea51-4eb0-91b4-2846e65668c2 1571 18 688 - chr3S 120483113 111893613 111894298
61bb5797-ea51-4eb0-91b4-2846e65668c2 1571 286 947 + chr4S 121245977 27037177 27039557
read 61bb5797-ea51-4eb0-91b4-2846e65668c2's 18-688 region has mapped to chr3S, 286-947 to chr4S and 794-1491 to chr1S. If you strictly want one alignment for a read, you should filter out all the supplementary and secondary mappings, only leaving the primaries
@hengjwj Hope the above explanation shed some light. If not feel free to ask more questions.
Hi Hasindu,
Thanks for the comprehensive explanation!
I thought I had removed all secondary and supplementary alignments in these lines:
samtools view -F 16 -F 4 -F 2048 $dir_fastqbam/$in_bam | \
awk 'BEGIN {FS=OFS="\t"} NR == FNR { lines_to_include[$1]; next } $2 in lines_to_include {print $1}' - $dir_f5c/$in_eventalign_summary > ${expt}.read_idx_plus
samtools view -f 16 -F 4 -F 2048 $dir_fastqbam/$in_bam | \
awk 'BEGIN {FS=OFS="\t"} NR == FNR { lines_to_include[$1]; next } $2 in lines_to_include {print $1}' - $dir_f5c/$in_eventalign_summary > ${expt}.read_idx_minus
but I found out that I was matching by read_name and not read_index. Because of this, it also retrieved all supplementary alignments (secondary alignments were removed by Minimap2).
I think I'll have to filter the BAM file (samtools view -F 4 -F 256 -F 2048) and re-run f5c, unless there's a way to filter out the supplementary alignments in f5c eventalign output somehow.
Nonetheless, thanks for clarifying!
Joel
Currently, f5c has an option to disable/enable secondary mappings. Unfortunately, I have not implemented such an option for supplementary mappings. The recommended approach is to use samtools (like your example samtools commands above) to filter out such reads and running f5c on such a BAM file.
Got it! Thanks! Will close this issue.
Hi Hasindu,
Thanks for creating this quick tool! It's really useful.
I work with direct RNA sequencing data and have a question regarding the strand mapping of f5c eventalign.
From the f5c documentation at https://hasindu2008.github.io/f5c/docs/output#rna-examples, I see that the event_index (column 6) increases(+ strand)/decreases(- strand) as the position (column 2) increases. However, in my data, I see that this is not always the case. For example, here's what I see in a read that mapped to the positive strand:
I also see the opposite in a read that mapped to the minus strand:
What I did:
What's the most reliable way to separate reads mapping to the positive and negative strand?