hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
144 stars 26 forks source link

Identifying positive/negative strand mapping from eventalign #136

Closed hengjwj closed 1 year ago

hengjwj commented 1 year ago

Hi Hasindu,

Thanks for creating this quick tool! It's really useful.

I work with direct RNA sequencing data and have a question regarding the strand mapping of f5c eventalign.

From the f5c documentation at https://hasindu2008.github.io/f5c/docs/output#rna-examples, I see that the event_index (column 6) increases(+ strand)/decreases(- strand) as the position (column 2) increases. However, in my data, I see that this is not always the case. For example, here's what I see in a read that mapped to the positive strand:

chr1L 24895420 TCCTG 0 t 2533 120.35 7.934 0.00232 CAGGA 98.38 7.84 2.51 23603 23610 chr1L 24895421 CCTGA 0 t 2532 93.22 2.941 0.01394 TCAGG 89.11 5.09 0.72 23610 23652 chr1L 24895421 CCTGA 0 t 2531 83.96 4.920 0.01461 TCAGG 89.11 5.09 -0.91 23652 23696 chr1L 24895421 CCTGA 0 t 2530 82.05 1.486 0.00266 TCAGG 89.11 5.09 -1.24 23696 23704 chr1L 24895421 CCTGA 0 t 2529 85.41 1.504 0.00863 TCAGG 89.11 5.09 -0.65 23704 23730 chr1L 24895421 CCTGA 0 t 2528 82.16 1.222 0.00232 TCAGG 89.11 5.09 -1.22 23730 23737 chr1L 24895422 CTGAC 0 t 2527 80.49 1.424 0.00631 GTCAG 75.63 2.30 1.89 23737 23756

I also see the opposite in a read that mapped to the minus strand:

chr1S 48668715 TACTT 10 t 535 90.58 2.154 0.00398 TACTT 88.10 2.73 0.75 83092 83104 chr1S 48668716 ACTTT 10 t 536 79.76 1.487 0.00598 ACTTT 82.38 2.42 -0.90 83074 83092 chr1S 48668717 CTTTC 10 t 537 79.78 0.556 0.00398 CTTTC 77.72 1.97 0.86 83062 83074 chr1S 48668718 TTTCA 10 t 538 81.32 1.412 0.00232 TTTCA 78.13 2.07 1.27 83055 83062

What I did:

f5c index -d $dir_fast5/ $dir_fastqbam/$fqCombined
f5c eventalign --reads $dir_fastqbam/$fqCombined --bam $dir_fastqbam/$sortedBam --genome $ref --summary $dir_f5c/$in_eventalign_summary --threads $numcore --scale-events --samples --signal-index  --iop 8 --min-mapq 0 --secondary=yes --rna > $dir_f5c/$in_eventalign

#Get reads mapped to plus or minus strand from BAM
samtools view -F 16 -F 4 -F 2048 $dir_fastqbam/$in_bam | \
awk 'BEGIN {FS=OFS="\t"} NR == FNR { lines_to_include[$1]; next } $2 in lines_to_include {print $1}' - $dir_f5c/$in_eventalign_summary > ${expt}.read_idx_plus
samtools view -f 16 -F 4 -F 2048 $dir_fastqbam/$in_bam | \
awk 'BEGIN {FS=OFS="\t"} NR == FNR { lines_to_include[$1]; next } $2 in lines_to_include {print $1}' - $dir_f5c/$in_eventalign_summary > ${expt}.read_idx_minus

zcat $dir_f5c/$in_eventalign | awk 'BEGIN {FS=OFS="\t"} NR == FNR { id[$1]; next } NR == 1 || $4 in id { print }' ${expt}.read_idx_plus - > $dir_f5c/${expt}.eventalign.reads_plus
zcat $dir_f5c/$in_eventalign | awk 'BEGIN {FS=OFS="\t"} NR == FNR { id[$1]; next } NR == 1 || $4 in id { print }' ${expt}.read_idx_minus - > $dir_f5c/${expt}.eventalign.reads_minus

What's the most reliable way to separate reads mapping to the positive and negative strand?

hasindu2008 commented 1 year ago

Hi

I would want a little for information to debug this issue.

Can you do the following?

f5c eventalign --reads $dir_fastqbam/$fqCombined --bam $dir_fastqbam/$sortedBam --genome $ref --summary $dir_f5c/$in_eventalign_summary --threads $numcore --scale-events --samples --signal-index  --iop 8 --min-mapq 0 --secondary=yes --rna --print-read-names

Then for the example you provided for a read mapped to the positive strand, can you find the read ID and get the mapping information, for instance samtools view $dir_fastqbam/$in_bam | grep <readid> | cut -f 1-5?

I wonder if this example is of a secondary alignment mapped to the negative strand?

Also, you can also try executing f5c with --seconday=no and see if these anomalies are still there.

hengjwj commented 1 year ago

Hey @hasindu2008,

Thanks for getting back so quickly on this. I can't run it right now as I don't have my laptop with me for the next few days.

Nonetheless, I mapped with minimap2 using --secondary=no as follows:

minimap2 -ax splice -uf --secondary=no -k14 $ref $fqCombined | samtools view -@ $numcore -S -u | samtools sort -@ $numcore -o $sortedBam -

I also used samtools to filter for the reads that exclusively have the flag 16 or 0 in the lines above that begin with :

samtools view -F 16 -F 4 -F 2048 $dir_fastqbam/$in_bam samtools view -f 16 -F 4 -F 2048 $dir_fastqbam/$in_bam

Checking column 2 of the output of those last two lines confirm this so I think it's unlikely to be secondary reads. If it helps, I'll run

f5c eventalign --reads $dir_fastqbam/$fqCombined --bam $dir_fastqbam/$sortedBam --genome $ref --summary $dir_f5c/$in_eventalign_summary --threads $numcore --scale-events --samples --signal-index --iop 8 --min-mapq 0 --secondary=yes --rna --print-read-names

Have you seen anything similar before in your experience?

Joel

hasindu2008 commented 1 year ago

Hi Joel,

I went through some of my RNA data and could not yet find a case like yours. Would you be able to share a tiny dataset with a few such anomalous reads? You may use samtools fqidx $dir_fastqbam/$fqCombined <readid1> <reeadid2> ... > a.fastq to extract the basecalled reads and you can use slow5tools get merged.blow5 <readid1> <reeadid2> ... -o a.blow5 to get the raw signals. Note you will have to convert the FAST5 to BLOW5 using slow5tools for this. See https://hasindu2008.github.io/slow5tools/workflows.html. Then if you send the link to the reference, I can do the alignment and inspect carefully what is going on.

hasindu2008 commented 1 year ago

@hengjwj Any luck with extracting a little example dataset?

hengjwj commented 1 year ago

Hi @hasindu2008,

Thanks for looking into this and sorry for the delay (I was travelling!).

I've extracted nine reads here: https://entuedu-my.sharepoint.com/:f:/g/personal/s190075_e_ntu_edu_sg/ErWZaj68LitMtoMRzgBzV7MBZQmB6hhVidpNT2ea16NEAA?e=G8sF53

For the reference, I use the one from UCSC: https://hgdownload.soe.ucsc.edu/goldenPath/xenLae2/bigZips/xenLae2.fa.gz

Joel

hasindu2008 commented 1 year ago

OK. I executed the following.

minimap2 -ax splice -uf --secondary=no -k14 xenLae2.fa hasindu.fastq | samtools sort - -o reads.bam && samtools index  reads.bam
f5c index hasindu.fastq --slow5 hasindu.blow5
./f5c eventalign -r hasindu.fastq -b reads.bam -g xenLae2.fa --slow5 hasindu.blow5 -o a.tsv --print-read-name --signal-index

Her is the tsv file I get. a.tsv.gz

I quickly went through like 5 alignments and to me it seems to be correct. Can you extract the tsv.gz file I attached above and see if you can find an example which is not right?

hengjwj commented 1 year ago

Hi @hasindu2008 ,

I'm confused with the order of the event_index, signal_index, and what determines whether the model_kmer is the same as reference_kmer. Ultimately, I'm trying to assign a strand to each read (is there a reason why the strand column is legacy?).

From my BAM file, I see that the reads are mapped to the following strands:

acbebdd8-bc24-4672-a703-9cad4510dc71 + 5a7560b9-f2f3-4490-964a-dc7316018640 + bb5946e5-76d4-4cff-87fd-200b1dae6fd2 + 3a97ba23-0fc4-4921-9231-9eed250cf223 + 61bb5797-ea51-4eb0-91b4-2846e65668c2 - 3699b5ee-cca8-4bc8-a686-a949759bee49 - e340e80f-5ca6-44a0-b775-ecfc140c3071 - 73433de2-6264-49e0-9508-0f758870e884 + 3b3e12db-b72c-4020-8408-1db38908142e -

================================================================================== From the documentation, I expect the following (please correct me if I'm mistaken or there's a better way of thinking about it!):

Event_index

The event_index goes in increasing order from 5'->3'. For a read mapped to the (+) strand, event_index goes in the same direction as position and increases as position increases. Conversely, for a read mapped to the (-) strand, event_index goes in an opposite direction from position.

Signal_index (start_idx, end_idx)

The signal_index goes in increasing order from 3'->5'. For a read mapped to the (+) strand, signal_index goes in the opposite direction as position, decreasing as position increases. Conversely, for a read mapped to the (-) strand, signal_index goes in the same direction as position.

Reference_kmer and model_kmer

For a read mapped to the (+) strand, reference_kmer and model_kmer should match (with the exception of NNNNN). For a read mapped to the (-) strand, reference_kmer and model_kmer should not match (they should be reverse complements of each other)

==================================================================================

For read "61bb5797-ea51-4eb0-91b4-2846e65668c2", which maps to (-), there seems to be two loci (chr1S, chr3S) that this read maps to. For the mapping to chr1S, event_index and signal_index follow what I described above but not the kmers. The chr3S mapping follows all three points in what I described above.

For read "acbebdd8-bc24-4672-a703-9cad4510dc71", which maps to (+), there seems to be two loci (chr1L, chr3S) that this read maps to also. For the mapping to chr1L, all three points are not followed. However, for the chr3S mapping, all three points are followed.

I don't know why these reads (and probably the others) map to more than one location but should one mapping be taken to be the "correct" one and the "wrong" one discarded?

Joel

hasindu2008 commented 1 year ago

Hi @hengjwj

This thing can be indeed confusing with +, -, RNA directions and all. I can still get confused sometimes.

This strand column was used in Nanopolish when R7 chemistry had these 2D reads (as far as I am aware), where both the DNA strands in a fragment got sequenced one after the other all the time. This strand column had been used in nanopolish for that kind of reads. Now with R9, this is no longer applicable.

Your explanation about the indexes and kmers is totally right unless I missed something when quickly reading.

Now let us go into your examples:

Read 61bb5797-ea51-4eb0-91b4-2846e65668c2:

In the BAM file, there are three alignments:

#readid                                                             flag     chr       pos                 mapq
61bb5797-ea51-4eb0-91b4-2846e65668c2    2048    chr1S   48668672        60
61bb5797-ea51-4eb0-91b4-2846e65668c2    16      chr3S   111893614       60
61bb5797-ea51-4eb0-91b4-2846e65668c2    2048    chr4S   27037178        1

The mapping to chr3S (- strand) is the primary alignment. The mappings to chr1S and chr4S are supplementary alignments that map to + strand.

In f5c event alignment, you will see entries for only chr3S and chr1S. The chr4S has a very low mapq (<20) and thus f5c ignores such low-quality alignments unless you specifically use the -min-mapq option.

The chr3S is a - alignment, whereas the chr1S is a + alignment. However, it seems like you have assumed that both of these are - alignments.

Hint: You may use https://broadinstitute.github.io/picard/explain-flags.htm to decode the flags or convert sam to paf using samtools view reads.bam -h | paftools.js sam2paf -.

chr1s (+): event idx increase, signal_idx decrease, equivalent kmers

chr1S   48668671    TTTGC   61bb5797-ea51-4eb0-91b4-2846e65668c2    t   1   67.53   1.706   0.00332 NNNNN   0.00    0.00    inf 93210   93220
chr1S   48668703    TTCCA   61bb5797-ea51-4eb0-91b4-2846e65668c2    t   2   67.10   1.580   0.01494 TTCCA   69.69   3.37    -0.64   93165   93210
chr1S   48668703    TTCCA   61bb5797-ea51-4eb0-91b4-2846e65668c2    t   3   64.20   1.286   0.00232 TTCCA   69.69   3.37    -1.35   93158   93165
c

chr3s (-): event_idx decrease, signal_index increase, nonmatch kmers

chr3S   111893613   CTGTT   61bb5797-ea51-4eb0-91b4-2846e65668c2    t   2156    83.44   0.897   0.00498 AACAG   85.63   4.66    -0.39   50503   50518
chr3S   111893613   CTGTT   61bb5797-ea51-4eb0-91b4-2846e65668c2    t   2155    82.42   1.311   0.01195 AACAG   85.63   4.66    -0.57   50518   50554
chr3S   111893613   CTGTT   61bb5797-ea51-4eb0-91b4-2846e65668c2    t   2154    76.73   1.107   0.00564 AACAG   85.63   4.66    -1.58   50554   50571

Read acbebdd8-bc24-4672-a703-9cad4510dc71:

acbebdd8-bc24-4672-a703-9cad4510dc71    2064    chr1L   24884018        24
acbebdd8-bc24-4672-a703-9cad4510dc71    0       chr3S   35261772        60
acbebdd8-bc24-4672-a703-9cad4510dc71    2048    chrUn_NW_016708406v1    374     60

Primary alignment is the mapping to chr3S (+ strand). Mapping to chr1L is a supplementary alignment to the - strand, where as chrUn_NW_016708406v1 is a supplementary to the + strand. However, it seems you have assumed all of these are + alignments.

In f5c event alignment, all these are present as all of them have >20 mapq. chrUn_NW_016708406v1 and chr3S are + and the chr1L is -.

chr3S (+):

chr3S   35261772    ATAGG   acbebdd8-bc24-4672-a703-9cad4510dc71    t   1734    72.26   2.059   0.00332 ATAGG   77.34   4.71    -0.97   38586   38596
chr3S   35261772    ATAGG   acbebdd8-bc24-4672-a703-9cad4510dc71    t   1735    77.31   1.508   0.00531 ATAGG   77.34   4.71    -0.01   38570   38586
chr3S   35261772    ATAGG   acbebdd8-bc24-4672-a703-9cad4510dc71    t   1736    74.16   2.563   0.00266 ATAGG   77.34   4.71    -0.61   38562   38570

chrUn_NW_016708406v1(+):

chrUn_NW_016708406v1    373 CACCA   acbebdd8-bc24-4672-a703-9cad4510dc71    t   4   103.35  4.013   0.00299 NNNNN   0.00    0.00    inf 70655   70664
chrUn_NW_016708406v1    376 CACTC   acbebdd8-bc24-4672-a703-9cad4510dc71    t   5   79.25   4.340   0.00930 CACTC   76.88   3.40    0.62    70627   70655
chrUn_NW_016708406v1    376 CACTC   acbebdd8-bc24-4672-a703-9cad4510dc71    t   6   80.70   2.865   0.00465 CACTC   76.88   3.40    1.01    70613   70627
c

chr1L (-):

chr1L   24895415    TTGCT   acbebdd8-bc24-4672-a703-9cad4510dc71    t   2544    129.60  7.972   0.00564 NNNNN   0.00    0.00    inf 23349   23366
chr1L   24895417    GCTCC   acbebdd8-bc24-4672-a703-9cad4510dc71    t   2543    115.54  12.248  0.00232 GGAGC   111.29  7.09    0.54    23366   23373
chr1L   24895417    GCTCC   acbebdd8-bc24-4672-a703-9cad4510dc71    t   2542    124.38  4.735   0.01062 GGAGC   111.29  7.09    1.66    23373   23405

To me, all these alignments are just as intended.

About the best alignment, it is hard to say, depends on the problem you are trying to solve. In these 2 examples reads, different parts of the reads have mapped to different contigs (they are called chimeric reads).

After converting to PAF we see that:

61bb5797-ea51-4eb0-91b4-2846e65668c2    1571    794 1491    +   chr1S   179946965   48668671    48669367
61bb5797-ea51-4eb0-91b4-2846e65668c2    1571    18  688 -   chr3S   120483113   111893613   111894298   
61bb5797-ea51-4eb0-91b4-2846e65668c2    1571    286 947 +   chr4S   121245977   27037177    27039557    

read 61bb5797-ea51-4eb0-91b4-2846e65668c2's 18-688 region has mapped to chr3S, 286-947 to chr4S and 794-1491 to chr1S. If you strictly want one alignment for a read, you should filter out all the supplementary and secondary mappings, only leaving the primaries

hasindu2008 commented 1 year ago

@hengjwj Hope the above explanation shed some light. If not feel free to ask more questions.

hengjwj commented 1 year ago

Hi Hasindu,

Thanks for the comprehensive explanation!

I thought I had removed all secondary and supplementary alignments in these lines:

samtools view -F 16 -F 4 -F 2048 $dir_fastqbam/$in_bam | \
awk 'BEGIN {FS=OFS="\t"} NR == FNR { lines_to_include[$1]; next } $2 in lines_to_include {print $1}' - $dir_f5c/$in_eventalign_summary > ${expt}.read_idx_plus
samtools view -f 16 -F 4 -F 2048 $dir_fastqbam/$in_bam | \
awk 'BEGIN {FS=OFS="\t"} NR == FNR { lines_to_include[$1]; next } $2 in lines_to_include {print $1}' - $dir_f5c/$in_eventalign_summary > ${expt}.read_idx_minus

but I found out that I was matching by read_name and not read_index. Because of this, it also retrieved all supplementary alignments (secondary alignments were removed by Minimap2).

I think I'll have to filter the BAM file (samtools view -F 4 -F 256 -F 2048) and re-run f5c, unless there's a way to filter out the supplementary alignments in f5c eventalign output somehow.

Nonetheless, thanks for clarifying!

Joel

hasindu2008 commented 1 year ago

Currently, f5c has an option to disable/enable secondary mappings. Unfortunately, I have not implemented such an option for supplementary mappings. The recommended approach is to use samtools (like your example samtools commands above) to filter out such reads and running f5c on such a BAM file.

hengjwj commented 1 year ago

Got it! Thanks! Will close this issue.