hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
144 stars 26 forks source link

Inconsistent strand output for direct RNA-seq reads #98

Closed Kevinzjy closed 2 years ago

Kevinzjy commented 2 years ago

Hi, thanks for developing such an amazing tool, it largely speeds up our analysis for analyzing signal-level data.

However, I have noticed that the output kmer model for direct RNA-seq reads is in the reverse complementary order. Is it designed on purpose? Considering that ONT direct RNA-seq method reads the RNA strand from 3' to 5', and signals should be reverted to get the correct RNA sequence. So, maybe it's more appropriate to convert these kmers to the 5'-3' order.

e.g. f5c will generate the reverse complementary model kmer:

ENST00000457540.1       3       AATCC   27      t       996     141.37  5.017   0.01295 GGATT   129.83  5.10    1.79    11580   11619
ENST00000457540.1       4       ATCCC   27      t       995     128.06  5.748   0.01029 GGGAT   124.02  3.17    1.00    11619   11650
ENST00000457540.1       4       ATCCC   27      t       994     112.33  3.297   0.00232 GGGAT   124.02  3.17    -2.91   11650   11657
ENST00000457540.1       4       ATCCC   27      t       993     121.18  4.649   0.00266 GGGAT   124.02  3.17    -0.71   11657   11665
ENST00000457540.1       4       ATCCC   27      t       992     127.13  7.856   0.00697 GGGAT   124.02  3.17    0.77    11665   11686

However, nanopolish will convert these kmers to the correct order:

ENST00000393000.3       33      CGCGG   13      t       2       68.39   0.887   0.00232 CGCGG   92.98   8.39    -2.54   33531   33538
ENST00000393000.3       33      CGCGG   13      t       3       76.88   1.882   0.01295 CGCGG   92.98   8.39    -1.66   33492   33531
ENST00000393000.3       34      GCGGC   13      t       4       91.66   5.726   0.00963 GCGGC   90.29   4.98    0.24    33463   33492
ENST00000393000.3       34      GCGGC   13      t       5       83.71   4.941   0.00332 GCGGC   90.29   4.98    -1.15   33453   33463
ENST00000393000.3       34      GCGGC   13      t       6       92.16   3.637   0.00365 GCGGC   90.29   4.98    0.33    33442   33453
ENST00000393000.3       34      GCGGC   13      t       7       95.99   2.488   0.00664 GCGGC   90.29   4.98    0.99    33422   33442
ENST00000393000.3       35      CGGCG   13      t       8       112.02  5.079   0.00498 CGGCG   107.83  5.31    0.68    33407   33422
Kevinzjy commented 2 years ago

Sorry, it's a read mapped to the reverse strand of the transcript, the kmer signal for most kmer is fine. Thanks !

hasindu2008 commented 2 years ago

Some time ago I checked the output of f5c with nanopolish for RNA and it matched exactly. Are there any differences that you observe now? If so please provide any such k-mers that are different so that I can investigate.

Kevinzjy commented 2 years ago

Some time ago I checked the output of f5c with nanopolish for RNA and it matched exactly. Are there any differences that you observe now? If so please provide any such k-mers that are different so that I can investigate.

The f5c result is correct because I forgot to check the mapping strand of RNA reads. Thanks for the prompt reply!