hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
137 stars 28 forks source link

Can f5c eventalign output base calling sequences ? #170

Open kir1to455 opened 1 month ago

kir1to455 commented 1 month ago

Hi @hasindu2008 , Thank you for developing f5c ! I wonder know whether f5c eventalign can output basecalling kmer? Generally, f5c eventalign will output the following header files.

contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx samples transcript 9 TTATA 0 t 27 93.44 2.558 0.00465 TTATA 93.29 2.63 0.05 143742 143756 94.4323,94.7009,93.4921,94.5666,96.3125,94.7009,96.1782,92.9549,93.6264,94.8352,91.4775,94.8352,89.4629,86.6424

Whether a column basecalling kmer can be added at the end? Like this contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx samples basecalling_kmer transcript 9 TTATA 0 t 27 93.44 2.558 0.00465 TTATA 93.29 2.63 0.05 143742 143756 94.4323,94.7009,93.4921,94.5666,96.3125,94.7009,96.1782,92.9549,93.6264,94.8352,91.4775,94.8352,89.4629,86.6424 TTATA

I think the reason for this is that basecalling error occurs in fastq files. (Like A-->T mutation, or deletion and so on) Take a example. Here is my eventalign file. image Here is the tsv file I generated with sam2tsv. image

In position 39, the reference kmer is CTTTC. But its sequence in fastq file is TTTTC.

Best wishes, Kirito

hasindu2008 commented 1 month ago

Hello, I can add an option like --print-read-kmer that will give a column called read_kmer? The reason I do not want to add by default is to prevent this tsv file from getting too large - it is large enough currently. Will this do?

kir1to455 commented 1 month ago

Of course! In addition, --print-read-kmer can be chose for 9-kmer or 5-kmer? In downstream work, I usually integrate the results of eventalign to generate 9-kmer files.

Thank you for your quick reply. Best wishes, Kirito

hasindu2008 commented 1 month ago

Can I know if you are working with dna or rna? Also is it r9 or r10 if DNA, or rna002 or rna004 if RNA?

Another question I have is, are you only interested in the signal alignment to base-called read? If that is the case, would the f5c resquiggle subprogramme be able to provide what you are after?

kir1to455 commented 1 month ago

Hi, @hasindu2008 I'm working wit rna data both rna002 and rna004. Actually, I'm using f5c eventalign to train a model for RNA methylation. So the f5c resquiggle may not meet my needs. I can also use Python to extract the kmer of each read ID at each position, but this is too slow and I am not an expert in C language.

Best wishes, Kirito

hasindu2008 commented 1 month ago

Hey

I looked through the code and realised implementing this requires several modifications to the code, especially because some of the read-related metadata is not passed to the HMM step. Before putting in the effort and time, I would like to know your intention of relying on the read_kmer. If it is for training a model, wouldn't relying on the read_kmer instead of the reference_kmer or the model_kmer, make it biased towards errors in existing basecaller models?

Especially when base-modifications are present, the basecallers tend to call such bases erroneously. So I think it would be more sensible to use the reference_kmer than the read_k mer in this case.

kir1to455 commented 1 month ago

Hi, @hasindu2008 I input both reference_kmer and read_kmer simultaneously with one hot encode matrix. An important point is that errors in basecalling do not occur randomly, but rather follow a regular pattern next to modified bases. Take a example. In this case, the last column is read_kmer, It occurs deletion at two downstream bases of m6A. In machine learning, by entering reference_kmer and read_kmer, it will recognize errors in these bases and know which ones are true m6A. image

Best wishes, Kirito

hasindu2008 commented 1 month ago

@kir1to455

I have attempted to implement this feature in a separate branch https://github.com/hasindu2008/f5c/tree/read_kmer. It prints two columns: read_pos and read_kmer

However, I could not find time to test if the implementation is correct, especially in corner cases - so there could be mistakes. Would you be able to run on your data and manually inspect a few cases (like in the figure you posted above) to see if things are good?

You do not need --print-read-kmer in this branch because by default the read_kmer are printed. Later if this is to merge to the main branch, I will introduce this as an option --print-read-kmer

To build this branch:

git clone https://github.com/hasindu2008/f5c/ -b read_kmer
autoreconf 
scripts/install-hts.sh  # download and compile the htslib
./configure
make                    # make cuda=1 to enable CUDA support
kir1to455 commented 1 month ago

Hi, @hasindu2008 I'm sorry to reply now. I was on business for a week last week.

I tested this new branch and found several obvious problems.

  1. In the first and second lines, the position changes but the read_pos does not. This leads to errors in all subsequent read kmers. image And in position 16 and position 17, read_pos from 3 to 7. Something wrong here? image

  2. In read_kmer, it seems that nothing can represent deletion. It seems to ignore deletion in position 18. The true position kmer should be CCC.T, and not CCCTG kmer. image image

I think deletion is a very important factor in training the model. Because the m6A context deletion will have a great impact on the eventalign signal there.

Best wishes, Kirito

hasindu2008 commented 1 month ago

Hello,

Would you be able to send me this read you are looking at (the FASTQ and the SLOW5)? You can do a slow5tools get reads.blow5 readid -o readid.blow5 to get a particular read. Also, let me know which reference you are using and I should download it.

If you provide a small test case like what you are showing would be really helpful. While I could create a small set myself, I really appreciate if you could share yours to start with as I am a bit pressed for finding time recently. Thanks.

kir1to455 commented 1 month ago

Hi, @hasindu2008 I have uploaded the top 20 reads files as the test set. If you have any other requirements, please tell me. mapping Step: minimap2 -ax map-ont -L -un -k 8 -w 1 --secondary=no -t 36 -p 0.99 test.fa test.fastq | samtools view -F 2324 -b -@ 36 -o test.bam Sam2tsv Step: sam2tsv -R test.fa test.bam -o test.sam2tsv.tsv Eventalign Step: ${f5c_dir}/f5c eventalign --reads test.fastq --bam test.bam --genome test.fa --slow5 test.blow5 -t 35 --min-mapq 0 --secondary=no --rna --signal-index --scale-events --collapse-events --samples -B 14M -K 2048 --cuda-dev-id 0 --summary test.summary.txt | pigz > test.eventalign.tsv.gz

testfile.zip

Best wishes, Kirito

hasindu2008 commented 1 month ago

Thanks. I will have a look at this soon sometime later during this week when I get some time.

hasindu2008 commented 1 month ago

I had a look and the discrepancy between them is because the read-kmers are printed based on the initial read-to-event alignment step, done at the first stage of the f5c eventalign programme (called adaptive banded event alignment - see https://hasindu2008.github.io/f5c/nanopore_signal_alignment_supplementary_material.pdf for details). This initial alignment is a crude alignment, which is then refined to a more accurate alignment using HMM and the reference. Because this initial alignment is crude, it would be some times off from the actual alignment as you observe and have very little deletions. So perhaps what you are after is the read-kmers printed based on the CIGAR string (in BAM file), rather than this result from adaptive banded event alignment?

kir1to455 commented 4 weeks ago

Hi, @hasindu2008

So perhaps what you are after is the read-kmers printed based on the CIGAR string (in BAM file)

Yes, I want to use the read-kmer based on the CIGAR string in bam file.

hasindu2008 commented 3 weeks ago

@kir1to455 Over the weekend, I got this CIGAR-based read-kmer thing implemented. Would you be able to have a look?

contig  position    reference_kmer  read_name   strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  read_pos    read_kmer
E   10  TATAC   17ba731f-98f9-4d62-8660-b10c17235d65    t   1   93.54   1.102   0.00299 TATAC   95.09   6.28    -0.23   0   TATAC
E   10  TATAC   17ba731f-98f9-4d62-8660-b10c17235d65    t   2   92.40   1.473   0.00465 NNNNN   0.00    0.00    inf 0   TATAC
E   12  TACCT   17ba731f-98f9-4d62-8660-b10c17235d65    t   3   84.25   6.243   0.00930 TACCT   79.02   2.81    1.70    2   TACCT
E   13  ACCTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   4   71.24   1.824   0.01461 ACCTT   72.21   2.35    -0.38   3   ACCTT
E   14  CCTTC   17ba731f-98f9-4d62-8660-b10c17235d65    t   5   69.37   0.907   0.00299 CCTTC   75.32   2.42    -2.25   4   CCTTC
E   15  CTTCC   17ba731f-98f9-4d62-8660-b10c17235d65    t   6   71.33   1.592   0.00697 CTTCC   74.22   2.07    -1.27   5   CTTCC
E   16  TTCCC   17ba731f-98f9-4d62-8660-b10c17235d65    t   7   65.91   1.122   0.01627 TTCCC   68.77   2.30    -1.14   6   TTCCC
E   16  TTCCC   17ba731f-98f9-4d62-8660-b10c17235d65    t   8   67.31   0.867   0.00299 TTCCC   68.77   2.30    -0.58   6   TTCCC
E   16  TTCCC   17ba731f-98f9-4d62-8660-b10c17235d65    t   9   66.40   1.004   0.01295 TTCCC   68.77   2.30    -0.94   6   TTCCC
E   16  TTCCC   17ba731f-98f9-4d62-8660-b10c17235d65    t   10  73.45   1.951   0.01062 TTCCC   68.77   2.30    1.86    6   TTCCC
E   16  TTCCC   17ba731f-98f9-4d62-8660-b10c17235d65    t   11  71.79   1.170   0.00232 TTCCC   68.77   2.30    1.20    6   TTCCC
E   16  TTCCC   17ba731f-98f9-4d62-8660-b10c17235d65    t   12  73.47   1.554   0.00299 TTCCC   68.77   2.30    1.87    6   TTCCC
E   16  TTCCC   17ba731f-98f9-4d62-8660-b10c17235d65    t   13  67.69   0.744   0.00465 TTCCC   68.77   2.30    -0.43   6   TTCCC
E   17  TCCCA   17ba731f-98f9-4d62-8660-b10c17235d65    t   14  66.47   0.907   0.00498 TCCCA   64.83   2.07    0.72    7   TCCCT
E   17  TCCCA   17ba731f-98f9-4d62-8660-b10c17235d65    t   15  62.26   1.074   0.00498 TCCCA   64.83   2.07    -1.14   7   TCCCT
E   18  CCCAG   17ba731f-98f9-4d62-8660-b10c17235d65    t   16  68.51   2.112   0.00299 CCCAG   71.95   2.11    -1.49   8   CCCTG
E   18  CCCAG   17ba731f-98f9-4d62-8660-b10c17235d65    t   17  107.28  11.300  0.00764 NNNNN   0.00    0.00    inf 8   CCCTG
E   23  GTAAC   17ba731f-98f9-4d62-8660-b10c17235d65    t   18  86.77   2.427   0.01096 GTAAC   87.41   3.04    -0.19   12  GTAAC
E   24  TAACA   17ba731f-98f9-4d62-8660-b10c17235d65    t   19  93.93   2.873   0.00365 TAACA   93.83   3.45    0.03    13  TAACA
E   24  TAACA   17ba731f-98f9-4d62-8660-b10c17235d65    t   20  91.70   1.486   0.00199 TAACA   93.83   3.45    -0.57   13  TAACA
E   25  AACAA   17ba731f-98f9-4d62-8660-b10c17235d65    t   21  87.51   1.707   0.00299 AACAA   87.82   3.18    -0.09   14  AACAA
E   25  AACAA   17ba731f-98f9-4d62-8660-b10c17235d65    t   22  84.47   1.322   0.00332 AACAA   87.82   3.18    -0.96   14  AACAA
E   25  AACAA   17ba731f-98f9-4d62-8660-b10c17235d65    t   23  86.27   1.391   0.00531 AACAA   87.82   3.18    -0.45   14  AACAA
E   26  ACAAA   17ba731f-98f9-4d62-8660-b10c17235d65    t   24  81.30   1.865   0.00266 ACAAA   82.45   3.02    -0.35   15  ACAAA
E   26  ACAAA   17ba731f-98f9-4d62-8660-b10c17235d65    t   25  83.27   1.521   0.00830 ACAAA   82.45   3.02    0.25    15  ACAAA
E   27  CAAAC   17ba731f-98f9-4d62-8660-b10c17235d65    t   26  105.75  3.077   0.01494 CAAAC   104.22  2.68    0.52    16  CAAAC
E   27  CAAAC   17ba731f-98f9-4d62-8660-b10c17235d65    t   27  84.11   1.725   0.01162 NNNNN   0.00    0.00    inf 16  CAAAC
E   29  AACCA   17ba731f-98f9-4d62-8660-b10c17235d65    t   28  82.73   1.828   0.00564 AACCA   82.92   2.81    -0.06   18  AACCA
E   30  ACCAA   17ba731f-98f9-4d62-8660-b10c17235d65    t   29  75.96   1.435   0.00830 ACCAA   74.58   2.11    0.60    19  ACCAA
E   31  CCAAC   17ba731f-98f9-4d62-8660-b10c17235d65    t   30  86.31   1.666   0.00299 CCAAC   85.16   3.02    0.35    20  CCAAC
E   31  CCAAC   17ba731f-98f9-4d62-8660-b10c17235d65    t   31  83.43   1.316   0.00232 CCAAC   85.16   3.02    -0.52   20  CCAAC
E   31  CCAAC   17ba731f-98f9-4d62-8660-b10c17235d65    t   32  88.68   1.904   0.00564 CCAAC   85.16   3.02    1.06    20  CCAAC
E   31  CCAAC   17ba731f-98f9-4d62-8660-b10c17235d65    t   33  84.48   1.887   0.00498 CCAAC   85.16   3.02    -0.21   20  CCAAC
E   32  CAACC   17ba731f-98f9-4d62-8660-b10c17235d65    t   34  90.35   3.208   0.00299 CAACC   89.37   3.45    0.26    21  CAACC
E   32  CAACC   17ba731f-98f9-4d62-8660-b10c17235d65    t   35  86.93   3.433   0.00498 CAACC   89.37   3.45    -0.65   21  CAACC
E   33  AACCA   17ba731f-98f9-4d62-8660-b10c17235d65    t   36  80.52   1.474   0.00266 AACCA   82.92   2.81    -0.78   22  AACCA
E   34  ACCAA   17ba731f-98f9-4d62-8660-b10c17235d65    t   37  76.60   1.601   0.00830 ACCAA   74.58   2.11    0.88    23  ACCAT
E   35  CCAAC   17ba731f-98f9-4d62-8660-b10c17235d65    t   38  88.80   1.205   0.00232 CCAAC   85.16   3.02    1.10    24  CCATT
E   36  CAACT   17ba731f-98f9-4d62-8660-b10c17235d65    t   39  91.88   1.182   0.00697 CAACT   96.11   3.45    -1.12   25  CATTT
E   37  AACTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   40  91.50   1.696   0.00797 AACTT   92.87   2.73    -0.46   .   .
E   37  AACTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   41  93.03   1.487   0.00498 AACTT   92.87   2.73    0.05    .   .
E   37  AACTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   42  97.74   1.844   0.00299 AACTT   92.87   2.73    1.63    .   .
E   37  AACTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   43  91.90   1.740   0.00299 AACTT   92.87   2.73    -0.32   .   .
E   38  ACTTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   44  81.34   1.426   0.00465 ACTTT   82.38   2.42    -0.39   26  ATTTT
E   39  CTTTC   17ba731f-98f9-4d62-8660-b10c17235d65    t   45  77.98   1.227   0.00398 CTTTC   77.72   1.97    0.12    27  TTTTC
E   40  TTTCG   17ba731f-98f9-4d62-8660-b10c17235d65    t   46  80.87   1.196   0.00697 TTTCG   79.07   2.07    0.79    28  TTTCG
E   40  TTTCG   17ba731f-98f9-4d62-8660-b10c17235d65    t   47  79.44   2.088   0.00664 TTTCG   79.07   2.07    0.17    28  TTTCG
E   41  TTCGA   17ba731f-98f9-4d62-8660-b10c17235d65    t   48  87.37   2.478   0.00697 TTCGA   84.70   2.30    1.06    29  TTCGA
E   42  TCGAT   17ba731f-98f9-4d62-8660-b10c17235d65    t   49  99.04   3.304   0.00266 TCGAT   99.04   3.99    0.00    30  TCGAT
E   42  TCGAT   17ba731f-98f9-4d62-8660-b10c17235d65    t   50  103.69  1.882   0.00398 TCGAT   99.04   3.99    1.07    30  TCGAT
E   43  CGATC   17ba731f-98f9-4d62-8660-b10c17235d65    t   51  117.71  6.769   0.01627 CGATC   117.72  5.10    -0.00   1   ATACC
E   44  GATCT   17ba731f-98f9-4d62-8660-b10c17235d65    t   52  98.72   2.460   0.00465 GATCT   95.38   5.70    0.53    2   TACCT
E   44  GATCT   17ba731f-98f9-4d62-8660-b10c17235d65    t   53  90.61   4.844   0.00266 GATCT   95.38   5.70    -0.77   2   TACCT
E   44  GATCT   17ba731f-98f9-4d62-8660-b10c17235d65    t   54  83.23   4.277   0.00564 GATCT   95.38   5.70    -1.95   2   TACCT
E   45  ATCTC   17ba731f-98f9-4d62-8660-b10c17235d65    t   55  75.25   1.181   0.00498 ATCTC   76.56   2.07    -0.58   3   ACCTT
E   45  ATCTC   17ba731f-98f9-4d62-8660-b10c17235d65    t   56  75.89   1.135   0.00299 ATCTC   76.56   2.07    -0.30   3   ACCTT
E   46  TCTCT   17ba731f-98f9-4d62-8660-b10c17235d65    t   57  78.40   1.240   0.00797 TCTCT   79.44   2.85    -0.34   4   CCTTC
E   47  CTCTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   58  80.19   0.976   0.01793 CTCTT   79.01   2.07    0.52    5   CTTCC
E   47  CTCTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   59  81.89   0.647   0.00266 CTCTT   79.01   2.07    1.28    5   CTTCC
E   47  CTCTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   60  78.30   1.424   0.00664 CTCTT   79.01   2.07    -0.32   5   CTTCC
E   48  TCTTG   17ba731f-98f9-4d62-8660-b10c17235d65    t   61  85.21   1.177   0.00498 TCTTG   80.89   2.42    1.63    6   TTCCC
E   48  TCTTG   17ba731f-98f9-4d62-8660-b10c17235d65    t   62  82.00   1.307   0.00365 TCTTG   80.89   2.42    0.42    6   TTCCC
E   49  CTTGT   17ba731f-98f9-4d62-8660-b10c17235d65    t   63  86.62   2.146   0.00896 CTTGT   86.77   2.65    -0.05   7   TCCCT
E   50  TTGTA   17ba731f-98f9-4d62-8660-b10c17235d65    t   64  98.97   1.391   0.00332 TTGTA   99.35   3.78    -0.09   8   CCCTG
E   51  TGTAG   17ba731f-98f9-4d62-8660-b10c17235d65    t   65  106.88  7.324   0.00896 TGTAG   106.93  8.56    -0.01   9   CCTGT
E   52  GTAGA   17ba731f-98f9-4d62-8660-b10c17235d65    t   66  100.77  5.392   0.00498 GTAGA   102.41  3.78    -0.40   10  CTGTA
E   52  GTAGA   17ba731f-98f9-4d62-8660-b10c17235d65    t   67  108.24  2.692   0.00332 GTAGA   102.41  3.78    1.41    10  CTGTA
E   53  TAGAT   17ba731f-98f9-4d62-8660-b10c17235d65    t   68  131.21  1.155   0.00365 TAGAT   126.66  5.87    0.71    .   .
E   54  AGATC   17ba731f-98f9-4d62-8660-b10c17235d65    t   69  139.84  4.096   0.00432 AGATC   134.08  5.10    1.03    11  TGTAA
E   55  GATCT   17ba731f-98f9-4d62-8660-b10c17235d65    t   70  107.10  7.655   0.00697 GATCT   95.38   5.70    1.88    12  GTAAC
E   55  GATCT   17ba731f-98f9-4d62-8660-b10c17235d65    t   71  92.67   4.136   0.00863 GATCT   95.38   5.70    -0.44   12  GTAAC
E   55  GATCT   17ba731f-98f9-4d62-8660-b10c17235d65    t   72  86.46   4.580   0.00432 GATCT   95.38   5.70    -1.43   12  GTAAC
E   56  ATCTG   17ba731f-98f9-4d62-8660-b10c17235d65    t   73  75.55   1.159   0.00299 ATCTG   77.11   2.07    -0.69   13  TAACA
E   56  ATCTG   17ba731f-98f9-4d62-8660-b10c17235d65    t   74  74.25   0.992   0.00266 ATCTG   77.11   2.07    -1.27   13  TAACA
E   57  TCTGT   17ba731f-98f9-4d62-8660-b10c17235d65    t   75  84.84   1.324   0.00398 TCTGT   84.95   2.85    -0.04   14  AACAA
kir1to455 commented 2 weeks ago

Hi, @hasindu2008 Thank you for your efforts. There still seem to be some problems. In bam file, read "17ba731f-98f9-4d62-8660-b10c17235d65" in position 22 has a Deletion. image

In eventalign(0-based) , there should be a deletion at position 21. So in this line, E 17 TCCCA 17ba731f-98f9-4d62-8660-b10c17235d65 t 14 66.47 0.907 0.00498 TCCCA 64.83 2.07 0.72 7 TCCCT E 17 TCCCA 17ba731f-98f9-4d62-8660-b10c17235d65 t 15 62.26 1.074 0.00498 TCCCA 64.83 2.07 -1.14 7 TCCCT readkmer should be TCCC. , "." means deletion. Because base A is deletion here. (Or you can use "" presents deletion, I see that the modkit of nanopore does this)

Similarly, E 18 CCCAG 17ba731f-98f9-4d62-8660-b10c17235d65 t 16 68.51 2.112 0.00299 CCCAG 71.95 2.11 -1.49 8 CCCTG E 18 CCCAG 17ba731f-98f9-4d62-8660-b10c17235d65 t 17 107.28 11.300 0.00764 NNNNN 0.00 0.00 inf 8 CCCTG read_kmer should be CCC.G .

Best wishes, Kirito

hasindu2008 commented 1 week ago

@kir1to455 I had a look over the weekend and due to the way that f5c internally keeps metadata, it is a lot of work to implement such a kmer format. The reason is, f5c internally just keeps the k-mer position (index) on the read and then when printing the output it prints the relevant k-mer that starts from the given index. Is it possible for you to use the "." in the read k-mer position to deduce the necessary format?

E   37  AACTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   40  91.50   1.696   0.00797 AACTT   92.87   2.73    -0.46   .   .
E   37  AACTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   41  93.03   1.487   0.00498 AACTT   92.87   2.73    0.05    .   .
E   37  AACTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   42  97.74   1.844   0.00299 AACTT   92.87   2.73    1.63    .   .
E   37  AACTT   17ba731f-98f9-4d62-8660-b10c17235d65    t   43  91.90   1.740   0.00299 AACTT   92.87   2.73    -0.32   .   .

For instance, "." in this case says that the position incurs a deletion.

Alternatively,

kir1to455 commented 1 week ago

Hi, @hasindu2008 It seems hard to use "." in the read k-mer position to deduce the format. The reason is that if eventalign didn't have the position, it will not output the "." in read k-mer. At position 37,it had a deletion. So I can deduce the foramt. Similarly, at position 21, it also had a deletion. But eventalign didn't have the position. So I couldn't know In position 17 the motif should be TCCCA or TCCC.

Thank you for your help! If f5c is hard to implement this function, then forget it. Best wishes, Kirito

hasindu2008 commented 1 week ago

Oh, I see what you mean. I think the way to go ahead in f5c to implement this would be to add an additional step that parses the cigar string that deduce the format that you require.

Is this for a mind-storming/prototyping project you are requiring this option or is it that you already have a working script but that wants something faster? If it is the former, I would suggest that you write a script that parses the CIGAR in the BAM file in combination with the event align TSV to print the custom output. If it is later, I am happy to put the effort to implement this into f5c as it is worth the effort if it is an already working pipeline that needs performance improvement.

kir1to455 commented 6 days ago

Hi, @hasindu2008 I already have a python script to parse the CIGAR in the BAM file. I used sam2tsv to output the bases at all positions. However, for each line (each read and position), it needs to iterate through the entire read to determine what the base is at this position. This adds a lot of unnecessary overhead. I believe C language can handle this better.

Best wishes, Kirito

hasindu2008 commented 5 days ago

From your description, I could not fully understand. If you email me the part of the algorithm that does this to hasindu@unsw.edu.au or put it here, I can see if I can understand. Also, for this read "17ba731f-98f9-4d62-8660-b10c17235d65" do you have the intended output? How do you handle the insertions?

kir1to455 commented 4 days ago

Hi, @hasindu2008

I have sent the code to your email, please check.