Open kir1to455 opened 4 months ago
Hello,
I can add an option like --print-read-kmer
that will give a column called read_kmer
?
The reason I do not want to add by default is to prevent this tsv file from getting too large - it is large enough currently.
Will this do?
Of course! In addition, --print-read-kmer can be chose for 9-kmer or 5-kmer? In downstream work, I usually integrate the results of eventalign to generate 9-kmer files.
Thank you for your quick reply. Best wishes, Kirito
Can I know if you are working with dna or rna? Also is it r9 or r10 if DNA, or rna002 or rna004 if RNA?
Another question I have is, are you only interested in the signal alignment to base-called read? If that is the case, would the f5c resquiggle
subprogramme be able to provide what you are after?
Hi, @hasindu2008 I'm working wit rna data both rna002 and rna004. Actually, I'm using f5c eventalign to train a model for RNA methylation. So the f5c resquiggle may not meet my needs. I can also use Python to extract the kmer of each read ID at each position, but this is too slow and I am not an expert in C language.
Best wishes, Kirito
Hey
I looked through the code and realised implementing this requires several modifications to the code, especially because some of the read-related metadata is not passed to the HMM step. Before putting in the effort and time, I would like to know your intention of relying on the read_kmer. If it is for training a model, wouldn't relying on the read_kmer instead of the reference_kmer or the model_kmer, make it biased towards errors in existing basecaller models?
Especially when base-modifications are present, the basecallers tend to call such bases erroneously. So I think it would be more sensible to use the reference_kmer than the read_k mer in this case.
Hi, @hasindu2008 I input both reference_kmer and read_kmer simultaneously with one hot encode matrix. An important point is that errors in basecalling do not occur randomly, but rather follow a regular pattern next to modified bases. Take a example. In this case, the last column is read_kmer, It occurs deletion at two downstream bases of m6A. In machine learning, by entering reference_kmer and read_kmer, it will recognize errors in these bases and know which ones are true m6A.
Best wishes, Kirito
@kir1to455
I have attempted to implement this feature in a separate branch https://github.com/hasindu2008/f5c/tree/read_kmer. It prints two columns: read_pos and read_kmer
However, I could not find time to test if the implementation is correct, especially in corner cases - so there could be mistakes. Would you be able to run on your data and manually inspect a few cases (like in the figure you posted above) to see if things are good?
You do not need --print-read-kmer
in this branch because by default the read_kmer are printed.
Later if this is to merge to the main branch, I will introduce this as an option --print-read-kmer
To build this branch:
git clone https://github.com/hasindu2008/f5c/ -b read_kmer
autoreconf
scripts/install-hts.sh # download and compile the htslib
./configure
make # make cuda=1 to enable CUDA support
Hi, @hasindu2008 I'm sorry to reply now. I was on business for a week last week.
I tested this new branch and found several obvious problems.
In the first and second lines, the position changes but the read_pos does not. This leads to errors in all subsequent read kmers. And in position 16 and position 17, read_pos from 3 to 7. Something wrong here?
In read_kmer, it seems that nothing can represent deletion. It seems to ignore deletion in position 18. The true position kmer should be CCC.T, and not CCCTG kmer.
I think deletion is a very important factor in training the model. Because the m6A context deletion will have a great impact on the eventalign signal there.
Best wishes, Kirito
Hello,
Would you be able to send me this read you are looking at (the FASTQ and the SLOW5)?
You can do a slow5tools get reads.blow5 readid -o readid.blow5
to get a particular read.
Also, let me know which reference you are using and I should download it.
If you provide a small test case like what you are showing would be really helpful. While I could create a small set myself, I really appreciate if you could share yours to start with as I am a bit pressed for finding time recently. Thanks.
Hi, @hasindu2008
I have uploaded the top 20 reads files as the test set. If you have any other requirements, please tell me.
mapping Step:
minimap2 -ax map-ont -L -un -k 8 -w 1 --secondary=no -t 36 -p 0.99 test.fa test.fastq | samtools view -F 2324 -b -@ 36 -o test.bam
Sam2tsv Step:
sam2tsv -R test.fa test.bam -o test.sam2tsv.tsv
Eventalign Step:
${f5c_dir}/f5c eventalign --reads test.fastq --bam test.bam --genome test.fa --slow5 test.blow5 -t 35 --min-mapq 0 --secondary=no --rna --signal-index --scale-events --collapse-events --samples -B 14M -K 2048 --cuda-dev-id 0 --summary test.summary.txt | pigz > test.eventalign.tsv.gz
Best wishes, Kirito
Thanks. I will have a look at this soon sometime later during this week when I get some time.
I had a look and the discrepancy between them is because the read-kmers are printed based on the initial read-to-event alignment step, done at the first stage of the f5c eventalign programme (called adaptive banded event alignment - see https://hasindu2008.github.io/f5c/nanopore_signal_alignment_supplementary_material.pdf for details). This initial alignment is a crude alignment, which is then refined to a more accurate alignment using HMM and the reference. Because this initial alignment is crude, it would be some times off from the actual alignment as you observe and have very little deletions. So perhaps what you are after is the read-kmers printed based on the CIGAR string (in BAM file), rather than this result from adaptive banded event alignment?
Hi, @hasindu2008
So perhaps what you are after is the read-kmers printed based on the CIGAR string (in BAM file)
Yes, I want to use the read-kmer based on the CIGAR string in bam file.
@kir1to455 Over the weekend, I got this CIGAR-based read-kmer thing implemented. Would you be able to have a look?
contig position reference_kmer read_name strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level read_pos read_kmer
E 10 TATAC 17ba731f-98f9-4d62-8660-b10c17235d65 t 1 93.54 1.102 0.00299 TATAC 95.09 6.28 -0.23 0 TATAC
E 10 TATAC 17ba731f-98f9-4d62-8660-b10c17235d65 t 2 92.40 1.473 0.00465 NNNNN 0.00 0.00 inf 0 TATAC
E 12 TACCT 17ba731f-98f9-4d62-8660-b10c17235d65 t 3 84.25 6.243 0.00930 TACCT 79.02 2.81 1.70 2 TACCT
E 13 ACCTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 4 71.24 1.824 0.01461 ACCTT 72.21 2.35 -0.38 3 ACCTT
E 14 CCTTC 17ba731f-98f9-4d62-8660-b10c17235d65 t 5 69.37 0.907 0.00299 CCTTC 75.32 2.42 -2.25 4 CCTTC
E 15 CTTCC 17ba731f-98f9-4d62-8660-b10c17235d65 t 6 71.33 1.592 0.00697 CTTCC 74.22 2.07 -1.27 5 CTTCC
E 16 TTCCC 17ba731f-98f9-4d62-8660-b10c17235d65 t 7 65.91 1.122 0.01627 TTCCC 68.77 2.30 -1.14 6 TTCCC
E 16 TTCCC 17ba731f-98f9-4d62-8660-b10c17235d65 t 8 67.31 0.867 0.00299 TTCCC 68.77 2.30 -0.58 6 TTCCC
E 16 TTCCC 17ba731f-98f9-4d62-8660-b10c17235d65 t 9 66.40 1.004 0.01295 TTCCC 68.77 2.30 -0.94 6 TTCCC
E 16 TTCCC 17ba731f-98f9-4d62-8660-b10c17235d65 t 10 73.45 1.951 0.01062 TTCCC 68.77 2.30 1.86 6 TTCCC
E 16 TTCCC 17ba731f-98f9-4d62-8660-b10c17235d65 t 11 71.79 1.170 0.00232 TTCCC 68.77 2.30 1.20 6 TTCCC
E 16 TTCCC 17ba731f-98f9-4d62-8660-b10c17235d65 t 12 73.47 1.554 0.00299 TTCCC 68.77 2.30 1.87 6 TTCCC
E 16 TTCCC 17ba731f-98f9-4d62-8660-b10c17235d65 t 13 67.69 0.744 0.00465 TTCCC 68.77 2.30 -0.43 6 TTCCC
E 17 TCCCA 17ba731f-98f9-4d62-8660-b10c17235d65 t 14 66.47 0.907 0.00498 TCCCA 64.83 2.07 0.72 7 TCCCT
E 17 TCCCA 17ba731f-98f9-4d62-8660-b10c17235d65 t 15 62.26 1.074 0.00498 TCCCA 64.83 2.07 -1.14 7 TCCCT
E 18 CCCAG 17ba731f-98f9-4d62-8660-b10c17235d65 t 16 68.51 2.112 0.00299 CCCAG 71.95 2.11 -1.49 8 CCCTG
E 18 CCCAG 17ba731f-98f9-4d62-8660-b10c17235d65 t 17 107.28 11.300 0.00764 NNNNN 0.00 0.00 inf 8 CCCTG
E 23 GTAAC 17ba731f-98f9-4d62-8660-b10c17235d65 t 18 86.77 2.427 0.01096 GTAAC 87.41 3.04 -0.19 12 GTAAC
E 24 TAACA 17ba731f-98f9-4d62-8660-b10c17235d65 t 19 93.93 2.873 0.00365 TAACA 93.83 3.45 0.03 13 TAACA
E 24 TAACA 17ba731f-98f9-4d62-8660-b10c17235d65 t 20 91.70 1.486 0.00199 TAACA 93.83 3.45 -0.57 13 TAACA
E 25 AACAA 17ba731f-98f9-4d62-8660-b10c17235d65 t 21 87.51 1.707 0.00299 AACAA 87.82 3.18 -0.09 14 AACAA
E 25 AACAA 17ba731f-98f9-4d62-8660-b10c17235d65 t 22 84.47 1.322 0.00332 AACAA 87.82 3.18 -0.96 14 AACAA
E 25 AACAA 17ba731f-98f9-4d62-8660-b10c17235d65 t 23 86.27 1.391 0.00531 AACAA 87.82 3.18 -0.45 14 AACAA
E 26 ACAAA 17ba731f-98f9-4d62-8660-b10c17235d65 t 24 81.30 1.865 0.00266 ACAAA 82.45 3.02 -0.35 15 ACAAA
E 26 ACAAA 17ba731f-98f9-4d62-8660-b10c17235d65 t 25 83.27 1.521 0.00830 ACAAA 82.45 3.02 0.25 15 ACAAA
E 27 CAAAC 17ba731f-98f9-4d62-8660-b10c17235d65 t 26 105.75 3.077 0.01494 CAAAC 104.22 2.68 0.52 16 CAAAC
E 27 CAAAC 17ba731f-98f9-4d62-8660-b10c17235d65 t 27 84.11 1.725 0.01162 NNNNN 0.00 0.00 inf 16 CAAAC
E 29 AACCA 17ba731f-98f9-4d62-8660-b10c17235d65 t 28 82.73 1.828 0.00564 AACCA 82.92 2.81 -0.06 18 AACCA
E 30 ACCAA 17ba731f-98f9-4d62-8660-b10c17235d65 t 29 75.96 1.435 0.00830 ACCAA 74.58 2.11 0.60 19 ACCAA
E 31 CCAAC 17ba731f-98f9-4d62-8660-b10c17235d65 t 30 86.31 1.666 0.00299 CCAAC 85.16 3.02 0.35 20 CCAAC
E 31 CCAAC 17ba731f-98f9-4d62-8660-b10c17235d65 t 31 83.43 1.316 0.00232 CCAAC 85.16 3.02 -0.52 20 CCAAC
E 31 CCAAC 17ba731f-98f9-4d62-8660-b10c17235d65 t 32 88.68 1.904 0.00564 CCAAC 85.16 3.02 1.06 20 CCAAC
E 31 CCAAC 17ba731f-98f9-4d62-8660-b10c17235d65 t 33 84.48 1.887 0.00498 CCAAC 85.16 3.02 -0.21 20 CCAAC
E 32 CAACC 17ba731f-98f9-4d62-8660-b10c17235d65 t 34 90.35 3.208 0.00299 CAACC 89.37 3.45 0.26 21 CAACC
E 32 CAACC 17ba731f-98f9-4d62-8660-b10c17235d65 t 35 86.93 3.433 0.00498 CAACC 89.37 3.45 -0.65 21 CAACC
E 33 AACCA 17ba731f-98f9-4d62-8660-b10c17235d65 t 36 80.52 1.474 0.00266 AACCA 82.92 2.81 -0.78 22 AACCA
E 34 ACCAA 17ba731f-98f9-4d62-8660-b10c17235d65 t 37 76.60 1.601 0.00830 ACCAA 74.58 2.11 0.88 23 ACCAT
E 35 CCAAC 17ba731f-98f9-4d62-8660-b10c17235d65 t 38 88.80 1.205 0.00232 CCAAC 85.16 3.02 1.10 24 CCATT
E 36 CAACT 17ba731f-98f9-4d62-8660-b10c17235d65 t 39 91.88 1.182 0.00697 CAACT 96.11 3.45 -1.12 25 CATTT
E 37 AACTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 40 91.50 1.696 0.00797 AACTT 92.87 2.73 -0.46 . .
E 37 AACTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 41 93.03 1.487 0.00498 AACTT 92.87 2.73 0.05 . .
E 37 AACTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 42 97.74 1.844 0.00299 AACTT 92.87 2.73 1.63 . .
E 37 AACTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 43 91.90 1.740 0.00299 AACTT 92.87 2.73 -0.32 . .
E 38 ACTTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 44 81.34 1.426 0.00465 ACTTT 82.38 2.42 -0.39 26 ATTTT
E 39 CTTTC 17ba731f-98f9-4d62-8660-b10c17235d65 t 45 77.98 1.227 0.00398 CTTTC 77.72 1.97 0.12 27 TTTTC
E 40 TTTCG 17ba731f-98f9-4d62-8660-b10c17235d65 t 46 80.87 1.196 0.00697 TTTCG 79.07 2.07 0.79 28 TTTCG
E 40 TTTCG 17ba731f-98f9-4d62-8660-b10c17235d65 t 47 79.44 2.088 0.00664 TTTCG 79.07 2.07 0.17 28 TTTCG
E 41 TTCGA 17ba731f-98f9-4d62-8660-b10c17235d65 t 48 87.37 2.478 0.00697 TTCGA 84.70 2.30 1.06 29 TTCGA
E 42 TCGAT 17ba731f-98f9-4d62-8660-b10c17235d65 t 49 99.04 3.304 0.00266 TCGAT 99.04 3.99 0.00 30 TCGAT
E 42 TCGAT 17ba731f-98f9-4d62-8660-b10c17235d65 t 50 103.69 1.882 0.00398 TCGAT 99.04 3.99 1.07 30 TCGAT
E 43 CGATC 17ba731f-98f9-4d62-8660-b10c17235d65 t 51 117.71 6.769 0.01627 CGATC 117.72 5.10 -0.00 1 ATACC
E 44 GATCT 17ba731f-98f9-4d62-8660-b10c17235d65 t 52 98.72 2.460 0.00465 GATCT 95.38 5.70 0.53 2 TACCT
E 44 GATCT 17ba731f-98f9-4d62-8660-b10c17235d65 t 53 90.61 4.844 0.00266 GATCT 95.38 5.70 -0.77 2 TACCT
E 44 GATCT 17ba731f-98f9-4d62-8660-b10c17235d65 t 54 83.23 4.277 0.00564 GATCT 95.38 5.70 -1.95 2 TACCT
E 45 ATCTC 17ba731f-98f9-4d62-8660-b10c17235d65 t 55 75.25 1.181 0.00498 ATCTC 76.56 2.07 -0.58 3 ACCTT
E 45 ATCTC 17ba731f-98f9-4d62-8660-b10c17235d65 t 56 75.89 1.135 0.00299 ATCTC 76.56 2.07 -0.30 3 ACCTT
E 46 TCTCT 17ba731f-98f9-4d62-8660-b10c17235d65 t 57 78.40 1.240 0.00797 TCTCT 79.44 2.85 -0.34 4 CCTTC
E 47 CTCTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 58 80.19 0.976 0.01793 CTCTT 79.01 2.07 0.52 5 CTTCC
E 47 CTCTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 59 81.89 0.647 0.00266 CTCTT 79.01 2.07 1.28 5 CTTCC
E 47 CTCTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 60 78.30 1.424 0.00664 CTCTT 79.01 2.07 -0.32 5 CTTCC
E 48 TCTTG 17ba731f-98f9-4d62-8660-b10c17235d65 t 61 85.21 1.177 0.00498 TCTTG 80.89 2.42 1.63 6 TTCCC
E 48 TCTTG 17ba731f-98f9-4d62-8660-b10c17235d65 t 62 82.00 1.307 0.00365 TCTTG 80.89 2.42 0.42 6 TTCCC
E 49 CTTGT 17ba731f-98f9-4d62-8660-b10c17235d65 t 63 86.62 2.146 0.00896 CTTGT 86.77 2.65 -0.05 7 TCCCT
E 50 TTGTA 17ba731f-98f9-4d62-8660-b10c17235d65 t 64 98.97 1.391 0.00332 TTGTA 99.35 3.78 -0.09 8 CCCTG
E 51 TGTAG 17ba731f-98f9-4d62-8660-b10c17235d65 t 65 106.88 7.324 0.00896 TGTAG 106.93 8.56 -0.01 9 CCTGT
E 52 GTAGA 17ba731f-98f9-4d62-8660-b10c17235d65 t 66 100.77 5.392 0.00498 GTAGA 102.41 3.78 -0.40 10 CTGTA
E 52 GTAGA 17ba731f-98f9-4d62-8660-b10c17235d65 t 67 108.24 2.692 0.00332 GTAGA 102.41 3.78 1.41 10 CTGTA
E 53 TAGAT 17ba731f-98f9-4d62-8660-b10c17235d65 t 68 131.21 1.155 0.00365 TAGAT 126.66 5.87 0.71 . .
E 54 AGATC 17ba731f-98f9-4d62-8660-b10c17235d65 t 69 139.84 4.096 0.00432 AGATC 134.08 5.10 1.03 11 TGTAA
E 55 GATCT 17ba731f-98f9-4d62-8660-b10c17235d65 t 70 107.10 7.655 0.00697 GATCT 95.38 5.70 1.88 12 GTAAC
E 55 GATCT 17ba731f-98f9-4d62-8660-b10c17235d65 t 71 92.67 4.136 0.00863 GATCT 95.38 5.70 -0.44 12 GTAAC
E 55 GATCT 17ba731f-98f9-4d62-8660-b10c17235d65 t 72 86.46 4.580 0.00432 GATCT 95.38 5.70 -1.43 12 GTAAC
E 56 ATCTG 17ba731f-98f9-4d62-8660-b10c17235d65 t 73 75.55 1.159 0.00299 ATCTG 77.11 2.07 -0.69 13 TAACA
E 56 ATCTG 17ba731f-98f9-4d62-8660-b10c17235d65 t 74 74.25 0.992 0.00266 ATCTG 77.11 2.07 -1.27 13 TAACA
E 57 TCTGT 17ba731f-98f9-4d62-8660-b10c17235d65 t 75 84.84 1.324 0.00398 TCTGT 84.95 2.85 -0.04 14 AACAA
Hi, @hasindu2008 Thank you for your efforts. There still seem to be some problems. In bam file, read "17ba731f-98f9-4d62-8660-b10c17235d65" in position 22 has a Deletion.
In eventalign(0-based) , there should be a deletion at position 21.
So in this line,
E 17 TCCCA 17ba731f-98f9-4d62-8660-b10c17235d65 t 14 66.47 0.907 0.00498 TCCCA 64.83 2.07 0.72 7 TCCCT
E 17 TCCCA 17ba731f-98f9-4d62-8660-b10c17235d65 t 15 62.26 1.074 0.00498 TCCCA 64.83 2.07 -1.14 7 TCCCT
readkmer should be TCCC. , "." means deletion. Because base A is deletion here. (Or you can use "" presents deletion, I see that the modkit of nanopore does this)
Similarly,
E 18 CCCAG 17ba731f-98f9-4d62-8660-b10c17235d65 t 16 68.51 2.112 0.00299 CCCAG 71.95 2.11 -1.49 8 CCCTG
E 18 CCCAG 17ba731f-98f9-4d62-8660-b10c17235d65 t 17 107.28 11.300 0.00764 NNNNN 0.00 0.00 inf 8 CCCTG
read_kmer should be CCC.G .
Best wishes, Kirito
@kir1to455 I had a look over the weekend and due to the way that f5c internally keeps metadata, it is a lot of work to implement such a kmer format. The reason is, f5c internally just keeps the k-mer position (index) on the read and then when printing the output it prints the relevant k-mer that starts from the given index. Is it possible for you to use the "." in the read k-mer position to deduce the necessary format?
E 37 AACTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 40 91.50 1.696 0.00797 AACTT 92.87 2.73 -0.46 . .
E 37 AACTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 41 93.03 1.487 0.00498 AACTT 92.87 2.73 0.05 . .
E 37 AACTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 42 97.74 1.844 0.00299 AACTT 92.87 2.73 1.63 . .
E 37 AACTT 17ba731f-98f9-4d62-8660-b10c17235d65 t 43 91.90 1.740 0.00299 AACTT 92.87 2.73 -0.32 . .
For instance, "." in this case says that the position incurs a deletion.
Alternatively,
Hi, @hasindu2008
It seems hard to use "." in the read k-mer position to deduce the format.
The reason is that if eventalign didn't have the position, it will not output the "." in read k-mer.
At position 37,it had a deletion. So I can deduce the foramt.
Similarly, at position 21, it also had a deletion. But eventalign didn't have the position. So I couldn't know In position 17 the motif should be TCCCA
or TCCC.
Thank you for your help! If f5c is hard to implement this function, then forget it. Best wishes, Kirito
Oh, I see what you mean. I think the way to go ahead in f5c to implement this would be to add an additional step that parses the cigar string that deduce the format that you require.
Is this for a mind-storming/prototyping project you are requiring this option or is it that you already have a working script but that wants something faster? If it is the former, I would suggest that you write a script that parses the CIGAR in the BAM file in combination with the event align TSV to print the custom output. If it is later, I am happy to put the effort to implement this into f5c as it is worth the effort if it is an already working pipeline that needs performance improvement.
Hi, @hasindu2008 I already have a python script to parse the CIGAR in the BAM file. I used sam2tsv to output the bases at all positions. However, for each line (each read and position), it needs to iterate through the entire read to determine what the base is at this position. This adds a lot of unnecessary overhead. I believe C language can handle this better.
Best wishes, Kirito
From your description, I could not fully understand. If you email me the part of the algorithm that does this to hasindu@unsw.edu.au or put it here, I can see if I can understand. Also, for this read "17ba731f-98f9-4d62-8660-b10c17235d65" do you have the intended output? How do you handle the insertions?
Hi, @hasindu2008
I have sent the code to your email, please check.
@kir1to455 sorry for the delay. Finally got to implement with the help of the email you sent.
Could you check the output below? It prints the 5-mers with . b.txt
Hi,@hasindu2008 The result looks fine. In position 36, it output the "C.ATT".This is very good. In position 37, The read kmer should be .ATTT. But the script seems to always replace the deletion of the first position with “.". I want to output the complete read-kmer for these location.
Best wishes, Kirito
ah. See if this looks good? a.txt
Hi, @hasindu2008 Nice results! Thanks for your help! I hope this feature can be added to a new branch! Looking forward to the new releases!
Best wishes, Kirito
It is currently in this branch, which you can give a try: https://github.com/hasindu2008/f5c/tree/read_kmer
Hi, @hasindu2008 I try https://github.com/hasindu2008/f5c/tree/read_kmer with RNA004 data. I want to know if this function works for the option "--pore rna004" with 9-kmer ? I seem to be getting different results than expected.
Best wishes, Kirito
Hey,
I don't think I ever tested on RNA004 data. If you could give me a tint example like last time, I could have a look over the weekend.
Thanks
Hi @hasindu2008 , Thank you for developing f5c ! I wonder know whether f5c eventalign can output basecalling kmer? Generally, f5c eventalign will output the following header files.
contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx samples transcript 9 TTATA 0 t 27 93.44 2.558 0.00465 TTATA 93.29 2.63 0.05 143742 143756 94.4323,94.7009,93.4921,94.5666,96.3125,94.7009,96.1782,92.9549,93.6264,94.8352,91.4775,94.8352,89.4629,86.6424
Whether a column basecalling kmer can be added at the end? Like this contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx samples basecalling_kmer transcript 9 TTATA 0 t 27 93.44 2.558 0.00465 TTATA 93.29 2.63 0.05 143742 143756 94.4323,94.7009,93.4921,94.5666,96.3125,94.7009,96.1782,92.9549,93.6264,94.8352,91.4775,94.8352,89.4629,86.6424 TTATA
I think the reason for this is that basecalling error occurs in fastq files. (Like A-->T mutation, or deletion and so on) Take a example. Here is my eventalign file. Here is the tsv file I generated with sam2tsv.
In position 39, the reference kmer is CTTTC. But its sequence in fastq file is TTTTC.
Best wishes, Kirito