Closed hee014789 closed 6 months ago
Hello @hee014789,
Could you tell me which modified base model you used to generate the mod-BAM? The threshold will not change the N_nocall
column, it will only change the N_valid_cov
and N_fail
columns (see the documentation for all the details).
If you can make an genome browser screen shot of one of the sections that might help as well.
Thank you for your response. The BAM file was generated using guppy's res_dna_r941_min_modbases_5mC_5hmC_v001.cfg to detect both 5mC and 5hmC. Below is a screenshot from IGV, where 5mC is marked in pink and 5hmC in yellow. I'm not sure if this is exactly what you're looking for.
Hello @hee014789,
Thanks. I was hoping to see what a genome column looks like zoomed in. So if you put MT:410-411
into IGV, get a picture. If you could even attach a BAM with some sample data demonstrating the issue I can help figure out what's going on.
I'm sorry for the inconvenience. Here's an IGV screenshot of the same sample data. Although there are more reads than captured in the screenshot, only two are shown with the yellow modification tag (5hmC).
Hello @hee014789,
No worries. I was basically looking for anything that seemed "out of sorts" but this IGV shot you have looks pretty normal to me. It appears that most of the reads simply don't have a modification call at that position so seeing them counted in N_nocall
makes sense. I don't see a corresponding 5hmC/5mC (all-context) R9 model in dorado or rerio, however if you're willing to only look at CG contexts, there are 5hmCG/5mCG models (e.g. dna_r9.4.1_e8_sup@v3.3_5mCG_5hmCG@v0
) available for dorado which might be worth a try. If you send a BAM I can take a look at the tags, there is a possibility that it's an old format-related bug.
Thank you very much for your reply. I have a very basic question about each term. Let's assume that at filter-threshold: C=0.9, the information on one site is as follows. N_mod=1, N_canonical=2, N_fail=3, and N_nocall=30. Does this mean that among all the reads aligned at this position, there is 1 read with a modified probability of 0.9 or higher, 3 reads with a modified probability of less than 0.9, and 2 reads with a certain probability of being canonical (unmodified)? So, are the 30 reads corresponding to N_nocall those whose unmodified probability is below a certain value, or are they those for which there is no information about whether they are modified or unmodified? I have read the definition of N_nocall, but I don't understand it well, so I am asking a question.
Hi, I am the PhD who performed the basecalling, indeed with that model I had to update the tags. Sadly I chose that model even if is old cause we want to look at all context modifications of C. As we are not very interested on motifs we are running now modkit to get the bedMethyl files as:
modkit pileup --log-filepath pileup.individual_auto -f 1 --motif C 0 --ref Gallus_gallus.WASHUC2.57.dna.chromosome.MT.fa --filter-threshold 0.8 $i $x.0.8.threshold.bed
The output looks like this:
MT 68 69 m 2 + 68 69 255,0,0 2 100.00 2 0 0 107 16 32 1597
MT 74 75 m 9 + 74 75 255,0,0 9 44.44 4 3 2 49 1 27 1668
MT 109 110 m 1 + 109 110 255,0,0 1 0.00 0 1 0 130 1 38 1589
MT 110 111 m 3 + 110 111 255,0,0 3 100.00 3 0 0 22 17 83 1634
MT 140 141 m 1 + 140 141 255,0,0 1 0.00 0 1 0 8 2 35 1717
MT 157 158 m 2 + 157 158 255,0,0 2 0.00 0 0 2 14 2 9 1737
MT 236 237 m 1 + 236 237 255,0,0 1 0.00 0 0 1 9 0 17 1743
MT 253 254 m 1 + 253 254 255,0,0 1 100.00 1 0 0 21 0 42 1706
MT 278 279 m 1 + 278 279 255,0,0 1 0.00 0 1 0 62 0 36 1672
MT 326 327 m 1 + 326 327 255,0,0 1 0.00 0 1 0 28 0 14 1730
MT 339 340 m 1 + 339 340 255,0,0 1 0.00 0 1 0 50 1 41 1680
MT 348 349 m 1 + 348 349 255,0,0 1 0.00 0 1 0 15 0 19 1738
MT 385 386 m 2 + 385 386 255,0,0 2 100.00 2 0 0 42 3 35 1691
MT 388 389 m 7 + 388 389 255,0,0 7 71.43 5 2 0 17 11 112 1626
One of our main questions is why we have such low canonical callings, is true that when looking at the probabilities distribution histogram for C it ranges from 0.5 to 0.99, so we expect to have some amount in the fail column, but not the majority placed in the no call field. See the histrogram of probabilities distribution for cytosine:
Could we count the no call column also in the total amount of C that has no modification? With the definition that is provided in the manual Nnocall - Number of reads aligned to this reference position, with the correct canonical base, but without a base modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a CG->CH substitution such that no modification call was produced by the basecaller. and as there is no motif, is only one nucleotide we were wondering if modkit is just classifying all the canonical Cs in the nocall column. I am also attaching a subset of the bam file that was used to generate the bed file. As bam format is not supported I attach it as a txt.
Thank you so much!
Hello @hee014789,
No worries. I was basically looking for anything that seemed "out of sorts" but this IGV shot you have looks pretty normal to me. It appears that most of the reads simply don't have a modification call at that position so seeing them counted in
N_nocall
makes sense. I don't see a corresponding 5hmC/5mC (all-context) R9 model in dorado or rerio, however if you're willing to only look at CG contexts, there are 5hmCG/5mCG models (e.g.dna_r9.4.1_e8_sup@v3.3_5mCG_5hmCG@v0
) available for dorado which might be worth a try. If you send a BAM I can take a look at the tags, there is a possibility that it's an old format-related bug.
Hello @Violeta-de-Anca and @hee014789,
Sorry for the delay, starting with @hee014789's previous question, let's use your example:
filter threshold, C=0.9 N_mod=1 N_canonical=2 N_fail=3 N_nocall=30
This means:
Now to @Violeta-de-Anca's question and addendum.
Seeing that most of the reads are being put into N-nocall
makes me think there is a problem with how the tags were handled. Could you tell me the command (and version of the tools) you used to convert or update the tags? The output probability distribution looks like what I would expect (given the legend, looks like you're doing some interesting experiments btw!).
Could we count the no call column also in the total amount of C that has no modification?
No this would be incorrect. Please either give me all the commands run on the basecalls or send me a sample of the basecalls (before anything is done to them) so I can inspect them.
In the SAM file sample you sent, both of the reads have MM tags like this: MM:Z:C+C?;
, which simply means that there aren't any modification calls on that read, so I would expect that these reads would contribute to N_nocall
.
My current guess is there is a bug or misstep in converting the tags from the original basecalls. Could you send me a sample of those (with the BAM header) please?
Hi, Thank you so much for your fast response! The version of modkit with which I did the update of the tags is: mod_kit 0.1.7 but I am now using mod_kit 0.2.5 in a server. The command used for updating the tags was
for i in $(cat ./sorted.list);
do
modkit update-tags $i ${i%.bam}.implicit.bam --mode ambiguous
done
As when I tried the first time it was giving me this error (extracted from the log file):
[src/read_cache.rs::341][2023-05-30 18:47:33][DEBUG] read bbe1502f-754c-469e-b4fa-1adf2e267df0, Skipped: record bbe1502f-754c-469e-b4fa-1adf2e267df0 has un-allowed mode (ImplicitProbModified), use '--force-allow-implicit' or 'modkit update-tags --mode ambiguous'
Here I attach a subset of the bam files before the update of the tags: subset_no_implicit.txt
And here is a subset of the bams after the update of the tags: subset_implicit.txt
One thing that now I wonder, I always understood as canonical, in the context of 5mC, a cytosine that is not modified, but has passed the threshold, but giving the explanation I am doubting if I have misunderstood what is canonical this whole time?
Thank you so much for your help! Violeta de Anca
Hello @Violeta-de-Anca,
There is something wrong with updating the tags. I'm trying to figure it out on my side. Could you attach the subset of reads using samtools view -bh
so that the attached files can be used by modkit/samtools?
Thanks.
Hi,
Yes here is an extract (even if is .txt is a bam file, but I couldn't attach as .bam) of the bam file before the updating of the tags: subset_bh_samtools.txt And here of the same file after updating the tags: subset_bh_samtools_updated_tags.txt
Thank you so much for your help! Violeta de Anca
Hello @Violeta-de-Anca,
Looks like these files have been truncated or corrupted in some way.
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read_block] Failed to read BGZF block data at offset 1602 expected 33189 bytes; hread returned 1237
One way to make the subset is to use this script:
samtools view -F 2308 ${all_reads} MT:68-389 | cut -f 1 | head -n 10 > read_ids.txt
samtools view -bh ${all_reads} MT:68-389 -N read_ids.txt > subset_reads.bam
# this part is just so you can upload it to github
zip subset_reads.bam.zip subset_reads.bam
Then send me subset_reads.bam.zip
, ${all_reads}
is all of your mapped reads. You're not blocking me figuring it out on my side, but I'd like to be certain that the fix (when I find it) works on your data.
Hi, I am so sorry, here they are the files following your commands: Before updating the tags: subset_before_updating.bam.zip After updating the tags: subset_after_updating.bam.zip
Sorry for the trouble! Violeta
Hello @Violeta-de-Anca and @hee014789,
Thanks for sending the data over. You've uncovered a bug in modkit update-tags
, and I appreciate you finding it! I've attached a build to this thread and pushed the branch to github. I've also attached the file you sent after running modkit update-tags
then modkit pileup --no-filtering
, I think the results are looking more like what you would expect.
I'm going to make a release by the end of the week with this fix and a few others, so I think you can confidently use this build then switch to the latest release when it comes out. Of course, keep me informed if you find any other problems.
Thanks. result_files.zip modkit_deveded4ed_centos7_x86_64.tar.gz
Thank you so much!
I have rerun the modkit update-tags as:
modkit_deveded4ed_centos7_x86_64/dist/modkit update-tags --mode implicit $i ${i%.bam}.implicit.bam
And then the pileup like:
modkit pileup --log-filepath pileup.individual_auto -f 1 --motif C 0 --ref Gallus_gallus.WASHUC2.57.dna.chromosome.MT.fa --filter-threshold 0.8 $i $x.0.8.threshold.bed
And the output looked so much better, now all the canonical C's are on their right place:
MT 15 16 h 5 + 15 16 255,0,0 5 0.00 0 5 0 2 0 0 0
MT 15 16 m 5 + 15 16 255,0,0 5 0.00 0 5 0 2 0 0 0
MT 16 17 h 6 + 16 17 255,0,0 6 0.00 0 6 0 1 0 0 0
MT 16 17 m 6 + 16 17 255,0,0 6 0.00 0 6 0 1 0 0 0
MT 20 21 h 7 + 20 21 255,0,0 7 0.00 0 7 0 0 0 0 0
Thank you so much for the help! Violeta de Anca
Fixed with v0.2.8-rc1
Hello, I'm a student currently working on my master's project. Given that our lab is taking its first steps into nanopore sequencing, we're navigating somewhat blindly and lack sufficient resources for guidance, leading me to seek insights here.
I've used modkit pileup to generate a bed file and I have some questions regarding the counts of Nmod and Nnocall.
Observing the results of the pileup, I notice that the numbers of Nmod and Ncanonical are mostly 0 or 1. Even when I lower the C filter threshold from 0.9 to 0.6, there's no change. It appears that the number of Nnocall is similar to the number of mapped reads, which suggests to me that among over 70 reads mapped to a site, possibly only one read exceeds the 0.6 probability threshold for a modified C. Is my understanding correct? I would like to increase the counts of Nmod or Ncanonical for the sake of reliability in further analysis. Could you advise me on which options I need to adjust to achieve this? Below are options I've used in my code and a part of the bed file it generated.
--log-filepath /path/to/logfile.log --only-tabs -f 1.0 --filter-threshold C:0.9 --motif C 0 --ref /path/to/reference_genome.fa
MT 410 411 h 1 + 410 411 255,0,0 1 100.00 1 0 0 2 0 0 76 MT 410 411 m 1 + 410 411 255,0,0 1 0.00 0 0 1 2 0 0 76 MT 411 412 h 1 - 411 412 255,0,0 1 100.00 1 0 0 1 0 1 84 MT 411 412 m 1 - 411 412 255,0,0 1 0.00 0 0 1 1 0 1 84