nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
144 stars 8 forks source link

Questions Regarding the Counts of Nmod and Nnocall #160

Closed hee014789 closed 6 months ago

hee014789 commented 7 months ago

Hello, I'm a student currently working on my master's project. Given that our lab is taking its first steps into nanopore sequencing, we're navigating somewhat blindly and lack sufficient resources for guidance, leading me to seek insights here.

I've used modkit pileup to generate a bed file and I have some questions regarding the counts of Nmod and Nnocall.

Observing the results of the pileup, I notice that the numbers of Nmod and Ncanonical are mostly 0 or 1. Even when I lower the C filter threshold from 0.9 to 0.6, there's no change. It appears that the number of Nnocall is similar to the number of mapped reads, which suggests to me that among over 70 reads mapped to a site, possibly only one read exceeds the 0.6 probability threshold for a modified C. Is my understanding correct? I would like to increase the counts of Nmod or Ncanonical for the sake of reliability in further analysis. Could you advise me on which options I need to adjust to achieve this? Below are options I've used in my code and a part of the bed file it generated.

--log-filepath /path/to/logfile.log --only-tabs -f 1.0 --filter-threshold C:0.9 --motif C 0 --ref /path/to/reference_genome.fa

MT 410 411 h 1 + 410 411 255,0,0 1 100.00 1 0 0 2 0 0 76 MT 410 411 m 1 + 410 411 255,0,0 1 0.00 0 0 1 2 0 0 76 MT 411 412 h 1 - 411 412 255,0,0 1 100.00 1 0 0 1 0 1 84 MT 411 412 m 1 - 411 412 255,0,0 1 0.00 0 0 1 1 0 1 84

ArtRand commented 7 months ago

Hello @hee014789,

Could you tell me which modified base model you used to generate the mod-BAM? The threshold will not change the N_nocall column, it will only change the N_valid_cov and N_fail columns (see the documentation for all the details).

If you can make an genome browser screen shot of one of the sections that might help as well.

hee014789 commented 7 months ago

Thank you for your response. The BAM file was generated using guppy's res_dna_r941_min_modbases_5mC_5hmC_v001.cfg to detect both 5mC and 5hmC. Below is a screenshot from IGV, where 5mC is marked in pink and 5hmC in yellow. I'm not sure if this is exactly what you're looking for.

Screenshot 2024-04-12 at 09 03 37
ArtRand commented 7 months ago

Hello @hee014789,

Thanks. I was hoping to see what a genome column looks like zoomed in. So if you put MT:410-411 into IGV, get a picture. If you could even attach a BAM with some sample data demonstrating the issue I can help figure out what's going on.

hee014789 commented 7 months ago

I'm sorry for the inconvenience. Here's an IGV screenshot of the same sample data. Although there are more reads than captured in the screenshot, only two are shown with the yellow modification tag (5hmC).

Screenshot 2024-04-12 at 17 07 36
ArtRand commented 7 months ago

Hello @hee014789,

No worries. I was basically looking for anything that seemed "out of sorts" but this IGV shot you have looks pretty normal to me. It appears that most of the reads simply don't have a modification call at that position so seeing them counted in N_nocall makes sense. I don't see a corresponding 5hmC/5mC (all-context) R9 model in dorado or rerio, however if you're willing to only look at CG contexts, there are 5hmCG/5mCG models (e.g. dna_r9.4.1_e8_sup@v3.3_5mCG_5hmCG@v0) available for dorado which might be worth a try. If you send a BAM I can take a look at the tags, there is a possibility that it's an old format-related bug.

hee014789 commented 7 months ago

Thank you very much for your reply. I have a very basic question about each term. Let's assume that at filter-threshold: C=0.9, the information on one site is as follows. N_mod=1, N_canonical=2, N_fail=3, and N_nocall=30. Does this mean that among all the reads aligned at this position, there is 1 read with a modified probability of 0.9 or higher, 3 reads with a modified probability of less than 0.9, and 2 reads with a certain probability of being canonical (unmodified)? So, are the 30 reads corresponding to N_nocall those whose unmodified probability is below a certain value, or are they those for which there is no information about whether they are modified or unmodified? I have read the definition of N_nocall, but I don't understand it well, so I am asking a question.

Violeta-de-Anca commented 6 months ago

Hi, I am the PhD who performed the basecalling, indeed with that model I had to update the tags. Sadly I chose that model even if is old cause we want to look at all context modifications of C. As we are not very interested on motifs we are running now modkit to get the bedMethyl files as:

modkit pileup --log-filepath pileup.individual_auto -f 1 --motif C 0 --ref Gallus_gallus.WASHUC2.57.dna.chromosome.MT.fa --filter-threshold 0.8 $i $x.0.8.threshold.bed

The output looks like this:

MT      68      69      m       2       +       68      69      255,0,0 2 100.00 2 0 0 107 16 32 1597
MT      74      75      m       9       +       74      75      255,0,0 9 44.44 4 3 2 49 1 27 1668
MT      109     110     m       1       +       109     110     255,0,0 1 0.00 0 1 0 130 1 38 1589
MT      110     111     m       3       +       110     111     255,0,0 3 100.00 3 0 0 22 17 83 1634
MT      140     141     m       1       +       140     141     255,0,0 1 0.00 0 1 0 8 2 35 1717
MT      157     158     m       2       +       157     158     255,0,0 2 0.00 0 0 2 14 2 9 1737
MT      236     237     m       1       +       236     237     255,0,0 1 0.00 0 0 1 9 0 17 1743
MT      253     254     m       1       +       253     254     255,0,0 1 100.00 1 0 0 21 0 42 1706
MT      278     279     m       1       +       278     279     255,0,0 1 0.00 0 1 0 62 0 36 1672
MT      326     327     m       1       +       326     327     255,0,0 1 0.00 0 1 0 28 0 14 1730
MT      339     340     m       1       +       339     340     255,0,0 1 0.00 0 1 0 50 1 41 1680
MT      348     349     m       1       +       348     349     255,0,0 1 0.00 0 1 0 15 0 19 1738
MT      385     386     m       2       +       385     386     255,0,0 2 100.00 2 0 0 42 3 35 1691
MT      388     389     m       7       +       388     389     255,0,0 7 71.43 5 2 0 17 11 112 1626

One of our main questions is why we have such low canonical callings, is true that when looking at the probabilities distribution histogram for C it ranges from 0.5 to 0.99, so we expect to have some amount in the fail column, but not the majority placed in the no call field. See the histrogram of probabilities distribution for cytosine: image

Could we count the no call column also in the total amount of C that has no modification? With the definition that is provided in the manual Nnocall - Number of reads aligned to this reference position, with the correct canonical base, but without a base modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a CG->CH substitution such that no modification call was produced by the basecaller. and as there is no motif, is only one nucleotide we were wondering if modkit is just classifying all the canonical Cs in the nocall column. I am also attaching a subset of the bam file that was used to generate the bed file. As bam format is not supported I attach it as a txt.

subset_good.txt

Thank you so much!

Hello @hee014789,

No worries. I was basically looking for anything that seemed "out of sorts" but this IGV shot you have looks pretty normal to me. It appears that most of the reads simply don't have a modification call at that position so seeing them counted in N_nocall makes sense. I don't see a corresponding 5hmC/5mC (all-context) R9 model in dorado or rerio, however if you're willing to only look at CG contexts, there are 5hmCG/5mCG models (e.g. dna_r9.4.1_e8_sup@v3.3_5mCG_5hmCG@v0) available for dorado which might be worth a try. If you send a BAM I can take a look at the tags, there is a possibility that it's an old format-related bug.

ArtRand commented 6 months ago

Hello @Violeta-de-Anca and @hee014789,

Sorry for the delay, starting with @hee014789's previous question, let's use your example:

filter threshold, C=0.9 N_mod=1 N_canonical=2 N_fail=3 N_nocall=30

This means:

Now to @Violeta-de-Anca's question and addendum.

Seeing that most of the reads are being put into N-nocall makes me think there is a problem with how the tags were handled. Could you tell me the command (and version of the tools) you used to convert or update the tags? The output probability distribution looks like what I would expect (given the legend, looks like you're doing some interesting experiments btw!).

Could we count the no call column also in the total amount of C that has no modification?

No this would be incorrect. Please either give me all the commands run on the basecalls or send me a sample of the basecalls (before anything is done to them) so I can inspect them.

In the SAM file sample you sent, both of the reads have MM tags like this: MM:Z:C+C?;, which simply means that there aren't any modification calls on that read, so I would expect that these reads would contribute to N_nocall.

My current guess is there is a bug or misstep in converting the tags from the original basecalls. Could you send me a sample of those (with the BAM header) please?

Violeta-de-Anca commented 6 months ago

Hi, Thank you so much for your fast response! The version of modkit with which I did the update of the tags is: mod_kit 0.1.7 but I am now using mod_kit 0.2.5 in a server. The command used for updating the tags was

for i in $(cat ./sorted.list);
do
        modkit update-tags $i ${i%.bam}.implicit.bam --mode ambiguous
done

As when I tried the first time it was giving me this error (extracted from the log file):

[src/read_cache.rs::341][2023-05-30 18:47:33][DEBUG] read bbe1502f-754c-469e-b4fa-1adf2e267df0, Skipped: record bbe1502f-754c-469e-b4fa-1adf2e267df0 has un-allowed mode (ImplicitProbModified), use '--force-allow-implicit' or 'modkit update-tags --mode ambiguous'

Here I attach a subset of the bam files before the update of the tags: subset_no_implicit.txt

And here is a subset of the bams after the update of the tags: subset_implicit.txt

One thing that now I wonder, I always understood as canonical, in the context of 5mC, a cytosine that is not modified, but has passed the threshold, but giving the explanation I am doubting if I have misunderstood what is canonical this whole time?

Thank you so much for your help! Violeta de Anca

ArtRand commented 6 months ago

Hello @Violeta-de-Anca,

There is something wrong with updating the tags. I'm trying to figure it out on my side. Could you attach the subset of reads using samtools view -bh so that the attached files can be used by modkit/samtools?

Thanks.

Violeta-de-Anca commented 6 months ago

Hi,

Yes here is an extract (even if is .txt is a bam file, but I couldn't attach as .bam) of the bam file before the updating of the tags: subset_bh_samtools.txt And here of the same file after updating the tags: subset_bh_samtools_updated_tags.txt

Thank you so much for your help! Violeta de Anca

ArtRand commented 6 months ago

Hello @Violeta-de-Anca,

Looks like these files have been truncated or corrupted in some way.

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read_block] Failed to read BGZF block data at offset 1602 expected 33189 bytes; hread returned 1237

One way to make the subset is to use this script:

samtools view -F 2308 ${all_reads} MT:68-389 | cut -f 1 | head -n 10 > read_ids.txt
samtools view -bh ${all_reads} MT:68-389 -N read_ids.txt > subset_reads.bam

# this part is just so you can upload it to github
zip subset_reads.bam.zip subset_reads.bam 

Then send me subset_reads.bam.zip, ${all_reads} is all of your mapped reads. You're not blocking me figuring it out on my side, but I'd like to be certain that the fix (when I find it) works on your data.

Violeta-de-Anca commented 6 months ago

Hi, I am so sorry, here they are the files following your commands: Before updating the tags: subset_before_updating.bam.zip After updating the tags: subset_after_updating.bam.zip

Sorry for the trouble! Violeta

ArtRand commented 6 months ago

Hello @Violeta-de-Anca and @hee014789,

Thanks for sending the data over. You've uncovered a bug in modkit update-tags, and I appreciate you finding it! I've attached a build to this thread and pushed the branch to github. I've also attached the file you sent after running modkit update-tags then modkit pileup --no-filtering, I think the results are looking more like what you would expect.

I'm going to make a release by the end of the week with this fix and a few others, so I think you can confidently use this build then switch to the latest release when it comes out. Of course, keep me informed if you find any other problems.

Thanks. result_files.zip modkit_deveded4ed_centos7_x86_64.tar.gz

Violeta-de-Anca commented 6 months ago

Thank you so much!

I have rerun the modkit update-tags as:

modkit_deveded4ed_centos7_x86_64/dist/modkit update-tags --mode implicit $i ${i%.bam}.implicit.bam

And then the pileup like:

modkit pileup --log-filepath pileup.individual_auto -f 1 --motif C 0 --ref Gallus_gallus.WASHUC2.57.dna.chromosome.MT.fa --filter-threshold 0.8 $i $x.0.8.threshold.bed

And the output looked so much better, now all the canonical C's are on their right place:

MT      15      16      h       5       +       15      16      255,0,0 5 0.00 0 5 0 2 0 0 0
MT      15      16      m       5       +       15      16      255,0,0 5 0.00 0 5 0 2 0 0 0
MT      16      17      h       6       +       16      17      255,0,0 6 0.00 0 6 0 1 0 0 0
MT      16      17      m       6       +       16      17      255,0,0 6 0.00 0 6 0 1 0 0 0
MT      20      21      h       7       +       20      21      255,0,0 7 0.00 0 7 0 0 0 0 0

Thank you so much for the help! Violeta de Anca

ArtRand commented 6 months ago

Fixed with v0.2.8-rc1