nanoporetech / megalodon

Megalodon is a research command line tool to extract high-accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information-rich basecalling neural network output to a reference genome/transcriptome.

Allow split read alignments #71

Closed: franrodalg closed this issue 3 years ago

franrodalg commented 3 years ago

Hi @marcus1487

I wonder if it would be easy to add information about the position within the read (in addition to the position in the reference) to the per-read variant and modification calls. This would be very useful when working on repetitive regions, since I don't think there's currently any way of distinguishing the various calls coming from the same read.

Cheers, Fran

marcus1487 commented 3 years ago

Megalodon provides both reference-anchored modified base calls (via --outputs per_read_mods mod_mappings) and basecall-anchored modified base calls (via --outputs mod_basecalls).

The basecall anchored modified base scores have positions linked to the original basecalls, but not to the reference. These basecalls could be mapped to the reference and the scores transferred to reference positions, but there are issues with this pipeline. Importantly, this is not how Megalodon processes a read. Megalodon uses the basecalls to determine the reference mapping sequence and to align the raw signal to the reference bases. After this the basecalls are discarded for further processing. This produces more accurate modified base calls than simply transferring the basecall-anchored modified base scores onto the reference. Thus the final reference-anchored modified base scores don't actually have a mapping back onto the original basecalls for the read. It is technically possible to extract the rough stretch of signal assigned to a modified base call, but there is not currently a nice output for this type of data (this would be a very large file).

I hope this helps to add some context to this question. Please let me know if this answer is sufficient or if further information would help.

franrodalg commented 3 years ago

I think I understand what you mean, and I am not sure what the best option would be. In reality, I wouldn't need the exact position or the "stretch of signal", but "simply" a way of distinguishing multiple calls corresponding to the same reference position across a read, ideally in a way that indicates their arrangement (i.e., which call appears upstream of which). The paf file minimap2 generates by default includes information on where in the read each alignment comes from. This, unfortunately, seems to be lost when using a bam output, but perhaps it could be leveraged anyway: there would be no need to specify the exact position in the read, just an indication of which alignment each call belongs to (i.e., adding the query start and end columns of the minimap2 alignment to the per-read mod and variant files).
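For illustration, these are the query coordinates minimap2's Python bindings (mappy) already expose per alignment; this is only a sketch, with "ref.fa" and the read sequence as placeholders:

    # Illustration only: the query start/end coordinates are exposed per alignment
    # by minimap2's Python bindings (mappy). "ref.fa" and the read sequence below
    # are placeholders.
    import mappy

    aligner = mappy.Aligner("ref.fa", preset="map-ont")   # index the reference
    read_seq = "ACGT" * 1000                              # placeholder read sequence
    for hit in aligner.map(read_seq):                     # one entry per alignment of this read
        # q_st/q_en: interval on the read (query); r_st/r_en: interval on the reference
        print(hit.ctg, hit.r_st, hit.r_en, hit.q_st, hit.q_en, hit.strand, hit.is_primary)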

Would that make any sense?

marcus1487 commented 3 years ago

I think there might be some confusion concerning the terminology. An individual read can only have a single call per modified base at any reference position. Separate calls of the same modified base along a read will be assigned to different reference bases. Given this terminology I'm a bit confused at what you mean by "multiple calls corresponding to the same reference position across a read".

One output that might contain the requested information is the mapping.summary.txt file. This file contains the reference start and end coordinates. Repeating this mapping information in the per-read database or text output would bloat these files as this information is linked to the read as a whole as opposed to each individual call. Is this the information in which you are interested?

franrodalg commented 3 years ago

An individual read can only have a single call per modified base at any reference position.

I must admit I'm extremely shocked at the moment... (but I have double checked and there's no doubt that's the case). Is that a design choice or some requirement for it to actually work?

If multiple positions in the read map to the same position of the reference, how does Megalodon behave? Does it keep only one of the alignments? If so, which one? Or does it collapse them somehow?

I suppose not many people deal with repetitive regions, but if this behaviour cannot be changed I don't think I can use Megalodon's output at all...

marcus1487 commented 3 years ago

This is a design choice. One point to clarify here is that Megalodon methylation calling starts from the assumption that the reference is correct. If the reference is not matched to the sample then the reference should be refined first (call and apply variants to the reference, polish the reference, etc). This is part of what gives Megalodon more power to call modified bases with higher accuracy. Taking the reference as correct means that Megalodon has no ambiguity and only has to distinguish the canonical form from the modified/methylated form.

The other apparent disconnect here is that Megalodon does not use the basecall mapping or the basecall-anchored modified base scores for the higher accuracy reference-anchored modified base calls. The basecalls are used to find the reference sequence and then determine a mapping between the raw signal and the reference bases. After this the basecalls are discarded (as they potentially contain errors and the reference should be correct). The final reference-anchored modified base calls are determined from the raw signal mapping and the reference sequence followed by custom algorithms applied to the intermediate basecaller output (this is the real power of megalodon for high accuracy methylation calling). Thus there is no need to make any decisions about multiple basecall positions mapping to the same reference base as the original basecalls have been ignored at the time the reference-anchored call is being made.

It seems that the two best solutions here might be:

  1. Polish the reference and then run Megalodon methylation calling on the polished reference with resolved repeats
  2. Use the basecall anchored modified base calls with custom scripts to determine how to handle multiple base calls mapping to the same reference position.
franrodalg commented 3 years ago

Thanks, Marcus.

Maybe I misunderstood entirely what Megalodon intends to do. Are the (modified) basecalls then not computed by Megalodon itself? Are they directly from Guppy?

I might need to provide a bit more context for our research in case you could help us, but it seems to me that the only solution at this point might be to parse the (modified) base calls manually.

Our research focuses on the ribosomal DNA (rDNA), which in mammals appears as tandem repeats across multiple chromosomes, totalling hundreds of copies, and with each copy (an rDNA "unit") spanning around 45kb. This makes it extremely challenging to deal with, since the units are not all entirely identical and the number of copies varies from individual to individual (and potentially from cell to cell, or at least tissue to tissue), leading to the rDNA not being included in genome assemblies. The only realistic way to work with it is to append a single consensus rDNA sequence to the whole genome assembly as an additional "chromosome".

For a paper we are currently wrapping up, we have ultra-long Nanopore reads from mouse embryonic fibroblasts (MEFs), with several spanning multiple units. When aligning these, therefore, we have several regions within each read that align to the same rDNA consensus --- i.e., multiple positions in the reads will correspond to the same position in the reference. These multiple alignments potentially correspond to different genetic and epigenetic states, and our goal with Megalodon was to try to associate both --- i.e., identify whether particular rDNA haplotypes show distinct methylation levels/patterns. The per-read variant and modified base reports seemed perfect to address this goal, but now it seems that won't be the case.

In a case like this, how does Megalodon proceed? I'm still not entirely sure how the estimates are obtained. Maybe a concrete example might help. Consider a read spanning three units. For simplicity, let's assume that each entire unit's methylation state is consistent (M for methylated, m for unmethylated), and that the haplotype (the shade of blue) is perfectly associated with the methylation state:

[image: diagram of a read spanning three rDNA units, coloured by haplotype and labelled with methylation states]

Assuming no quality differences between the basecalls/alignments, which methylation state would Megalodon assign to each reference CpG site across the entire read? M, because it's the first one that appears? Or would it consider every occurrence of each position and attribute more probability to m because it's the most frequent one?

If Megalodon is not suitable for this situation, do you know if any other established tool could generate the per-read variant and modified base calls similar to the Megalodon ones but directly from the Guppy output?

marcus1487 commented 3 years ago

This concrete example definitely helps. With a reference including only a single copy of the rDNA consensus sequence Megalodon will only process one of the rDNA copies within this example read (the other two copies/mappings would be ignored completely). There will be supplementary mappings from minimap2 but Megalodon will ignore these mappings. Dealing with multi-mapping reads (both split read mappings as in this case and multiple reference hits) is a known issue, but requires some special care to make sure it is right and has thus not been added yet.

So in this case Megalodon will only process one of the subunits. This will be whichever mapping minimap2 supplies first (I'm not exactly sure how this is determined).

As noted previously, the basecalls and mappings are used only to link the raw signal and reference coordinates/sequence. So the limiting factor here is that there is no single contiguous mapping to the copies of rDNA as shown in your diagram.

As a workaround, could you add several concatenated units of the rDNA sequence to your reference instead of just the one unit (as shown in your diagram)? Then Megalodon will be able to map to these consecutive units and you can distinguish the parts of the read mapping to each individual rDNA unit. I agree that this sounds like an ideal application of the Megalodon outputs, so please do let me know if there are any issues with this potential solution as it would be great to see this work!
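In case it's useful, a minimal sketch of building such a concatenated reference; the file names, contig name and copy number are placeholders, not anything Megalodon requires:

    # Hypothetical helper: write a reference with N concatenated copies of the rDNA
    # consensus so that multi-unit reads can map contiguously. File names, the contig
    # name and the copy number are placeholders.
    n_copies = 20
    with open("rDNA_unit.fa") as fh:                      # single-unit consensus (placeholder path)
        unit = "".join(line.strip() for line in fh if not line.startswith(">"))
    with open("rDNA_multi.fa", "w") as out:
        out.write(">rDNA_x{}\n".format(n_copies))
        seq = unit * n_copies
        for i in range(0, len(seq), 60):                  # wrap FASTA lines at 60 bp
            out.write(seq[i:i + 60] + "\n")
    # Append this contig to the genome assembly (or use it alone) before running Megalodon.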

marcus1487 commented 3 years ago

To add a bit more background on the first part of your question (re: do the mod calls come from guppy or megalodon): the basecalling model specified defines the modified bases that can be detected. This modbase model produces two streams of data/probabilities (1: the canonical bases and 2: the modified state of the bases). When producing basecalls these streams of probabilities are decoded in order to determine the most likely sequence (the basecalls). Megalodon decodes these streams of probabilities in a slightly different way by proposing alternatives to the reference sequence (either variants or modified bases). To compute these scores a stretch of raw signal is queried for whether the reference or the proposed alternative (variant or modbase) is more likely. By fixing the rest of the reference (aside from the proposed alternative) in the region Megalodon improves the accuracy of the modified base scores as compared to the same calls without any knowledge of the reference sequence. There are more details about this algorithm, but this is the general principle. I hope this helps clear up how Megalodon functions a bit.
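As a purely illustrative toy (this is not Megalodon's actual decoding, which works through the raw-signal-to-reference mapping), the general principle of comparing a canonical and a modified candidate sequence against the same network output can be sketched like this:

    # Toy illustration only -- NOT Megalodon's actual algorithm, which anchors the raw
    # signal to the reference first. It only shows the principle of comparing a
    # canonical and a modified candidate sequence against the same network output.
    import numpy as np

    def seq_log_prob(posteriors, seq, alphabet="ACGTM"):
        # Sum of log-probabilities of each base of `seq` under per-position posteriors
        idx = [alphabet.index(b) for b in seq]
        return float(np.log(posteriors[np.arange(len(seq)), idx]).sum())

    ref = "ACGTCG"          # reference sequence around the site of interest
    alt = "ACGTMG"          # same sequence with the C proposed as modified ("M" in this toy alphabet)
    posteriors = np.random.default_rng(0).dirichlet(np.ones(5), size=len(ref))  # fake network output
    llr = seq_log_prob(posteriors, alt) - seq_log_prob(posteriors, ref)
    print("log-likelihood ratio (modified vs canonical):", round(llr, 3))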

franrodalg commented 3 years ago

Thank you so much for your replies, Marcus. This is being really helpful.

Regarding the "workaround" --- As far as I know, with most aligners this would lead to the great majority of reads having ambiguous alignments and thus being "discarded", right? Unless a read spans at least the total number of conjoined units, it could be mapped to more than one location equally successfully. Singe the great majority of reads are shorter than a single unit, that would mean we would lose them:

[image: diagram of reads shorter than a single unit mapping ambiguously across the concatenated rDNA units]

Would this not be the case with minimap2/Megalodon? I might try and let you know anyway.

Regarding the "background" --- Sorry if your answer addresses this and I didn't quite get it, but my question was more focused on the --outputs mod_basecalls file. In that case, we would have methylation calls for every rDNA unit that appears in the read, regardless of whether they correspond to the first minimap2 alignment or one of the supplementaries, right? If so, how are the "additional" calls that Megalodon would ignore produced? Or are they actually ignored and the output file never includes modified bases for those extra units?

franrodalg commented 3 years ago

We might try creating multiple references with different numbers of units conjoined and running Megalodon once for each version. That seems pretty impractical, but with enough patience it might actually work!

marcus1487 commented 3 years ago

For the ambiguous alignments, one of these should be marked as the primary alignment. This alignment will be processed by Megalodon, so you should have a valid mapping even for the single-copy reads. I'm not sure if minimap2 will put all the alignments on the first copy or place them randomly, but I believe you should get valid results for reads with fewer repeats than appear in the reference. I would suggest adding to the reference the largest number of copies that you hope to capture.

For the mod_basecalls question, this output will include all basecalls produced from the read. Nothing is thrown out from the read for this output. The calls that would be ignored in the per_read_mods output are ignored because the basecalls did not map to the reference and thus the signal corresponding to those basecalls was clipped. Thus that stretch of signal was never queried for methylation since it has no mapping to reference bases. The difference is whether the signal is queried without knowledge of the reference (standard basecall decoding) or if the signal is decoded with knowledge of the reference (thus requiring a mapping to the reference). So the ignored bits are not considered since the signal can't be anchored to any reference sequence.

I hope that might clear up the processing a bit. I understand it is a bit complicated (probably worth fleshing this out a bit more in the docs; this section in the docs is the current attempt).

franrodalg commented 3 years ago

It is indeed helping a lot!

I will try as you suggest and report the results. Do you think there would be any downside to including a substantially large number of units?

Edit: I just realised that, for this approach to fully work for variant calls as well, I would also need to artificially inflate the input VCF files to include as many copies as the reference, right?

marcus1487 commented 3 years ago

I don't think there would be an issue with adding a substantially large number of units, but a "real world" test would be best to determine this.

Yes, the VCF file would have to be spoofed over all the repeated units to extract the relevant information.
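A rough sketch of that spoofing, assuming the concatenated contig from the earlier example, a constant unit length and a simple uncompressed VCF (all names here are placeholders):

    # Rough sketch: replicate each VCF record across the concatenated units by shifting
    # POS by multiples of the unit length. Contig name, unit length and paths are
    # placeholders; the contig line in the header must also match the new reference,
    # and the result should be sorted/indexed afterwards (e.g. with bcftools).
    N_COPIES = 20
    UNIT_LEN = 45000                                      # placeholder unit length (bp)
    NEW_CTG = "rDNA_x20"                                  # name of the concatenated contig

    with open("rDNA_unit.vcf") as fin, open("rDNA_multi.vcf", "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)                          # header copied as-is (edit the contig line by hand)
                continue
            fields = line.rstrip("\n").split("\t")
            pos = int(fields[1])
            for copy in range(N_COPIES):
                fields[0] = NEW_CTG
                fields[1] = str(pos + copy * UNIT_LEN)
                fout.write("\t".join(fields) + "\n")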

It seems that you are most interested in the per-read results, so hopefully the Megalodon output can port into your downstream analysis. Let me know if any essential information appears to be missing that you require.

marcus1487 commented 3 years ago

Has this workaround resolved your issue @franrodalg ?

franrodalg commented 3 years ago

Hi @marcus1487, thanks for asking.

This morning, I was about to send a detailed comment reporting what we've observed from the workaround. Long story short, it seems to work only partially. However, I now have some intuition as to why that's the case, and I'm afraid there will be no way to sort it out through a workaround.

In particular, the multi-unit alignments we got were much shorter than anticipated, and all completely unmethylated. Digging deeper, it seemed that these units all correspond to a particular haplotype which, from the single-unit alignments, seems to be the only one with methylation calls across the entire unit. It seems that, in the mouse at least, some rDNA units are substantially shorter than the canonical reference (~33kb compared to the conventional ~45kb). Since the multi-unit reference includes full-length units, reads that contain a short one see their alignment split, making it impossible to observe any multi-unit alignment that involves other haplotypes (and methylation states).

Since there is no way of including every possible combination of short and long units in a spoof reference, and no one knows how these are actually arranged (or even if there is any particular order), I can't see any way for us to exploit this workaround.

Any ideas welcome!

marcus1487 commented 3 years ago

This sounds like something that does require a fix to Megalodon; in particular, appropriate handling of split alignments should help with this issue. I have been meaning to test this out (as noted in our previous discussions). I think this workaround should work once split read alignment is implemented. Note that in this fix I think I will disallow alignments which overlap on the reference or the read. It sounds as though this should not inhibit this particular use case, but please let me know if it might cause any issues. I may expose these disallowed-alignment rules as command line parameters to allow user testing of the results.

franrodalg commented 3 years ago

Thanks so much, Marcus. I really appreciate it. Let me know if you need any help or test data. I hope it will not be very difficult to implement!

franrodalg commented 3 years ago

Hi Marcus! Have you had the chance to check how to address this? Would it be difficult?

marcus1487 commented 3 years ago

I have not had a look at this yet. The edit to allow multi-mappers is relatively trivial (handling the whole list instead of taking the first element here in the code). The larger issue is deciding which extra alignments to exclude and retaining the read id information in a useful manner. I can look at adding a first version as a testing branch soon.
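Conceptually, the change is along these lines (a sketch with placeholder names, not the actual Megalodon source):

    # Conceptual sketch only (placeholder names, not the actual Megalodon source):
    # the change amounts to keeping every mapping returned for a read rather than
    # just the first one.
    import mappy

    aligner = mappy.Aligner("rDNA_multi.fa", preset="map-ont")   # placeholder reference
    alignments = list(aligner.map("ACGT" * 1000))                # placeholder read sequence
    first_only = alignments[:1]      # current behaviour: only the first mapping is processed
    to_process = alignments          # proposed behaviour: process the whole list
    # Megalodon would still need to decide which overlapping/secondary hits to exclude
    # and to carry a per-read mapping number through to the per-read outputs.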

franrodalg commented 3 years ago

Thanks, Marcus. Much appreciated

marcus1487 commented 3 years ago

Hi @franrodalg , I've just pushed a new branch sec_align which will allow supplementary/secondary alignments to be processed by megalodon. You can install this branch with the command pip install git+https://github.com/nanoporetech/megalodon.git@sec_align. With this installation you can add the --allow-supplementary-alignments flag to activate this new behavior. Hopefully I'll merge this into master for the next release (feedback would be helpful here ;).

I've had to make some decisions about where the information about the "mapping number" (the enumeration of mappings for each read) is stored. The SAM/BAM/CRAM output files (mappings, mod_mappings and variant_mappings) will all contain the supplementary mappings. The mappings and mod_mappings outputs will have the supplementary mappings noted in the alignment flag. The modified base calls will be stored in the database, but the mapping number will not be stored. I've enforced that the reference sequence output and the signal_mapping output will only include the primary mapping (these are the files used for model training, so likely not applicable to most users).

Caveats! Note that this means that the same read could produce multiple modified base calls on the same reference position. Also the same position within a read could produce multiple modified base calls on different reference positions (leading to over-counting of methylation/modified bases). Same applies to variant calling. For this reason I will likely leave the processing of secondary alignments off by default when this is merged into master.

I hope this helps in your research and look forward to your impressions on your results!

P.S. I did run this on my standard old dataset of 20 reads I've tested with for years and found that one of them is likely a chimeric read, so looks as though this may indeed be useful!

franrodalg commented 3 years ago

Thanks so much, Marcus! I'll give it a try as soon as possible and let you know!

franrodalg commented 3 years ago

Hi Marcus,

Sorry for having delayed this so much. The holidays and other projects interfered quite a bit.

Unfortunately, I'm getting the following error:

 ******************** WARNING: Guppy and pyguppy minor versions do not match. This may lead to a failure. Please install matching Guppy and pyguppy versions. ********************
****************************************************************************************************
    ERROR: Error connecting to Guppy server. Undefined error: Undefined return code: result.bad_request
****************************************************************************************************

I'm not sure why this is appearing now. Am I supposed to install a particular version of ont-pyguppy-client-lib or update guppy to allow this version to work?

Cheers, Fran

Editing to add, this is my pip freeze output:

$ pip freeze
cycler==0.10.0
Cython==0.29.21
h5py==3.1.0
joblib==1.0.0
kiwisolver==1.3.1
mappy==2.17
matplotlib==3.3.3
megalodon @ git+https://github.com/nanoporetech/megalodon.git@ca2af7e6c983b010d8a01ef5ed2239424ca0edc2
numpy==1.19.5
ont-fast5-api==3.1.6
ont-pyguppy-client-lib==4.4.1
packaging==20.8
pandas==1.2.1
Pillow==8.1.0
progressbar33==2.4
pyparsing==2.4.7
pysam==0.16.0.1
python-dateutil==2.8.1
pytz==2020.5
scikit-learn==0.24.1
scipy==1.6.0
seaborn==0.11.1
six==1.15.0
sklearn==0.0
threadpoolctl==2.1.0
tqdm==4.56.0
marcus1487 commented 3 years ago

Yes, the ont-pyguppy-client-lib version should match the Guppy version. You can resolve this by installing the correct version of ont-pyguppy-client-lib via pip install ont-pyguppy-client-lib==4.2.2 (assuming that is your Guppy version; see the first couple of lines of the Megalodon log.txt for the Guppy and pyguppy versions). As noted in #84, I'm working on updating the error message to print this recommended fix.

Do let me know how you get on trying this feature. I'm thinking I'll include it in the next release, but always good to have more confidence it does what it's supposed to!

franrodalg commented 3 years ago

Thanks, Marcus. It's running now! I'll let you know asap

BTW, I didn't see the new option in the help. Will that be updated in the release?

marcus1487 commented 3 years ago

It is in the --help-long arguments (there's a lot in there). I could consider adding this to the main help. My main concern is that most users would not consider the caveats completely when setting this option so it is better buried in the full help to indicate that it is meant for advanced users only.

franrodalg commented 3 years ago

That actually makes a lot of sense!

marcus1487 commented 3 years ago

I think I may actually go a bit further and add a warning message any time this option is specified with text similar to the caveats listed above. At least this way users have been warned. Look forward to your results!

franrodalg commented 3 years ago

I'm getting the following error (partially anonymised):

[16:47:47] Processing reads
Full output or empty input queues indicate I/O bottleneck
3 most common unsuccessful processing stages:
    20.2% (    138 reads) : Pyguppy get completed reads invalid errorProcess ReadWorker005:. return_code: result.failed"
Traceback (most recent call last): region too short for variant calling.                                                
  File "[ANON]/megalodon_sec_align/lib/python3.8/site-packages/megalodon/backends.py", line 819, in pyguppy_basecall
    read_sent = client.pass_read(pyguppy_read):13<03:35,  5.05reads/s, samples/s=3.74e+6]
  File "[ANON]/megalodon_sec_align/lib/python3.8/site-packages/pyguppy_client_lib/pyclient.py", line 251, in pass_read
    raise ConnectionError(calls           :   1%|          | 77/10000
ConnectionError: Not connected to the server. return_code: result.no_connection
output queue capacity per_read_mods       :   0%|          | 1/10000 
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/share/apps/centos7/python/gcc/4.8.5/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/share/apps/centos7/python/gcc/4.8.5/3.8.5/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "[ANON]/megalodon_sec_align/lib/python3.8/site-packages/megalodon/megalodon.py", line 315, in _process_reads_worker
    for bc_res in iter_bc_res():
  File "[ANON]/megalodon_sec_align/lib/python3.8/site-packages/megalodon/megalodon.py", line 293, in iter_bc_res
    for bc_res in model_info.iter_basecalled_reads(
  File "[ANON]/megalodon_sec_align/lib/python3.8/site-packages/megalodon/backends.py", line 1090, in iter_basecalled_reads
    for bc_res in self._run_pyguppy_backend(
  File "[ANON]/megalodon_sec_align/lib/python3.8/site-packages/megalodon/backends.py", line 998, in _run_pyguppy_backend
    for called_read, sig_info, seq_summ_info in self.pyguppy_basecall(
  File "[ANON]/megalodon_sec_align/lib/python3.8/site-packages/megalodon/backends.py", line 829, in pyguppy_basecall
    client.connect()
  File "[ANON]/megalodon_sec_align/lib/python3.8/site-packages/pyguppy_client_lib/pyclient.py", line 167, in connect
    raise ConnectionError("Connection attempt timed out: {!r}".format(return_code))
ConnectionError: Connection attempt timed out: result.timed_out

but the job hasn't crashed. Not sure if it's progressing regardless, though.

marcus1487 commented 3 years ago

I've just experienced this issue myself internally. It seems to occur with long reads or an overloaded server. In any case I am currently reviewing the fix for this and will push to github shortly. Increasing the --guppy-timeout parameter may help in the meantime.

franrodalg commented 3 years ago

My current --guppy-timeout is 600.0. Do you think I should increase it further and restart? Or should I wait for the new version?

marcus1487 commented 3 years ago

That is quite high already. I should have the new code in shortly. I'm not sure that this fix will be the complete fix; that is coming in the longer term. You may have to lower --reads-per-guppy-batch along with the --guppy-timeout parameter depending on your compute load (number of processes, system load, etc).

marcus1487 commented 3 years ago

The fix has been pushed to master, merged into sec_align, and pushed to github as well. Hopefully this works a bit better. Thanks for all the great feedback!

franrodalg commented 3 years ago

Thanks, Marcus. So shall I pip install --upgrade megalodon or do I need to rely on git cloning the repo?

marcus1487 commented 3 years ago

pip install git+https://github.com/nanoporetech/megalodon.git@sec_align is the best bet (may need to run pip uninstall megalodon first as the version hasn't been bumped yet).

franrodalg commented 3 years ago

Awesome! I will try first without changing timeout and reads per batch, and see if the new version alone solves it. Otherwise I might try and tweak the parameters and let you know which combination (hopefully) works!

franrodalg commented 3 years ago

Hi Marcus,

Unfortunately, I don't think it is working as expected. Megalodon has been running for 15 hours now, and no file other than log.txt has been updated since three minutes after it started running. In case you are interested, I attach the log.txt file. Even though I haven't checked every line, I think most of them, since the very start, are failed attempts to connect to the Guppy server.

log.txt

Do you have any suggestion on how to proceed? Shall I decrease the --reads-per-guppy-batch and try again?

Cheers

franrodalg commented 3 years ago

It finally crashed:

ERROR: failed_reads Getter queue has unexpectedly died likely via a segfault error. Please log this issue.

franrodalg commented 3 years ago

By reducing --reads-per-guppy-batch to 10, it completed in less than half an hour, apparently without issue!

marcus1487 commented 3 years ago

Yes. There seems to be an issue with the latest version of the Guppy server getting overloaded and not accepting reads more often. I've tried to catch this in Megalodon, but turning down --reads-per-guppy-batch is likely the best solution for now. A full fix will likely come with a later Guppy release and a Megalodon minor version bump.

marcus1487 commented 3 years ago

Feature now merged into master and included in release 2.2.10.

franrodalg commented 3 years ago

Thanks, Marcus! I really appreciate your effort in making this happen, even if it's only useful to a small minority of us :)

franrodalg commented 3 years ago

Sorry Marcus for resurfacing this after you've closed it, but I've been digging deeper and I'm not sure if I understand something.

As I told you before, our goal is to link methylation with haplotype in rDNA units. However, I see quite a bit of difference between the units that are represented in the methylation calls per read and the variant calls per read files:

[images: plots comparing the number of rDNA units represented per read in the per-read methylation and variant call files]

As you can see, there seems to be a shift between them. Do you have any insights about what could be causing this? Are the supplementary alignments being used differently in these two estimates?

Cheers, Fran

franrodalg commented 3 years ago

I think I managed to find a way to address this, but I would appreciate it if you could confirm this is expected. I have now kept a single entry in the per-read variant calls data for each combination of read_id, SNP position, and ref and alt probabilities, since I assume it is extremely unlikely that two distinct calls would have exactly the same likelihoods.
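Roughly, the de-duplication looks like this (just a sketch; the file and column names are placeholders for whatever the per-read text dump actually uses):

    # Sketch of the de-duplication: keep one entry per (read_id, position, probabilities).
    # The file and column names are placeholders for whatever the per-read text dump uses.
    import pandas as pd

    calls = pd.read_csv("per_read_variant_calls.txt", sep="\t")
    deduped = calls.drop_duplicates(subset=["read_id", "pos", "ref_log_prob", "alt_log_prob"])
    deduped.to_csv("per_read_variant_calls.dedup.txt", sep="\t", index=False)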

[image: updated plot after de-duplication]

Does that make sense?

marcus1487 commented 3 years ago

I was going to ask how the two graphs in the first plot were created. To me the "Number of rDNA units in read" would come from the mappings.summary.txt file; I'm not sure how these values would be extracted from the modified base and variant files. There are a number of reasons that a read might map over a region but be unable to produce a modified base or variant call. The main one is that, if a mod or variant occurs near a larger indel, it can be impossible to find the path through the data and Megalodon will not make a call. In any case I think counting the number of copies is more of a question for the mapping summary. I'd need a bit more information about how these plots were computed to help debug any issues.

franrodalg commented 3 years ago

Hi Marcus,

Sorry, I should have explained myself more clearly. I am not trying to estimate how many units are actually in the reads from the methylation and variant calls, just checking whether including supplementary alignments has actually made the analysis more exhaustive (i.e., whether we retain more CpGs and variant sites across the reads).

A simple example might help. Let's imagine that we are using a mock reference with 4 consecutive rDNA units, for which we have created the corresponding mock VCF file. (In reality we are using 20 units, but it's simpler to show here with just 4.) Also for simplicity, imagine that there is only one CpG (represented as a circle) and 2 variant sites (represented as triangles) in each unit:

[image: mock reference of 4 consecutive rDNA units, each with one CpG (circle) and two variant sites (triangles)]

Imagine that, for a read X, the modified bases per read file reports methylation for the CpG site at positions corresponding to units 1, 2, and 4 of the mock reference (this latter one probably from a separate alignment from the other two), and the variant calls per read file reports all SNPs for the same three units (represented with the shapes filled):

[image: read X, with the CpG and both variant sites called (filled shapes) in units 1, 2 and 4]

In this case, then, we would say that read X "has" 3 units according to both files.

It's unnecessary for all positions in the unit to appear in the files for us to say that the unit is included. For instance, in the following read Y:

[image: read Y, with three units covered but one variant position missing in two of them]

we would say that there are 3 units included in both modified bases and variant calls per read, despite one variant position being missing in two of the units.

It's also unnecessary for both files to include the same number of units. For instance, for the following read Z:

[image: a read with 2 units covered in the modified base calls but 4 in the variant calls]

the modified bases per read file includes only 2 units, but the variant calls per read file includes 4.

If only these three reads existed, then the figures I have included in previous posts would look as follows:

[image: unit counts per file for the three example reads]

This is to say that the two distributions need not look identical. In reality, however, with hundreds of potential CpGs and variant sites, it feels quite unlikely that a substantial number of reads would have units with variants covered but not CpGs, or vice versa.

Digging a bit deeper, I realised that I was (inadvertently) removing duplicates from the methylation data, and in reality both files show this strange peak at six exact duplications (i.e., equivalent positions reported as having the exact same probabilities):

[images: distributions of the number of exact duplications per call in the methylation and variant per-read files, both peaking at six]

Just in case it is not clear, I've created these figures by counting the number of times each combination of read_id, strand, probabilities of the two alternatives, and position (not according to the mock reference, but within the actual unit) appears in different "units" of the mock reference. I assume this should happen rarely by chance, so the fact that the most frequent number of duplications is 6 seems like a clear artefact of the alignment (i.e., of the supplementary alignments).
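For clarity, the counting is roughly this (a sketch; the unit length, file and column names are placeholders):

    # Sketch of the counting: collapse the mock-reference coordinate to a within-unit
    # coordinate, then count in how many "units" each identical call appears.
    # Unit length, file and column names are placeholders.
    import pandas as pd

    UNIT_LEN = 45000
    calls = pd.read_csv("per_read_mod_calls.txt", sep="\t")
    calls["unit_pos"] = calls["pos"] % UNIT_LEN
    dup_counts = calls.groupby(
        ["read_id", "strand", "unit_pos", "mod_log_prob", "can_log_prob"]).size()
    print(dup_counts.value_counts().sort_index())         # distribution shown in the figures above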

Since I now see that both per read files lead to the same unlikely distribution, my question narrows down to whether keeping only one copy of all these exact duplicates might be the way to go, in your opinion.

Regarding the mappings.summary.txt file you mention, my output does not include anything like that. I guess that appears if one requests mappings as an output, but I am currently requesting only the following: mod_basecalls, per_read_mods, mods, per_read_variants, and variants. Should I be requesting mappings as well?

Thanks so much, Marcus.