Closed biorococo closed 9 months ago
I can't tell anything from the screenshots, what difference are you trying to highlight? If you feel there is a bug I will need test data.
So If you look at the modifications in the reverse strand in the 2.17 they are not showing off..
Attached a BAM for testing (region chr19:56,553,058-56,565,855)
Did you check your log file (folder
For example
WARNING [Jan 31,2024 12:37] [SAMAlignment] 3d2622cf-3c0f-4695-8c5d-735ba9999e1b MM base count validation failed: expected 8794'C's, actual count = 8456
You can turn validation off from the "BaseMods" tab of user preferences, scroll to the bottom.
I will double check the validation code but it is pretty simple and has been thoroughly reviewed. However it is curious that it appears that only minus strand reads of yours are failing.
See https://github.com/igvteam/igv/issues/1435 for more background on this.
Can you tell us what tool made the modified base calls, including version, and any tools that were run downstream of the base calls? I see some of your alignments are hard clipped. This could be one cause of the validation failure if they were hard clipped after the base calls were made. This is exactly the problem referenced in #1435 . The proposal for that problem was to disable base modifications altogether in the presence of hard clips, since we can no longer have confidence that the reported read sequence corresponds to the MM/ML tags.
BTW not all affected alignments are on the reverse strand, and also not all affected alignments are hard clipped.
Hi, thanks for looking into this, I actually missed the warning, I understand now. It is for sure useful to have this type of validation done.
After some debugging on my end I confirmed this issue is fixed in dorado 0.5.2 so it can be closed.
The bug can affect any modbase data that was generated on barcoded runs and that was demuxed with dorado from version 0.4.0 (either using "dorado demux" or "dorado basecaller --kit_name
Hard-clipping seems to be happening just in supplementary alignments from what I can see.
Thanks a lot for helping out with this.
The validation seems correct to me, but it fails often on test data I have (which is rather old). So I'm unsure myself if this should be on by default or not. In your case was this a real problem with the affected alignments? By that I mean is the base mod interpretation on these alignments correct when validation is off? Or put another way, is the validation flagging true problems?
And yes, hard clipping usually happens in supplementary alignments. For those cases the MM and ML tags are going to be incorrect unless they have been modified to account for the hard clipping. IGV has no way to know this, but the validation of # of required bases in the sequence should catch this problem as hard clipping without corrections to MM/ML is going to result in a lower base count than required. That was in fact the primary motivation for the validation, catch uncorrected hard-clipped alignments.
@biorococo Sorry to bother you but it would be very useful to me to know if the alignments that failed the validation were truly off. There is probably no way to know that but what is your sense, having investigated?
To continue, I'm inferring from your mention of adpater trimming and "demuxing" that indeed the MM tags on those alignments could be off due to removed sequence. is that correct? And how does this get fixed with dorado 0.5.2? Does it correct the MM tag if sequence is trimmed?
Hi! sorry for the delayed answer, I've been testing out some things on my end, as you indicated there were older versions of Dorado failing to pass the validation on some of the reads, before any trimming was implemented. I did confirm reads pass IGV validation when using Dorado >v0.5.0, so this is in fact not related to the specific issue solved in v0.5.2. In Dorado v0.5.0 was when MN tag was introduced. So it seems IGV fails validation when this tag is not present in the alignments. Do you think there can be an issue in how the validation is performed in these situations? Hopefully this shed some light on this issue, I will let you know if I find anything else that can be useful to fully understand this.
Also, to comment on the 'hard-clipping on supplementary' topic, looking at the warnings from IGV 2.17, for this specific subset, all reads failing the validation are primary alignments, and only one read have a supplementary alignment. so it does not seem related to any hard-clipping happening in the alignments.
If the MN tag is present its value is compared to the read sequence length. If its not present the validation is as follows, for each modification
(1) count the total number of modified and skipped bases from the MM tag. This is simply the sum of bases skipped + sum of modified bases. So MM:Z:C+m,1,3,0
implies at least 7 'C' bases in the read sequence, 3 modified and 4 skipped
(2) count the number of bases in the read sequence. In the example we would count the number of 'C' bases for a + strand alignment, or 'G' bases for a minus strand.
(3) compare, fails if the count of modified + skipped is > the count of corresponding bases in the read sequence.
@biorococo Do you have any ideas on how the validation above could fail, unless the read sequence was modified after the bases were called?
Hi, thanks a lot for the info, it has been useful.
One point to make is that for step (3) "modified + skipped" can be greater or equal to "the count of corresponding bases in the read sequence".
Taking a closer look at one read I think the IGV validation is not following the expected behaviour for step (2) either and it is actually not counting G's when reads are mapping in the reverse strand.
See warning for read 0e043392-d152-445d-8d76-9e158a4ba17b:
WARNING [Feb 02,2024 16:38] [SAMAlignment] 0e043392-d152-445d-8d76-9e158a4ba17b MM base count validation failed: expected 9074'C's, actual count = 8605
Evaluating MM tag and SEQ field I get that:
>>> print(sum([i + 1 for i in map(int, tag.split(";")[0].split(",")[1:])])) 9074 #minimum number of required Cs >>> seq_field.count("C") 8605 >>> seq_field.count("G") 9074
There are 9074 G's in the SEQ field which in this case is the same as the expected C's in the warning, so this read should pass the validation okey if number of G's are extracted as they should for a read mapping in the reverse strand.
The "actual count" reported in the warning matches the number of C's in the read, so it seems the wrong base is extracted when performing the evaluation.
Hopefully this can be easily confirmed on your side! Let me know if any other info is needed.
I think you are correct!
That's not the explanation for all failures though, consider read 51a5c942-e8c4-44dd-80be-035c01b7d631. The C+m modification requires at least 13876 "C"s. This is aligned to the reverse strand and there are only 7132 "G"s in the read sequence.
I don't understand this comment, how can modified + skipped be > the corresponding base count in a valid alignment?
One point to make is that for step (3) "modified + skipped" can be greater or equal to "the count of corresponding bases in the read sequence".
My comment on this issue and provided tiny_mod_demo_public.zip are irrelevant and broken. I somehow managed to confuse my alignment pipeline to produce invalid cram from valid, unaligned, bam. Nothing related to IGV. Sorry.
Hi!
@jrobinso regarding your last comment:
That's not the explanation for all failures though, consider read 51a5c942-e8c4-44dd-80be-035c01b7d631. The C+m modification requires at least 13876 "C"s. This is aligned to the reverse strand and there are only 7132 "G"s in the read sequence.
That read has a supplementary alignment and I would say the 7132 "G"s is for the supplementary seq field and not the primary?
Taking the primary I get:
>>> seq.count("C") 13877 >>> seq.count("G") 13863
while supplementary seq field is providing: >>> seq.count("C") 6728 >>> seq.count("G") 7134
Can you confirm that on your end?
Ignore the comment about step (3), sorry misunderstood one bit, my bad!
One point to make is that for step (3) "modified + skipped" can be greater or equal to "the count of corresponding bases in the read sequence".
Correct, so that read is correctly marked invalid.
This was in fact the motivation for the validation in the first place. Aligners hard-clip the sequence on supplementary alignments without correcting tags that depend on it, which is a pity but that's how it is. If you want to see base mods on supplementary alignments using a "soft clip" option, if its available, will fix this problem.
The corrected validation will be released soon, probably by the end of the week.
Makes sense, looking forward to try the new version.
Just a quick thing, when turning on and off the validation in preferences, the tracks won't be reloaded automatically! just in case that is not the expected behaviour.
Let me know if anything else is required from my side on this. Thanks!
I might remove that option to turn validation off, now that the bug is fixed I can't think of any reason you would want to turn it off. If the sequence doesn't match the tags you have no way to know visually, but the result is almost certainly wrong.
Fixed validation is released with 2.17.2
I'm still seeing this behavior in 2.17.2 for BAMs with more than one basemod, in this case, 5mC and 6mA. 2-color 5mC below, but same for 2-color 6mA.
https://s3-us-west-1.amazonaws.com/stergachis-manuscript-data/2023/Vollger_et_al_long-read_multi-ome/HG002_WGS.haplotagged.bam
2.16.2:
2.17.2:
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/12715663/ccs MM base count validation failed: expected 8697'T's, actual count = 8439
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/59839891/ccs MM base count validation failed: expected 6516'T's, actual count = 5704
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/36504343/ccs MM base count validation failed: expected 6091'T's, actual count = 5162
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/31790415/ccs MM base count validation failed: expected 5839'T's, actual count = 5075
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/68095404/ccs MM base count validation failed: expected 5435'T's, actual count = 4695
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/151456139/ccs MM base count validation failed: expected 4652'T's, actual count = 3557
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/131204332/ccs MM base count validation failed: expected 4370'T's, actual count = 3885
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/115148945/ccs MM base count validation failed: expected 2006'T's, actual count = 1827
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/198513968/ccs MM base count validation failed: expected 5152'T's, actual count = 4762
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/241899339/ccs MM base count validation failed: expected 4920'T's, actual count = 4500
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/221381867/ccs MM base count validation failed: expected 1849'T's, actual count = 1637
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/120526379/ccs MM base count validation failed: expected 4943'T's, actual count = 4644
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/17303884/ccs MM base count validation failed: expected 7115'T's, actual count = 6807
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/80744695/ccs MM base count validation failed: expected 2173'T's, actual count = 1856
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/220069920/ccs MM base count validation failed: expected 4951'T's, actual count = 4375
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/70784356/ccs MM base count validation failed: expected 5901'T's, actual count = 5380
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/58003053/ccs MM base count validation failed: expected 5047'T's, actual count = 4512
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/92537065/ccs MM base count validation failed: expected 4327'T's, actual count = 3833
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/261883019/ccs MM base count validation failed: expected 3569'T's, actual count = 3433
WARNING [Feb 29,2024 16:28] [SAMAlignment] m84039_230418_220405_s4/30152607/ccs MM base count validation failed: expected 3115'T's, actual count = 3075
WARNING [Feb 29,2024 16:28] [SAMAlignment] MM validation warning count exceeded. Further failures will not be logged.
Disabling the validation works in 2.17.2.
This is not necessarily a bug, the validation is there for a reason. If you feel the validation is incorrect for your BAM please zip and post some sample data and I will look into it. There is really no way to catch invalid tags visually.
This is the slice for the image above. The validation errors are coming from the 6mA calls on the -
strand. Thanks for taking a look. This is my first time working with data with more than one basemod type.
Thanks for the data, I'll have a look.
OK I think I see the problem. This dataset has both A+a and T-a calls, which is something I haven't seen before in any test data. I think the validation is failing for the T- calls. Apologies, this will be fixed soon for 2.17.3, in the meantime keep validation off. Thanks for the report. We'll get this right eventually.
Thanks, Jim and Billy, for checking this. I was worried for a second that I had messed up all the reverse strand calling with fibertools.
@mrvollger No its definitely an IGV problem. Thanks for the report. I'm hampered here with almost no data to test with other than what comes in with git issues, but I do think this will be the last issue with validation.
Thanks for looking into this!
Yep, totally my fault. I've pushed the fix and the example validates completely. My mistake was complementing bases when calls were on the (-) strand, eg. T-a => look for A's. This is wrong, we are still looking for "T" in the "top strand" of the sequence the instrument reports. The "T" becomes "A" in the BAM record if reverse strand flag is set. In any event it looks good now. We'll be releasing 2.17.3 soon.
Your validation efforts are very appreciated. It's hard to encode these and even harder to check if they are right so I appreciate that I can use IGV as a test :)
It is a confusing spec for sure. I think the validation is finally correct in 2.17.3 but do let us know if you encounter other problems.
Same alignment is showing a different modifications
The reads are all primary alignment, so it feels there is bug when interpreting mod tags.
Let me know if you need any other detail for this.