samtools / htslib

C library for high-throughput sequencing data formats
Other
803 stars 446 forks source link

Base mod tweaks to improve validity checking. #1749

Closed jkbonfield closed 7 months ago

jkbonfield commented 7 months ago

We may perhaps want to have a control over whether we hard error or not? If so maybe add some mitigation (nullifying) instead?

daviesrob commented 7 months ago

It would be good to have a few more tests for bad data here. Adding some extra mods beyond the end of the sequence to records in MM-orient.sam results in no detection on the forward strand, but a complaint for the reverse. Is that expected?

Also, while playing with this, I noticed that both pileup_mod and test_mod leak memory on failure. In the latter case it's due to a missing sam_close() in the failure clean-up path. pileup_mod is used in some tests with a non-zero return. Unfortunately it's not possible to see the leak for them in the CI tests because stderr gets redirected by the test harness, and we don't distinguish the leak-sanitizer non-zero return from the one we expect. It is possible to see when running by hand though:

$ ./test/pileup_mod test/base_mods/MM-MNf1.sam 
[E::bam_parse_basemod2] r1: MM/MN data length is incompatible with SEQ length

=================================================================
==20713==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 11536 byte(s) in 1 object(s) allocated from:
    #0 0x7f65233df7f7 in __interceptor_calloc ../../../../gcc-12.1.0/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x405bf4 in pileup_cd_create test/pileup_mod.c:75

SUMMARY: AddressSanitizer: 11536 byte(s) leaked in 1 allocation(s).

Also, while fiddling with this I accidentally ran the wrong program (test_mod) on the test/base_mods/MM-pileup.sam file, and got this:

==19796==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60c00000003f at pc 0x0000004902b8 bp 0x7ffc2c9086f0 sp 0x7ffc2c9086e8
READ of size 1 at 0x60c00000003f thread T0
    #0 0x4902b7 in bam_mods_at_next_pos sam_mods.c:504
    #1 0x492f39 in bam_next_basemod sam_mods.c:611
    #2 0x406488 in main ./test/test_mod.c:187
    #3 0x7f1338598c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
    #4 0x407079 in _start (./test/test_mod+0x407079)

0x60c00000003f is located 1 bytes to the left of 128-byte region [0x60c000000040,0x60c0000000c0)
allocated by thread T0 here:
    #0 0x7f133a609d15 in __interceptor_realloc ../../../../gcc-12.1.0/libsanitizer/asan/asan_malloc_linux.cpp:85
    #1 0x45896c in sam_realloc_bam_data sam.c:446

which suggests a more worrying problem somewhere...

jkbonfield commented 7 months ago

I'll look at the other programs once I've finished my current PR, but in answer to the first question:

It would be good to have a few more tests for bad data here. Adding some extra mods beyond the end of the sequence to records in MM-orient.sam results in no detection on the forward strand, but a complaint for the reverse. Is that expected?

Yes. It's a best efforts thing basically. We can detect errors on the reverse strand because it means parsing all the way through the base modification string and reading it backwards. Hence we detect an overflow case right at the start. We could obviously do that full parse everywhere, but that would slow things down. The strategy I took was report errors as we find them rather than a full validation pass every time.

If you call a bam_next_basemod iterator which is what my new samtools fixmate -M code is doing then you'll trigger the error at the other end too. It's possible the API being used by the test harness is taking a different route somewhere.

jkbonfield commented 7 months ago

I think what I have should now be passing the memory checkers. I still need to think of more examples, but there are 2 dedicated test files already whose purposes was specifically to test MM strings referring to bases off the ends of sequences. Both test_mod and pileup_mod report errors and return with non-zero exit status, confirming the tests work.

What extra tests were you thinking of?

daviesrob commented 7 months ago

The memory checkers are much happier.

For invalid inputs, there's test/base_mods/MM-MNf1.sam and test/base_mods/MM-MNf1.sam, but they both fail with "MM/MN data length is incompatible with SEQ length". I don't think there's anything to test the "MM tag refers to bases beyond sequence length" case you're looking for here (or the equivalent on the reverse strand).

jkbonfield commented 7 months ago

The memory checkers are much happier.

For invalid inputs, there's test/base_mods/MM-MNf1.sam and test/base_mods/MM-MNf1.sam, but they both fail with "MM/MN data length is incompatible with SEQ length". I don't think there's anything to test the "MM tag refers to bases beyond sequence length" case you're looking for here (or the equivalent on the reverse strand).

MM being incompatible with SEQ length does mean MM refers to data beyond the end of the sequence. So it's the same thing.

jkbonfield commented 7 months ago

I added MM overflow errors too in the tests. I also moved the code performing this test, as as reported it didn't trigger on all cases. I think when I wrote it I was maybe testing a base modification right at the end of the sequence and also one beyond the end, but I can't be sure now. This does appear to work though and it still fixes things in mpileup -M.