Closed erin2722 closed 3 months ago
Hi @erin2722,
Is it possible that some other sampled memory is ending up in the address space of the GuardedPageAllocator, and is therefore being validated upon deallocation when it is not intended to?
If that were possible, it would be a serious bug that could lead to arbitrary memory corruption. I don't immediately see how it could happen: GuardedPageAllocator allocates that memory with mmap in its Init method.
GuardedPageAllocator circumvents system-alloc's spinlock, which may be unintentional, but the mmap in system-alloc does not use MAP_FIXED, only MAP_FIXED_NOREPLACE (if available). Without MAP_FIXED, overlapping hints must not lead to overlapping ranges being allocated.
If you can reproduce this at least semi-reliably, I would suggest tracing the mmap calls with strace or printf's to confirm or rule out overlapping ranges.
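For the printf route, a minimal sketch of the kind of check that could be dropped into a test binary is below. The helper names RangesOverlap and HintedMmapOverlapsExisting are hypothetical and not part of tcmalloc; the second mmap passes the first region's address as a plain hint (no MAP_FIXED), which is the case where the kernel is expected to pick a different range.

```cpp
#include <sys/mman.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// True if [a, a + a_len) and [b, b + b_len) share at least one byte.
bool RangesOverlap(std::uintptr_t a, std::size_t a_len,
                   std::uintptr_t b, std::size_t b_len) {
  return a < b + b_len && b < a + a_len;
}

// Hypothetical helper: maps one region, then requests a second region using
// the first region's address as a hint only. Returns true if the two
// mappings overlap, which should not happen without MAP_FIXED.
bool HintedMmapOverlapsExisting() {
  const std::size_t kLen = 1 << 16;
  void* first = mmap(nullptr, kLen, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (first == MAP_FAILED) return false;  // Nothing to compare.

  void* second = mmap(first, kLen, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (second == MAP_FAILED) {
    munmap(first, kLen);
    return false;
  }

  // Log both ranges so they can be compared against strace output.
  std::printf("first:  %p + %zu\nsecond: %p + %zu\n", first, kLen, second,
              kLen);
  const bool overlap =
      RangesOverlap(reinterpret_cast<std::uintptr_t>(first), kLen,
                    reinterpret_cast<std::uintptr_t>(second), kLen);
  munmap(second, kLen);
  munmap(first, kLen);
  return overlap;
}
```

The same RangesOverlap comparison can be applied by hand to the address and length pairs that strace prints for each mmap call.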
Hi @dvyukov ,
Thank you so much for the quick response! I will see what I can do in terms of reproing this, and let you know what I find.
Since you have a core dump, I was curious if anything obvious stood out about the contents of guarded_page_allocator_ and the faulting address.
Without the call to ActivateGuardedSampling, I'd expect the begin/end address ranges of guarded_page_allocator_ to be 0 and PointerIsMine to always fail, but maybe something unusual is happening.
When I last looked at the code I read it as: GuardedPageAllocator::Init mmaps memory and initializes begin/end, and then ActivateGuardedSampling sets the flag to start allocating guarded allocations, and they are separate. If that's the case, we can have begin/end non-0, but no allocations, and PointerIsMine can still return true (due to some corruption presumably).
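To make that reading concrete, here is a toy model of the separation. It is only a sketch under the interpretation above: ToyGuardedAllocator and its members are hypothetical stand-ins, not the real tcmalloc class, and the real Init obtains its range from mmap rather than taking it as arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the allocator state discussed above.
struct ToyGuardedAllocator {
  std::uintptr_t begin = 0;
  std::uintptr_t end = 0;
  bool allow_allocations = false;

  // Models Init: the address range becomes non-zero here.
  void Init(std::uintptr_t base, std::size_t len) {
    begin = base;
    end = base + len;
  }

  // Models ActivateGuardedSampling: only flips the allocation flag.
  void Activate() { allow_allocations = true; }

  // The ownership test looks at the range only, never at the flag.
  bool PointerIsMine(std::uintptr_t p) const { return begin <= p && p < end; }
};
```

In this model, a pointer inside the range passes PointerIsMine as soon as Init has run, whether or not Activate was ever called.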
Yup, that is also my interpretation of the code @dvyukov , which is validated by the state of the GuardedPageAllocator when the crash happens:
(gdb) f 0
#0 tcmalloc::tcmalloc_internal::GuardedPageAllocator::Deallocate (this=0x7fc1873258e0 <tcmalloc::tcmalloc_internal::Static::guardedpage_allocator_>, ptr=ptr@entry=0x438f3fe00000) at src/third_party/tcmalloc/dist/tcmalloc/guarded_page_allocator.cc:223
223 *reinterpret_cast<char*>(ptr) = 'X'; // Trigger SEGV handler.
(gdb) p *this
$1 = {
stacktrace_filter_ = {
stack_hashes_with_count_ = {{
<std::__atomic_base<unsigned long>> = {
_M_i = 0
},
} <repeats 256 times>},
max_slots_used_ = {
<std::__atomic_base<unsigned long>> = {
_M_i = 0
},
},
replacement_inserts_ = {
<std::__atomic_base<unsigned long>> = {
_M_i = 0
},
}
},
guarded_page_lock_ = {
lockword_ = {
<std::__atomic_base<unsigned int>> = {
_M_i = 0
},
}
},
free_pages_ = {true <repeats 128 times>, false <repeats 384 times>},
num_alloced_pages_ = 0,
num_alloced_pages_max_ = 0,
num_successful_allocations_ = {
value_ = {
<std::__atomic_base<long>> = {
_M_i = 0
},
}
},
num_failed_allocations_ = {
value_ = {
<std::__atomic_base<long>> = {
_M_i = 0
},
}
},
data_ = 0x2d563ff86120,
pages_base_addr_ = 0x438f3fc00000,
pages_end_addr_ = 0x438f3fe02000,
first_page_addr_ = 0x438f3fc02000,
max_alloced_pages_ = 64,
total_pages_ = 128,
total_pages_used_ = 0,
alloced_page_count_when_all_used_once_ = 0,
page_size_ = 8192,
rand_ = {
<std::__atomic_base<unsigned long>> = {
_M_i = 140469173639392
},
},
initialized_ = true,
allow_allocations_ = false,
double_free_detected_ = true,
write_overflow_detected_ = false
}
We can see here that although allow_allocations_ is false, and num_successful_allocations_ is 0, the ptr argument is 0x438f3fe00000, which falls within the range of pages_base_addr_ to pages_end_addr_, causing PointerIsMine to succeed and the deallocation to go through validation.
It then detects a double free even though one is not present, because free_pages_ has been filled with true during initialization, and ReserveFreeSlot, which is what updates free_pages_ to have false values for specific slots, will return early because allow_allocations_ is false, and so IsFreed will always return true, causing a false double-free detection.
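The chain of events described here can be reproduced in a small model. The sketch below is hypothetical: ToySlots only borrows the member names from the dump, and the slot search is a simple linear scan rather than whatever tcmalloc actually does.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of the slot bookkeeping described above.
struct ToySlots {
  static constexpr std::size_t kSlots = 128;
  std::array<bool, kSlots> free_pages;
  bool allow_allocations = false;

  // Initialization marks every slot as free.
  ToySlots() { free_pages.fill(true); }

  // Returns a slot index and marks it in use, or -1 if none was reserved.
  // With allocations disabled it returns early and never touches free_pages.
  int ReserveFreeSlot() {
    if (!allow_allocations) return -1;
    for (std::size_t i = 0; i < kSlots; ++i) {
      if (free_pages[i]) {
        free_pages[i] = false;
        return static_cast<int>(i);
      }
    }
    return -1;
  }

  // A deallocation that lands on a slot still marked free looks like a
  // double free.
  bool IsFreed(std::size_t slot) const { return free_pages[slot]; }
};
```

While allow_allocations stays false, every slot reports IsFreed as true, so any pointer that passes the range check would be treated as already freed.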
Just for extra info, I am using the tcmalloc version as of this commit https://github.com/google/tcmalloc/commit/18777b14757feee05771bd299039fa4938259b8f, and the issue started appearing after we upgraded from https://github.com/google/tcmalloc/commit/093ba93c1bd6dca03b0a8334f06d01b019244291.
Hi all! After investigating, I believe that this was an issue with the porting of the tcmalloc build from Bazel into our native build system -- we dropped the linkstatic=1 flags, and I think improper symbol resolution on dynamic builds was leading to this issue. Closing this issue, and thanks for the help!
We may need to reopen this issue -- we ended up tracking down what's going on, and it's not related to linking.
TCMalloc introduced MAP_FIXED_NOREPLACE with this commit, which is broken on Linux kernel versions 4.17 and 4.18, fixed in 4.19.
This is what causes the crash described at the beginning of this issue. In our testing on a machine with kernel version 4.18, this sequence of events can happen:
1. GuardedPageAllocator maps pages, allocating roughly 2MB.
2. [...] SampleifyAllocation, we end up creating a sampled page, which flows through to creating the first mmap region for sampled allocations and allocates 1GB for that region.
3. [...] GuardedPageAllocator, clobbering the GuardedPageAllocator's pages.
4. [...] GuardedPageAllocator believes it owns can be deallocated, tripping the check that the allocation is guarded, and ultimately causing a segfault because the GuardedPageAllocator believes it is seeing a double free.
There are a couple of things we could do here, but I think the least invasive change would be to add another check to MapFixedNoReplaceFlagAvailable() that tests whether the currently running kernel version is susceptible to the MAP_FIXED_NOREPLACE bug.
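A rough sketch of what such a kernel version check might look like is below. This is not the change that was merged into tcmalloc; the function names are hypothetical, and it assumes the affected range is exactly 4.17 and 4.18 as stated above. Distribution kernels may carry backported fixes, so a check based on the version string alone is approximate.

```cpp
#include <sys/utsname.h>

#include <cassert>
#include <cstdio>

// Assumption from this thread: 4.17 and 4.18 are affected, 4.19 is fixed.
bool KernelHasBrokenMapFixedNoReplace(int major, int minor) {
  return major == 4 && (minor == 17 || minor == 18);
}

// Extracts "major.minor" from a release string such as "4.18.0-305.el8".
bool ParseKernelRelease(const char* release, int* major, int* minor) {
  return std::sscanf(release, "%d.%d", major, minor) == 2;
}

// Hypothetical entry point: true if the running kernel reports an affected
// version. Returns false when the version cannot be determined.
bool RunningKernelIsSusceptible() {
  struct utsname buf;
  if (uname(&buf) != 0) return false;
  int major = 0;
  int minor = 0;
  if (!ParseKernelRelease(buf.release, &major, &minor)) return false;
  return KernelHasBrokenMapFixedNoReplace(major, minor);
}
```

A caller could treat the flag as unavailable whenever RunningKernelIsSusceptible() returns true and fall back to a plain address hint.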
Quick turnaround! Before I could even put up a PR myself. Cheers!
On Tue, May 21, 2024 at 5:07 PM copybara-service[bot] < @.***> wrote:
Closed #229 https://github.com/google/tcmalloc/issues/229 as completed via 4674cfc https://github.com/google/tcmalloc/commit/4674cfcd0860026db7daa7dfb5fd3ee842c845cc .
When running tests with tcmalloc, I have occasionally seen the program crash with the following lines appearing in the backtrace:
When examining the program with gdb, I can see that the program is crashing because it detected a memory error in the GuardedPageAllocator -- however, the application has not called ActivateGuardedSampling, and so this is unexpected behavior.
From examining the tcmalloc code, I see that ActivateGuardedSampling flips a setting that allows the GuardedPageAllocator to allocate bytes within its defined address space. However, on deallocation, that setting is not checked, and tcmalloc simply checks whether the deallocated pointer is within its address space, and then goes on with the memory checks (and possible crashes) if that is true. Is it possible that some other sampled memory is ending up in the address space of the GuardedPageAllocator, and is therefore being validated upon deallocation when it is not intended to?
Is it a bug on tcmalloc's end that it is crashing on deallocations like this? Or is there anything else that can explain this behavior?