Closed khatharsis42 closed 1 year ago
I think I found the cause of the problem:
struct htf_thread
contains a struct htf_sequence *sequences;
array that may be realloc
ed (see the call to DOUBLE_MEMORY_SPACE
in _htf_get_sequence_id_from_array
). It means that the address of a sequence may change.
The problem is that struct htf_thread_writer
refers to the ongoing sequences with struct htf_sequence **og_seq;
-> when the sequences array is realloced, some og_seq
pointers become invalid.
A solution could be to:
struct htf_sequence **sequences;
in the struct htf_thread
Fixed in commit 813900f2c5ad77f83164eb17d9919512db53449e :ok_hand:
Steps to reproduce
Error is present as of 786eb408558741abca3ebaedf9e949bb88d06213, and quickfix was implemented in 262c6fec765b7e03efff7f8f90cd367dca3dc169
Launch the AMG Benchmark w/ at least 4 MPI Processes, and OMP_NUM_THREADS=1:
OMP_NUM_THREADS=1 mpirun -np 4 eztrace -t mpi ./test/amg -n 100 100 100
Expected result
On 786eb408558741abca3ebaedf9e949bb88d06213, you will get a failed
realloc
because of an invalid pointer.After 262c6fec765b7e03efff7f8f90cd367dca3dc169, you will get a warning that an event wasn't correctly initiated. Sometimes, the error
corrupted double-linked list
will appear.Cause of the issue
It appears some of the
sequence
structs are either not correctly initialized, or a heap overflow causes garbage to we written over them. It appears it is always the first sequence of a thread writer that is corrupted that way (thread_writer->og_seq[0]
).It is easily detectable since
sequence.allocated
is always a power of 2, which is how the quickfix in 262c6fec765b7e03efff7f8f90cd367dca3dc169 was implemented (see FIXME in_htf_get_cur_sequence
inhtf_write.c
)A better fix was implemented in 0db248d6bde8bc61fe9bab4ad0f54ecb3f7bb4c2, where the
event
struct was given auint32 initialized
value, which is set to 1 in the_event_init
macro.This has only been spotted with the AMG benchmark so far.