trahay / Hierarchical-Trace-Format

BSD 3-Clause "New" or "Revised" License
1 stars 0 forks source link

Memory errors with the AMG Benchmark #1

Closed khatharsis42 closed 1 year ago

khatharsis42 commented 1 year ago

Steps to reproduce

Error is present as of 786eb408558741abca3ebaedf9e949bb88d06213, and quickfix was implemented in 262c6fec765b7e03efff7f8f90cd367dca3dc169

Launch the AMG Benchmark w/ at least 4 MPI Processes, and OMP_NUM_THREADS=1: OMP_NUM_THREADS=1 mpirun -np 4 eztrace -t mpi ./test/amg -n 100 100 100

Expected result

On 786eb408558741abca3ebaedf9e949bb88d06213, you will get a failed realloc because of an invalid pointer.

After 262c6fec765b7e03efff7f8f90cd367dca3dc169, you will get a warning that an event wasn't correctly initiated. Sometimes, the error corrupted double-linked list will appear.

Cause of the issue

It appears some of the sequence structs are either not correctly initialized, or a heap overflow causes garbage to we written over them. It appears it is always the first sequence of a thread writer that is corrupted that way (thread_writer->og_seq[0]).

It is easily detectable since sequence.allocated is always a power of 2, which is how the quickfix in 262c6fec765b7e03efff7f8f90cd367dca3dc169 was implemented (see FIXME in _htf_get_cur_sequence in htf_write.c)

A better fix was implemented in 0db248d6bde8bc61fe9bab4ad0f54ecb3f7bb4c2, where the event struct was given a uint32 initialized value, which is set to 1 in the _event_init macro.

This has only been spotted with the AMG benchmark so far.

trahay commented 1 year ago

I think I found the cause of the problem: struct htf_thread contains a struct htf_sequence *sequences; array that may be realloced (see the call to DOUBLE_MEMORY_SPACE in _htf_get_sequence_id_from_array). It means that the address of a sequence may change.

The problem is that struct htf_thread_writer refers to the ongoing sequences with struct htf_sequence **og_seq; -> when the sequences array is realloced, some og_seq pointers become invalid.

A solution could be to:

khatharsis42 commented 1 year ago

Fixed in commit 813900f2c5ad77f83164eb17d9919512db53449e :ok_hand: