ornladios / ADIOS

The old ADIOS 1.x code repository. Look for ADIOS2 for new repo
https://csmd.ornl.gov/adios
Other
54 stars 40 forks source link

Posix write of uninitialised bytes #184

Open ax3l opened 6 years ago

ax3l commented 6 years ago

ADIOS 1.13.1 is adding noise to valgrind debugging with uninitialised byte(s) in the syscall to write(buf) in adios_posix_write_pg1 (adios_posix.c:733).

The issue is triggered when previously a adios_databuffer_resize (buffer.c:68) is called - I guess a few unimportant but still uninitialized bytes slip into the final write call from it.

Detailed valgrind warning for ADIOS 1.13.1 with _nompi library write API:

==22326== Syscall param write(buf) points to uninitialised byte(s)
==22326==    at 0x5B5C190: __write_nocancel (syscall-template.S:84)
==22326==    by 0x2F0F6A: adios_posix_write_pg (adios_posix.c:733)
==22326==    by 0x2F166B: adios_posix_close (adios_posix.c:945)
==22326==    by 0x2993F6: common_adios_close (common_adios.c:1264)
==22326==    by 0x2951BA: adios_close (adios.c:210)

==22326==  Address 0x72323be is 894 bytes inside a block of size 16,777,223 alloc'd
==22326==    at 0x4C2BADF: malloc (vg_replace_malloc.c:298)
==22326==    by 0x4C2DE5F: realloc (vg_replace_malloc.c:785)
==22326==    by 0x2FC133: adios_databuffer_resize (buffer.c:68)
==22326==    by 0x297060: common_adios_open (common_adios.c:417)
==22326==    by 0x294E94: adios_open (adios.c:90)

@pnorbert do you have an idea what's uninitialised here? Maybe you can limit the write() to the initialized bytes or simply zero-init the reallocated buffer's hanging bytes? Even if it might not be a bug, it makes downstream analysis of memory leaks very noisy as soon as we use ADIOS.

ax3l commented 5 years ago

ping @pnorbert do you have an idea what's uninitialised here?

pnorbert commented 5 years ago

Sorry I dropped the ball on this. Can you give me an example code? I can't recreate this valgrind error with a simple C example writing variables and attributes.

On Wed, Aug 8, 2018 at 5:01 AM, Axel Huebl notifications@github.com wrote:

ping @pnorbert https://github.com/pnorbert do you have an idea what's uninitialised here?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ornladios/ADIOS/issues/184#issuecomment-411337711, or mute the thread https://github.com/notifications/unsubscribe-auth/ADGMLTxLibRMf17dKioUn08KwmMqrqukks5uOqjvgaJpZM4UqxyQ .

ax3l commented 5 years ago

If you have CMake 3.10.0+ and ADIOS 1.13.1 in your CMAKE_PREFIX_PATH and adios_config in your PATH, you can copy paste these lines:

# sources
git clone https://github.com/openPMD/openPMD-api.git

mkdir -p openPMD-api-build
cd openPMD-api-build

# configure & build
cmake ../openPMD-api -DopenPMD_USE_HDF5=OFF -DopenPMD_USE_PYTHON=OFF
cmake --build .

# find uninitialised bytes
valgrind bin/SerialIOTests

Corresponding unit test file: https://github.com/openPMD/openPMD-api/blob/c0ccefb25503b49a10938e2096098c6e3ea5c13b/test/SerialIOTest.cpp

Full output:

$ valgrind bin/SerialIOTests                                                                                                          
==30204== Memcheck, a memory error detector
==30204== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==30204== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==30204== Command: bin/SerialIOTests
==30204== 
==30204== Syscall param write(buf) points to uninitialised byte(s)
==30204==    at 0x61DD190: __write_nocancel (syscall-template.S:84)
==30204==    by 0x4EB602F: adios_posix_write_pg (adios_posix.c:733)
==30204==    by 0x4EB6730: adios_posix_close (adios_posix.c:945)
==30204==    by 0x4E5E337: common_adios_close (common_adios.c:1264)
==30204==    by 0x4E5A0FB: adios_close (adios.c:210)
==30204==    by 0x4E4BC53: openPMD::ADIOS1IOHandlerImpl::~ADIOS1IOHandlerImpl() (in /home/axel/build/openPMD-api-build/lib/libopenPMD.ADIOS1.Serial.so)
==30204==    by 0x4E4C29E: openPMD::ADIOS1IOHandler::~ADIOS1IOHandler() (in /home/axel/build/openPMD-api-build/lib/libopenPMD.ADIOS1.Serial.so)
==30204==    by 0x1DEB58: openPMD::Writable::~Writable() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x176F05: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x142F21: ____C_A_T_C_H____T_E_S_T____2() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x130DF5: Catch::RunContext::invokeActiveTestCase() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x160867: Catch::(anonymous namespace)::runTests(std::shared_ptr<Catch::Config> const&) (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==  Address 0x83c13be is 894 bytes inside a block of size 16,777,223 alloc'd
==30204==    at 0x4C2BADF: malloc (vg_replace_malloc.c:298)
==30204==    by 0x4C2DE5F: realloc (vg_replace_malloc.c:785)
==30204==    by 0x4EC11F8: adios_databuffer_resize (buffer.c:68)
==30204==    by 0x4E5BFA1: common_adios_open (common_adios.c:417)
==30204==    by 0x4E59DD5: adios_open (adios.c:90)
==30204==    by 0x4E4BBE7: openPMD::ADIOS1IOHandlerImpl::~ADIOS1IOHandlerImpl() (in /home/axel/build/openPMD-api-build/lib/libopenPMD.ADIOS1.Serial.so)
==30204==    by 0x4E4C29E: openPMD::ADIOS1IOHandler::~ADIOS1IOHandler() (in /home/axel/build/openPMD-api-build/lib/libopenPMD.ADIOS1.Serial.so)
==30204==    by 0x1DEB58: openPMD::Writable::~Writable() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x176F05: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x142F21: ____C_A_T_C_H____T_E_S_T____2() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x130DF5: Catch::RunContext::invokeActiveTestCase() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x160867: Catch::(anonymous namespace)::runTests(std::shared_ptr<Catch::Config> const&) (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204== 
==30204== Syscall param write(buf) points to uninitialised byte(s)
==30204==    at 0x61DD190: __write_nocancel (syscall-template.S:84)
==30204==    by 0x4EB6177: adios_posix_write_index (adios_posix.c:776)
==30204==    by 0x4EB6EDB: adios_posix_close (adios_posix.c:1087)
==30204==    by 0x4E5E337: common_adios_close (common_adios.c:1264)
==30204==    by 0x4E5A0FB: adios_close (adios.c:210)
==30204==    by 0x4E4BC53: openPMD::ADIOS1IOHandlerImpl::~ADIOS1IOHandlerImpl() (in /home/axel/build/openPMD-api-build/lib/libopenPMD.ADIOS1.Serial.so)
==30204==    by 0x4E4C29E: openPMD::ADIOS1IOHandler::~ADIOS1IOHandler() (in /home/axel/build/openPMD-api-build/lib/libopenPMD.ADIOS1.Serial.so)
==30204==    by 0x1DEB58: openPMD::Writable::~Writable() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x176F05: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x142F21: ____C_A_T_C_H____T_E_S_T____2() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x130DF5: Catch::RunContext::invokeActiveTestCase() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x160867: Catch::(anonymous namespace)::runTests(std::shared_ptr<Catch::Config> const&) (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==  Address 0x807e4f1 is 1,185 bytes inside a block of size 1,000,020 alloc'd
==30204==    at 0x4C2BADF: malloc (vg_replace_malloc.c:298)
==30204==    by 0x4C2DE5F: realloc (vg_replace_malloc.c:785)
==30204==    by 0x4E62A75: buffer_write (adios_internals.c:2124)
==30204==    by 0x4E671B3: adios_write_index_v1 (adios_internals.c:4086)
==30204==    by 0x4EB67D7: adios_posix_close (adios_posix.c:960)
==30204==    by 0x4E5E337: common_adios_close (common_adios.c:1264)
==30204==    by 0x4E5A0FB: adios_close (adios.c:210)
==30204==    by 0x4E4BC53: openPMD::ADIOS1IOHandlerImpl::~ADIOS1IOHandlerImpl() (in /home/axel/build/openPMD-api-build/lib/libopenPMD.ADIOS1.Serial.so)
==30204==    by 0x4E4C29E: openPMD::ADIOS1IOHandler::~ADIOS1IOHandler() (in /home/axel/build/openPMD-api-build/lib/libopenPMD.ADIOS1.Serial.so)
==30204==    by 0x1DEB58: openPMD::Writable::~Writable() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x176F05: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204==    by 0x142F21: ____C_A_T_C_H____T_E_S_T____2() (in /home/axel/build/openPMD-api-build/bin/SerialIOTests)
==30204== 
HZDR sample not accessible. (Supplied directory is not valid: ../samples/hzdr-sample/bp/)
===============================================================================
All tests passed (26 assertions in 5 test cases)

==30204== 
==30204== HEAP SUMMARY:
==30204==     in use at exit: 2,641,333 bytes in 26,714 blocks
==30204==   total heap usage: 77,532 allocs, 50,818 frees, 93,048,116 bytes allocated
==30204== 
==30204== LEAK SUMMARY:
==30204==    definitely lost: 3,480 bytes in 29 blocks
==30204==    indirectly lost: 2,636,176 bytes in 26,682 blocks
==30204==      possibly lost: 0 bytes in 0 blocks
==30204==    still reachable: 1,677 bytes in 3 blocks
==30204==         suppressed: 0 bytes in 0 blocks
==30204== Rerun with --leak-check=full to see details of leaked memory
==30204== 
==30204== For counts of detected and suppressed errors, rerun with: -v
==30204== Use --track-origins=yes to see where uninitialised values come from
==30204== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
ax3l commented 5 years ago

@pnorbert I guess the variables for your test need to be large enough to trigger the ADIOS realloc. So > 16MB if I remember the defaults right ;-)

pnorbert commented 5 years ago

That's not it. My example works fine with 80MB variable. I can see the valgrind error in the SerialIOTests. How can I figure which actual test is causing this error? Where are the output files from this test? I need to see what is so special about your test case.

On Wed, Aug 8, 2018 at 11:04 AM, Axel Huebl notifications@github.com wrote:

@pnorbert https://github.com/pnorbert I guess the variables for your test need to be large enough, to trigger the ADIOS realloc. So > 16MB if I remember the defaults right ;-)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ornladios/ADIOS/issues/184#issuecomment-411438739, or mute the thread https://github.com/notifications/unsubscribe-auth/ADGMLUoIvRmpn2BWi3y-apKC4dyEL0d7ks5uOv4BgaJpZM4UqxyQ .

pnorbert commented 5 years ago

Never mind, I found them. The error is related to long double attributes, not to buffer size.

On Wed, Aug 8, 2018 at 5:41 PM, Norbert Podhorszki < norbert.podhorszki@gmail.com> wrote:

That's not it. My example works fine with 80MB variable. I can see the valgrind error in the SerialIOTests. How can I figure which actual test is causing this error? Where are the output files from this test? I need to see what is so special about your test case.

On Wed, Aug 8, 2018 at 11:04 AM, Axel Huebl notifications@github.com wrote:

@pnorbert https://github.com/pnorbert I guess the variables for your test need to be large enough, to trigger the ADIOS realloc. So > 16MB if I remember the defaults right ;-)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ornladios/ADIOS/issues/184#issuecomment-411438739, or mute the thread https://github.com/notifications/unsubscribe-auth/ADGMLUoIvRmpn2BWi3y-apKC4dyEL0d7ks5uOv4BgaJpZM4UqxyQ .

ax3l commented 5 years ago

Thanks, great news! Just ping me if I shall check again with updates in master.

pnorbert commented 5 years ago

This simple example exhibits the valgrind problem with long doubles.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main (int argc, char ** argv) 
{
    char        filename[] = "test_longdouble_valgrind.data";
    long double ld1 = 1.2345e+80;

    long double *bufm = (long double *) malloc (sizeof(long double));
    long double *bufc = (long double *) calloc (1, sizeof(long double));

    memcpy (bufm, &ld1, sizeof(long double));
    memcpy (bufc, &ld1, sizeof(long double));

    int fd = creat(filename, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
    write(fd, bufm, sizeof(long double));
    write(fd, bufc, sizeof(long double));
    close(fd);
    free(bufm);
    free(bufc);
    return 0;
}

Valgrind complains about both the malloc'd and calloc'd buffers.

$ valgrind ./test_longdouble_valgrind 
==15574== Memcheck, a memory error detector
==15574== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==15574== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==15574== Command: ./test_longdouble_valgrind
==15574== 
==15574== Syscall param write(buf) points to uninitialised byte(s)
==15574==    at 0x4F312C0: __write_nocancel (syscall-template.S:84)
==15574==    by 0x40083F: main (in /home/adios/work/test/other_tests/test_longdouble_valgrind)
==15574==  Address 0x520404a is 10 bytes inside a block of size 16 alloc'd
==15574==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15574==    by 0x4007D8: main (in /home/adios/work/test/other_tests/test_longdouble_valgrind)
==15574== 
==15574== Syscall param write(buf) points to uninitialised byte(s)
==15574==    at 0x4F312C0: __write_nocancel (syscall-template.S:84)
==15574==    by 0x400855: main (in /home/adios/work/test/other_tests/test_longdouble_valgrind)
==15574==  Address 0x520409a is 10 bytes inside a block of size 16 alloc'd
==15574==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15574==    by 0x4007EB: main (in /home/adios/work/test/other_tests/test_longdouble_valgrind)
ax3l commented 5 years ago

Do you know how to fix it?

ax3l commented 5 years ago

Oh dang, it's a stdlib or valgrind issue!

ax3l commented 5 years ago

Hm, maybe that's relevant: http://valgrind.org/docs/manual/manual-core.html

Precision: There is no support for 80 bit arithmetic. Internally, Valgrind represents all such "long double" numbers in 64 bits, and so there may be some differences in results. Whether or not this is critical remains to be seen. Note, the x86/amd64 fldt/fstpt instructions (read/write 80-bit numbers) are correctly simulated, using conversions to/from 64 bits, so that in-memory images of 80-bit numbers look correct if anyone wants to see.

10 bytes inside a block of size 16 alloc'd

pnorbert commented 5 years ago

This is a valgrind issue. Filling the buffers with some value (255) before the memcpy does not help.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main (int argc, char ** argv)
{
    char        filename[] = "test_longdouble_valgrind.data";
    long double ld1 = 1.2345e+80;

    long double *bufm = (long double *) malloc (sizeof(long double));
    long double *bufc = (long double *) calloc (1, sizeof(long double));

    memset (bufm, 255, sizeof(long double));
    memset (bufc, 255, sizeof(long double));

    printf("bufm[15]=%3.3hhu\n", ((char*)bufm)[15]);

    memcpy (bufm, &ld1, sizeof(long double));
    memcpy (bufc, &ld1, sizeof(long double));

    printf("bufm[15]=%3.3hhu\n", ((char*)bufm)[15]);

    int fd = creat(filename, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
    write(fd, bufm, sizeof(long double));
    write(fd, bufc, sizeof(long double));
    close(fd);

    free(bufm);
    free(bufc);

    return 0;
}

The file content shows that all 16 bytes of the buffer has been filled. The last 6 bytes are zeros.

$ ./test_longdouble_valgrind 
bufm[15]=255
bufm[15]=000
$ od -bv test_longdouble_valgrind.data 
0000000 000 350 363 217 374 121 104 205 011 101 100 000 000 000 000 000
0000020 000 350 363 217 374 121 104 205 011 101 100 000 000 000 000 000
ax3l commented 5 years ago

Ok, I'll report that upstream with your example.

ax3l commented 5 years ago

I reported your example upstream in https://bugs.kde.org/show_bug.cgi?id=397313