vincentlaucsb / csv-parser

A high-performance, fully-featured CSV parser and serializer for modern C++.
MIT License

AddressSanitizer: heap-use-after-free on large dataset #217

Closed JonasKellerer closed 3 weeks ago

JonasKellerer commented 11 months ago

We tried to use csv-parser on a large dataset (8 million lines, 9.9 GB). However, when we loop over all rows and execute row[column_name].get<std::string>() on each one, we get the following error message:

` ==245==ERROR: AddressSanitizer: heap-use-after-free on address 0x621003c37248 at pc 0x56492659e7ee bp 0x7ffe476e2f20 sp 0x7ffe476e2f10
READ of size 8 at 0x621003c37248 thread T0
#0 0x56492659e7ed in csv::internals::CSVFieldList::operator[](unsigned long) const /mwe/includes/csv_reader.h:7635
#1 0x56492659f298 in csv::CSVRow::get_field(unsigned long) const /mwe/includes/csv_reader.h:7694
#2 0x56492659ea9d in csv::CSVRow::operator[](unsigned long) const /mwe/includes/csv_reader.h:7656
#3 0x56492659ebea in csv::CSVRow::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const /mwe/includes/csv_reader.h:7672
#4 0x5649265927c2 in getColumn(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /mwe/src/main.cpp:27
#5 0x564926592eb0 in main /mwe/src/main.cpp:36
#6 0x7f66a36d0d8f  (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f)
#7 0x7f66a36d0e3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f)
#8 0x564926591dc4 in _start (/mwe/build/csvMWE+0x7dc4)

0x621003c37248 is located 328 bytes inside of 4096-byte region [0x621003c37100,0x621003c38100)
freed by thread T107 here:
#0 0x7f66a3cb722f in operator delete(void*, unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:172
#1 0x5649265be954 in __gnu_cxx::new_allocator<csv::internals::RawCSVField*>::deallocate(csv::internals::RawCSVField**, unsigned long) /usr/include/c++/11/ext/new_allocator.h:145
#2 0x5649265b31d6 in std::allocator<csv::internals::RawCSVField*>::deallocate(csv::internals::RawCSVField**, unsigned long) /usr/include/c++/11/bits/allocator.h:199
#3 0x5649265b31d6 in std::allocator_traits<std::allocator<csv::internals::RawCSVField*> >::deallocate(std::allocator<csv::internals::RawCSVField*>&, csv::internals::RawCSVField**, unsigned long) /usr/include/c++/11/bits/alloc_traits.h:496
#4 0x5649265aa73f in std::_Vector_base<csv::internals::RawCSVField*, std::allocator<csv::internals::RawCSVField*> >::_M_deallocate(csv::internals::RawCSVField**, unsigned long) /usr/include/c++/11/bits/stl_vector.h:354
#5 0x5649265b0692 in void std::vector<csv::internals::RawCSVField*, std::allocator<csv::internals::RawCSVField*> >::_M_realloc_insert<csv::internals::RawCSVField* const&>(__gnu_cxx::__normal_iterator<csv::internals::RawCSVField**, std::vector<csv::internals::RawCSVField*, std::allocator<csv::internals::RawCSVField*> > >, csv::internals::RawCSVField* const&) /usr/include/c++/11/bits/vector.tcc:500
#6 0x5649265a6ef2 in std::vector<csv::internals::RawCSVField*, std::allocator<csv::internals::RawCSVField*> >::push_back(csv::internals::RawCSVField* const&) /usr/include/c++/11/bits/stl_vector.h:1198
#7 0x56492659e97c in csv::internals::CSVFieldList::allocate() /mwe/includes/csv_reader.h:7640
#8 0x5649265a3b65 in void csv::internals::CSVFieldList::emplace_back<unsigned int, unsigned long&>(unsigned int&&, unsigned long&) /mwe/includes/csv_reader.h:5478
#9 0x564926598700 in csv::internals::IBasicCSVParser::push_field() /mwe/includes/csv_reader.h:6972
#10 0x564926598c01 in csv::internals::IBasicCSVParser::parse() /mwe/includes/csv_reader.h:6999
#11 0x5649265c8b49 in csv::internals::StreamParser<std::basic_ifstream<char, std::char_traits<char> > >::next(unsigned long) /mwe/includes/csv_reader.h:6175
#12 0x56492659ceb9 in csv::CSVReader::read_csv(unsigned long) /mwe/includes/csv_reader.h:7496
#13 0x5649265c9335 in bool std::__invoke_impl<bool, bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long>(std::__invoke_memfun_deref, bool (csv::CSVReader::*&&)(unsigned long), csv::CSVReader*&&, unsigned long&&) /usr/include/c++/11/bits/invoke.h:74
#14 0x5649265c913e in std::__invoke_result<bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long>::type std::__invoke<bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long>(bool (csv::CSVReader::*&&)(unsigned long), csv::CSVReader*&&, unsigned long&&) /usr/include/c++/11/bits/invoke.h:96
#15 0x5649265c905e in bool std::thread::_Invoker<std::tuple<bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long> >::_M_invoke<0ul, 1ul, 2ul>(std::_Index_tuple<0ul, 1ul, 2ul>) /usr/include/c++/11/bits/std_thread.h:253
#16 0x5649265c8ec1 in std::thread::_Invoker<std::tuple<bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long> >::operator()() /usr/include/c++/11/bits/std_thread.h:260
#17 0x5649265c8dbd in std::thread::_State_impl<std::thread::_Invoker<std::tuple<bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long> > >::_M_run() /usr/include/c++/11/bits/std_thread.h:211
#18 0x7f66a3ab22b2  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xdc2b2)

previously allocated by thread T107 here:
#0 0x7f66a3cb61c7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
#1 0x5649265c49fd in __gnu_cxx::new_allocator<csv::internals::RawCSVField*>::allocate(unsigned long, void const*) /usr/include/c++/11/ext/new_allocator.h:127
#2 0x5649265bd7a6 in std::allocator<csv::internals::RawCSVField*>::allocate(unsigned long) /usr/include/c++/11/bits/allocator.h:185
#3 0x5649265bd7a6 in std::allocator_traits<std::allocator<csv::internals::RawCSVField*> >::allocate(std::allocator<csv::internals::RawCSVField*>&, unsigned long) /usr/include/c++/11/bits/alloc_traits.h:464
#4 0x5649265b779f in std::_Vector_base<csv::internals::RawCSVField*, std::allocator<csv::internals::RawCSVField*> >::_M_allocate(unsigned long) /usr/include/c++/11/bits/stl_vector.h:346
#5 0x5649265b0514 in void std::vector<csv::internals::RawCSVField*, std::allocator<csv::internals::RawCSVField*> >::_M_realloc_insert<csv::internals::RawCSVField* const&>(__gnu_cxx::__normal_iterator<csv::internals::RawCSVField**, std::vector<csv::internals::RawCSVField*, std::allocator<csv::internals::RawCSVField*> > >, csv::internals::RawCSVField* const&) /usr/include/c++/11/bits/vector.tcc:440
#6 0x5649265a6ef2 in std::vector<csv::internals::RawCSVField*, std::allocator<csv::internals::RawCSVField*> >::push_back(csv::internals::RawCSVField* const&) /usr/include/c++/11/bits/stl_vector.h:1198
#7 0x56492659e97c in csv::internals::CSVFieldList::allocate() /mwe/includes/csv_reader.h:7640
#8 0x5649265a3b65 in void csv::internals::CSVFieldList::emplace_back<unsigned int, unsigned long&>(unsigned int&&, unsigned long&) /mwe/includes/csv_reader.h:5478
#9 0x564926598700 in csv::internals::IBasicCSVParser::push_field() /mwe/includes/csv_reader.h:6972
#10 0x564926598c01 in csv::internals::IBasicCSVParser::parse() /mwe/includes/csv_reader.h:6999
#11 0x5649265c8b49 in csv::internals::StreamParser<std::basic_ifstream<char, std::char_traits<char> > >::next(unsigned long) /mwe/includes/csv_reader.h:6175
#12 0x56492659ceb9 in csv::CSVReader::read_csv(unsigned long) /mwe/includes/csv_reader.h:7496
#13 0x5649265c9335 in bool std::__invoke_impl<bool, bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long>(std::__invoke_memfun_deref, bool (csv::CSVReader::*&&)(unsigned long), csv::CSVReader*&&, unsigned long&&) /usr/include/c++/11/bits/invoke.h:74
#14 0x5649265c913e in std::__invoke_result<bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long>::type std::__invoke<bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long>(bool (csv::CSVReader::*&&)(unsigned long), csv::CSVReader*&&, unsigned long&&) /usr/include/c++/11/bits/invoke.h:96
#15 0x5649265c905e in bool std::thread::_Invoker<std::tuple<bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long> >::_M_invoke<0ul, 1ul, 2ul>(std::_Index_tuple<0ul, 1ul, 2ul>) /usr/include/c++/11/bits/std_thread.h:253
#16 0x5649265c8ec1 in std::thread::_Invoker<std::tuple<bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long> >::operator()() /usr/include/c++/11/bits/std_thread.h:260
#17 0x5649265c8dbd in std::thread::_State_impl<std::thread::_Invoker<std::tuple<bool (csv::CSVReader::*)(unsigned long), csv::CSVReader*, unsigned long> > >::_M_run() /usr/include/c++/11/bits/std_thread.h:211
#18 0x7f66a3ab22b2  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xdc2b2)

Thread T107 created by T0 here:
#0 0x7f66a3c58685 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:216
#1 0x7f66a3ab2388 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xdc388)
#2 0x56492659d23d in csv::CSVReader::read_row(csv::CSVRow&) /mwe/includes/csv_reader.h:7536
#3 0x56492659e70a in csv::CSVReader::iterator::operator++() /mwe/includes/csv_reader.h:7605
#4 0x5649265928ad in getColumn(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /mwe/src/main.cpp:25
#5 0x564926592eb0 in main /mwe/src/main.cpp:36
#6 0x7f66a36d0d8f  (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f)

SUMMARY: AddressSanitizer: heap-use-after-free /mwe/includes/csv_reader.h:7635 in csv::internals::CSVFieldList::operator[](unsigned long) const
Shadow bytes around the buggy address:
  0x0c428077edf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c428077ee00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c428077ee10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c428077ee20: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c428077ee30: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
=>0x0c428077ee40: fd fd fd fd fd fd fd fd fd[fd]fd fd fd fd fd fd
  0x0c428077ee50: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c428077ee60: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c428077ee70: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c428077ee80: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c428077ee90: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==245==ABORTING `

The error no longer occurs when we add std::this_thread::sleep_for(std::chrono::nanoseconds(1)); to the same loop.

For reproducibility, I have put an MWE here: https://drive.google.com/file/d/1M_PJLlhxs8JTmIGEcDNCBAeBqxqmdNBC/view?usp=drive_link

Extract it and run docker build . --tag=mwe, then docker run -it mwe, and inside the container run ./runAndBuild.sh.
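
For readers following the trace: worker thread T107 frees a 4096-byte buffer when std::vector<RawCSVField*>::push_back reallocates inside CSVFieldList::allocate(), while the main thread T0 is reading through CSVFieldList::operator[]. Below is a minimal, self-contained sketch of that producer/reader pattern with one possible guard (a mutex around both the append and the read). The class GuardedFieldList and the function read_while_filling are hypothetical names made up for this illustration; this is not csv-parser's code and not necessarily how the library addresses the problem.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a field list that a worker thread fills while
// another thread reads from it. Not csv-parser's actual class.
class GuardedFieldList {
public:
    // Called by the producer. push_back may reallocate and free the old
    // buffer, so it must never overlap with a read.
    void push_back(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        fields_.push_back(value);
    }

    // Called by the reader. Copies the element out while holding the lock,
    // so no reference into the (possibly soon-to-be-freed) buffer escapes.
    bool try_get(std::size_t index, int& out) const {
        std::lock_guard<std::mutex> lock(mutex_);
        if (index >= fields_.size()) return false;
        out = fields_[index];
        return true;
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> fields_;
};

// Fills the list on a worker thread while the calling thread reads it,
// and returns the sum of every element the reader consumed.
long long read_while_filling(int count) {
    GuardedFieldList list;
    std::thread producer([&list, count] {
        for (int i = 0; i < count; ++i) list.push_back(i);
    });

    long long sum = 0;
    std::size_t next = 0;
    while (next < static_cast<std::size_t>(count)) {
        int value = 0;
        if (list.try_get(next, value)) {
            sum += value;
            ++next;
        } else {
            std::this_thread::yield();  // nothing new yet; let the producer run
        }
    }
    producer.join();
    return sum;
}
```

Calling read_while_filling(100000) exercises many reallocations while the reader is active. Without the lock, the same access pattern is a data race on the vector's buffer, which matches the "freed by thread T107" / "READ ... thread T0" pairing in the report. It would also be consistent with the observation that a short sleep in the reading loop hides the error by changing the timing, although the sketch does not prove that this is what happens inside the library.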

vincentlaucsb commented 3 months ago

Thanks for your report, I'll take a look

vincentlaucsb commented 3 weeks ago

Should be fixed in the latest release