robertoraggi / cplusplus

A compiler front end for the C++ language
https://robertoraggi.github.io/cplusplus/
MIT License
123 stars 12 forks source link

Memory usage and permissive parsing with sqlite3 #363

Open mingodad opened 6 months ago

mingodad commented 6 months ago

Trying to parse sqlite3.c with this project I found that it can parse it without showing any error/warnings but uses a bit more memory than gcc/g++ (cxx=3,000,000 allocations, g++=500,000 allocations):

/usr/bin/time ./cxx -toolchain linux ./sqlite3.cpp -fsyntax-only -I.
1.01user 0.06system 0:01.08elapsed 99%CPU (0avgtext+0avgdata 146628maxresident)k
0inputs+0outputs (0major+39686minor)pagefaults 0swaps
/usr/bin/time g++ -fpermissive ./sqlite3.cpp -fsyntax-only -I.
1.08user 0.07system 0:01.16elapsed 99%CPU (0avgtext+0avgdata 115564maxresident)k
0inputs+0outputs (0major+29614minor)pagefaults 0swaps
/usr/bin/time myvalgrind ./cxx -toolchain linux ./sqlite3.cpp -fsyntax-only -I.
==21479==     in use at exit: 0 bytes in 0 blocks
==21479==   total heap usage: 3,003,109 allocs, 3,003,109 frees, 4,175,102,281 bytes allocated
55.95user 0.31system 0:56.27elapsed 99%CPU (0avgtext+0avgdata 408920maxresident)k
0inputs+0outputs (0major+127697minor)pagefaults 0swaps
/usr/bin/time myvalgrind --trace-children=yes g++ -fpermissive ./sqlite3.cpp -fsyntax-only -I.
==21623== Command: g++ -fpermissive ./sqlite3.cpp -fsyntax-only -I.
==21624==     in use at exit: 7,045,865 bytes in 15,051 blocks
==21624==   total heap usage: 490,444 allocs, 475,393 frees, 649,209,160 bytes allocated
==21624== Command: /usr/lib/gcc/x86_64-linux-gnu/9/cc1plus -quiet -I . -imultiarch x86_64-linux-gnu -D_GNU_SOURCE ./sqlite3.cpp -quiet -dumpbase sqlite3.cpp -mtune=generic -march=x86-64 -auxbase sqlite3 -fpermissive -fsyntax-only -o /dev/null -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security
==21623==     in use at exit: 189,449 bytes in 69 blocks
==21623==   total heap usage: 234 allocs, 165 frees, 215,031 bytes allocated
40.27user 0.34system 0:40.63elapsed 99%CPU (0avgtext+0avgdata 337688maxresident)k
4416inputs+0outputs (1major+147633minor)pagefaults 0swaps

But renaming sqlite3.c to sqlite3.cpp and fixing several errors pointed out by g++ then this project still doesn't show any warning/error while g++ needs -fpermissive to parse it and show several warnings:

./sqlite3.cpp: In function ‘void* sqlite3MemMalloc(int)’:
./sqlite3.cpp:26707:44: warning: invalid conversion from ‘void*’ to ‘sqlite3_int64*’ {aka ‘long long int*’} [-fpermissive]
26707 | #define SQLITE_MALLOC(x)             malloc(x)
      |                                      ~~~~~~^~~
      |                                            |
      |                                            void*
./sqlite3.cpp:26773:7: note: in expansion of macro ‘SQLITE_MALLOC’
26773 |   p = SQLITE_MALLOC( nByte+8 );
      |       ^~~~~~~~~~~~~
./sqlite3.cpp: In function ‘void* sqlite3MemRealloc(void*, int)’:
./sqlite3.cpp:26709:45: warning: invalid conversion from ‘void*’ to ‘sqlite3_int64*’ {aka ‘long long int*’} [-fpermissive]
26709 | #define SQLITE_REALLOC(x,y)          realloc((x),(y))
      |                                      ~~~~~~~^~~~~~~~~
      |                                             |
      |                                             void*
...
./sqlite3.cpp: In function ‘int jsonEachNext(sqlite3_vtab_cursor*)’:
./sqlite3.cpp:209852:32: warning: invalid conversion from ‘void*’ to ‘JsonParent*’ [-fpermissive]
209852 |         pNew = sqlite3DbRealloc(p->db, p->aParent, sizeof(JsonParent)*nNew);
       |                ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       |                                |
       |                                void*
./sqlite3.cpp: In function ‘int jsonEachFilter(sqlite3_vtab_cursor*, int, const char*, int, sqlite3_value**)’:
./sqlite3.cpp:210179:37: warning: invalid conversion from ‘void*’ to ‘JsonParent*’ [-fpermissive]
210179 |     p->aParent = sqlite3DbMallocZero(p->db, sizeof(JsonParent));
       |                  ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
       |                                     |
       |                                     void*

I'm attaching the files used by this test: sqlite3-c-cpp.zip

mingodad commented 6 months ago

Here is the output of valgrind with clang-17 (1,000,000 allocations):

/usr/bin/time myvalgrind --trace-children=yes clang-17-env clang ./sqlite3.c -c -I.
==18426== Memcheck, a memory error detector
==18426== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==18426== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==18426== Command: /home/mingo/bin/clang-17-env clang ./sqlite3.c -c -I.
==18426== 
==18427== Memcheck, a memory error detector
==18427== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==18427== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==18427== Command: /home/mingo/bin/clang-3-env-base clang ./sqlite3.c -c -I.
==18427== 
==18428== Memcheck, a memory error detector
==18428== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==18428== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==18428== Command: /home/mingo/local/clang-17/bin/clang ./sqlite3.c -c -I.
==18428== 
==18428== 
==18428== HEAP SUMMARY:
==18428==     in use at exit: 44,223,475 bytes in 324,547 blocks
==18428==   total heap usage: 978,232 allocs, 653,685 frees, 536,676,783 bytes allocated
==18428== 
==18428== LEAK SUMMARY:
==18428==    definitely lost: 0 bytes in 0 blocks
==18428==    indirectly lost: 0 bytes in 0 blocks
==18428==      possibly lost: 34,882,928 bytes in 301,772 blocks
==18428==    still reachable: 9,340,547 bytes in 22,775 blocks
==18428==         suppressed: 0 bytes in 0 blocks
==18428== Rerun with --leak-check=full to see details of leaked memory
==18428== 
==18428== For lists of detected and suppressed errors, rerun with: -s
==18428== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==18427== 
==18427== HEAP SUMMARY:
==18427==     in use at exit: 48,333 bytes in 775 blocks
==18427==   total heap usage: 1,491 allocs, 716 frees, 83,250 bytes allocated
==18427== 
==18427== LEAK SUMMARY:
==18427==    definitely lost: 281 bytes in 1 blocks
==18427==    indirectly lost: 0 bytes in 0 blocks
==18427==      possibly lost: 0 bytes in 0 blocks
==18427==    still reachable: 48,052 bytes in 774 blocks
==18427==         suppressed: 0 bytes in 0 blocks
==18427== Rerun with --leak-check=full to see details of leaked memory
==18427== 
==18427== For lists of detected and suppressed errors, rerun with: -s
==18427== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==18426== 
==18426== HEAP SUMMARY:
==18426==     in use at exit: 44,798 bytes in 692 blocks
==18426==   total heap usage: 974 allocs, 282 frees, 64,546 bytes allocated
==18426== 
==18426== LEAK SUMMARY:
==18426==    definitely lost: 281 bytes in 1 blocks
==18426==    indirectly lost: 0 bytes in 0 blocks
==18426==      possibly lost: 0 bytes in 0 blocks
==18426==    still reachable: 44,517 bytes in 691 blocks
==18426==         suppressed: 0 bytes in 0 blocks
==18426== Rerun with --leak-check=full to see details of leaked memory
==18426== 
==18426== For lists of detected and suppressed errors, rerun with: -s
==18426== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
96.97user 0.55system 1:37.67elapsed 99%CPU (0avgtext+0avgdata 507316maxresident)k
85352inputs+3016outputs (2major+181810minor)pagefaults 0swaps
mingodad commented 6 months ago

Notice that I like that this project can parse C source files but when they are renamed from .c to .cpp I would expect it to be a bit more C++ strict.

robertoraggi commented 6 months ago

@mingodad Most memory is used for temporary tokens and data structures during preprocessing, along with symbols in the scope chain. The AST is stored in an arena, which should be fine. I'm working on code to reduce the preprocessor memory usage, then we should see numbers similar to other C++ frontends.

mingodad commented 6 months ago

Thank you for reply ! Attached is the dirty changes I made to instrument the code to show timing and allocation while parsing sqlite3.c:

fprintf(stderr, "%s: Time taken %d seconds %d milliseconds, %'zd : %'zd : %'zd : %'zd : %'zd : %'zd\n", title, msec/1000, msec%1000, getCurrentRSS(), getPeakRSS(),
            malloc_count, free_count, malloc_count-malloc_prev_count, free_count-free_prev_count);
/usr/bin/time ./cxx -toolchain linux ./sqlite3.c -c -I.
start main: Time taken 0 seconds 1 milliseconds, 2150400 : 2150400 : 247 : 38 : 247 : 38
before readAll: Time taken 0 seconds 1 milliseconds, 4812800 : 4812800 : 4970 : 3211 : 4723 : 3173
after unit.setSource: Time taken 0 seconds 817 milliseconds, 146423808 : 146423808 : 188992 : 144030 : 184022 : 140819
before preprocesor->squeeze: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 188992 : 144031 : 0 : 1
before unit.parse: Time taken 0 seconds 1 milliseconds, 137555968 : 146423808 : 188992 : 150101 : 0 : 6070
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189020 : 150101 : 28 : 0
/usr/lib/gcc/x86_64-linux-gnu/11/include/stdarg.h:40:1: 1 : 2 : 1266
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189026 : 150104 : 6 : 3
/usr/lib/gcc/x86_64-linux-gnu/11/include/stdarg.h:99:1: 2 : 2 : 3405
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189038 : 150106 : 12 : 2
./sqlite3.c:500:12: 3 : 2 : 19459
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189056 : 150108 : 18 : 2
./sqlite3.c:501:12: 4 : 2 : 19517
...
declaration: Time taken 0 seconds 0 milliseconds, 150380544 : 150380544 : 3002651 : 2453607 : 23 : 15
./sqlite3.c:257674:12: 4589 : 11 : 9088707
after unit.parse: Time taken 0 seconds 0 milliseconds, 150380544 : 150380544 : 3002651 : 2455111 : 0 : 1504
before diagnosticsClient.verifyExpectedDiagnostics: Time taken 0 seconds 0 milliseconds, 150380544 : 150380544 : 3002651 : 2455111 : 0 : 0
end main: Time taken 0 seconds 44 milliseconds, 129077248 : 150380544 : 3002651 : 3002439 : 0 : 547328
Total tokens = 766945
Total declarations = 4589
1.73user 0.17system 0:01.93elapsed 98%CPU (0avgtext+0avgdata 146856maxresident)k
0inputs+0outputs (0major+41904minor)pagefaults 0swaps

debugmem.zip

mingodad commented 6 months ago

Doing the above experiment I've noticed that using the code shown bellow to show where the declaration start doesn't work in all cases because the Token is reused and only register the initial location, in sqlite3.c most of the functions start with SQLITE_PRIVATE and its definition is at the top of the file.

const Token& tk = unit->tokenAt(declaration->firstSourceLocation());
      unsigned line, column;
      std::string_view fileName;
      unit->preprocessor()->getTokenStartPosition(tk, &line, &column, &fileName);
      printf("%.*s:%u:%u: %zd : %d : %u\n", (int)fileName.size(), fileName.data(), line, column, decl_count, (int)declaration->kind(), tk.offset());fflush(stdout);

There is a way to get the token usage location ?

robertoraggi commented 6 months ago

@mingodad Thanks for sharing your changes, I will have a look at them this weekend.

Getting the ranges of the declarations using first/last SourceLocation should work, that is unless the declarations are generated from macro expansions, in such cases the preprocessor may assign token positions from the macro definitions, this is bug, I should probably fix this.

I wrote a simple Web app to test the tokens and AST locations. Unfortunately the current JavaScript API does not have support to resolve #include directives (I'm working on it), so you will need to run the preprocessor before using it.

cxx -E -P sqlite3.c -toolchain linux -o sqlite3.i

then paste the content of sqlite3.i to https://robertoraggi.github.io/cplusplus/?path=/story/cxxfrontend-syntaxtree--basic move the text cursor (or edit the C++ code), the AST browser will sync with the text editor.

mingodad commented 1 day ago

With the latest changes I'm getting errors when trying to parse sqlite3.c:

/usr/bin/time mycxx -toolchain linux -fsyntax-only sqlite3.c > /dev/null

/usr/include/c++/11/type_traits:3391:23: expected a declaration
    requires __is_enum(_Tp)
                      ^
/usr/include/c++/11/type_traits:3391:27: expected a declaration
    requires __is_enum(_Tp)
                          ^
/usr/include/c++/11/type_traits:3392:5: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
    ^
/usr/include/c++/11/type_traits:3392:8: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
       ^
/usr/include/c++/11/type_traits:3392:16: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
               ^
/usr/include/c++/11/type_traits:3392:24: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
                       ^
/usr/include/c++/11/type_traits:3392:26: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
                         ^
/usr/include/c++/11/type_traits:3392:32: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
                               ^
/usr/include/c++/11/type_traits:3392:37: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
                                    ^
/usr/include/c++/11/type_traits:3400:23: expected a declaration
    requires __is_enum(_Tp)
                      ^
/usr/include/c++/11/type_traits:3400:27: expected a declaration
    requires __is_enum(_Tp)
                          ^
/usr/include/c++/11/type_traits:3401:5: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
    ^
/usr/include/c++/11/type_traits:3401:8: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
       ^
/usr/include/c++/11/type_traits:3401:16: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
               ^
/usr/include/c++/11/type_traits:3401:24: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
                       ^
/usr/include/c++/11/type_traits:3401:26: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
                         ^
/usr/include/c++/11/type_traits:3401:32: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
                               ^
/usr/include/c++/11/type_traits:3401:37: expected a declaration
    && requires(_Tp __t) { __t = __t; } // fails if incomplete
                                    ^
/usr/include/c++/11/type_traits:3599:1: expected a declaration
} // namespace std
^
/usr/include/c++/11/cmath:1938:1: expected a declaration
} // extern "C++"
^
robertoraggi commented 1 day ago

Which version of sqlite are you using? I downloaded a random version, and it appears to parse correctly with gcc 11 and 14 headers, as long as I pass the -Dregister=“” to cxx.

mingodad commented 19 hours ago

I isolated the problem with this test: test-cmath.cpp

#include <cmath>

Testing with cxx -toolchain linux -fsyntax-only test-cmath.cpp gives the same output as reported here https://github.com/robertoraggi/cplusplus/issues/363#issuecomment-2470320209 .

I'm attaching (see bellow) the output of:

g++9 -E test-cmath.cpp > cmath-g++9.i
g++14 -E test-cmath.cpp > cmath-g++14.i
clang++19 -E test-cmath.cpp > cmath-clang++19.i
cxx -toolchain linux -E test-cmath.cpp > cmath-cxx.i

Also when trying to parse the preprocessed files cxx can't parse the one generated by itself:

cxx -toolchain linux -fsyntax-only cmath-cxx.i
cmath-cxx.i:3579:5: expected a name
    [ [ deprecated ( "use is_standard_layout && is_trivial instead") ] ]
    ^
...
-------
cxx -toolchain linux -fsyntax-only cmath-g++9.i
#parse ok
-------
cxx -toolchain linux -fsyntax-only cmath-g++14.i
cmath-g++14.i:9121:39: expected ';'
    { using type = __type_pack_element<_Np, _Types...>; };
                                      ^
...
-------
cxx -toolchain linux -fsyntax-only cmath-clang++19.i
#parse ok

cmath.zip

robertoraggi commented 10 hours ago
cxx -toolchain linux -fsyntax-only cmath-cxx.i
cmath-cxx.i:3579:5: expected a name
    [ [ deprecated ( "use is_standard_layout && is_trivial instead") ] ]

This is the parser being too strict and not allowing spaces between brackets in C++ attributes. There are instances where the preprocessor introduces spaces, which can break the parsing of the preprocessor’s output. It should be fixed in https://github.com/robertoraggi/cplusplus/pull/416

You can’t parse the output preprocessed by gcc with cxx because gcc has a few built-in functions that I haven’t implemented yet (for example, __reference_constructs_from_temporary and __type_pack_element).

(https://github.com/robertoraggi/cplusplus/blob/main/.github/workflows/ci.yml#L54) I have a test in the CI pipeline for this. I’m sure that cmath is included when parsing the repository.

robertoraggi commented 10 hours ago

On the other hand, parsing the output of cxx's preprocessor with gcc might work, but it all depends on which built-in (tested with __has_builtin) gcc is expecting. You may need to pass -w to gcc to avoid reporting warning from system headers.

echo '#include <iostream>' | ./build/src/frontend/cxx -toolchain linux -E - | g++ -fsyntax-only -xc++ -std=c++26 -w -