robertoraggi / cplusplus

A compiler front end for the C++ language
https://robertoraggi.github.io/cplusplus/
MIT License
118 stars 11 forks source link

Memory usage and permissive parsing with sqlite3 #363

Open mingodad opened 1 month ago

mingodad commented 1 month ago

Trying to parse sqlite3.c with this project I found that it can parse it without showing any error/warnings but uses a bit more memory than gcc/g++ (cxx=3,000,000 allocations, g++=500,000 allocations):

/usr/bin/time ./cxx -toolchain linux ./sqlite3.cpp -fsyntax-only -I.
1.01user 0.06system 0:01.08elapsed 99%CPU (0avgtext+0avgdata 146628maxresident)k
0inputs+0outputs (0major+39686minor)pagefaults 0swaps
/usr/bin/time g++ -fpermissive ./sqlite3.cpp -fsyntax-only -I.
1.08user 0.07system 0:01.16elapsed 99%CPU (0avgtext+0avgdata 115564maxresident)k
0inputs+0outputs (0major+29614minor)pagefaults 0swaps
/usr/bin/time myvalgrind ./cxx -toolchain linux ./sqlite3.cpp -fsyntax-only -I.
==21479==     in use at exit: 0 bytes in 0 blocks
==21479==   total heap usage: 3,003,109 allocs, 3,003,109 frees, 4,175,102,281 bytes allocated
55.95user 0.31system 0:56.27elapsed 99%CPU (0avgtext+0avgdata 408920maxresident)k
0inputs+0outputs (0major+127697minor)pagefaults 0swaps
/usr/bin/time myvalgrind --trace-children=yes g++ -fpermissive ./sqlite3.cpp -fsyntax-only -I.
==21623== Command: g++ -fpermissive ./sqlite3.cpp -fsyntax-only -I.
==21624==     in use at exit: 7,045,865 bytes in 15,051 blocks
==21624==   total heap usage: 490,444 allocs, 475,393 frees, 649,209,160 bytes allocated
==21624== Command: /usr/lib/gcc/x86_64-linux-gnu/9/cc1plus -quiet -I . -imultiarch x86_64-linux-gnu -D_GNU_SOURCE ./sqlite3.cpp -quiet -dumpbase sqlite3.cpp -mtune=generic -march=x86-64 -auxbase sqlite3 -fpermissive -fsyntax-only -o /dev/null -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security
==21623==     in use at exit: 189,449 bytes in 69 blocks
==21623==   total heap usage: 234 allocs, 165 frees, 215,031 bytes allocated
40.27user 0.34system 0:40.63elapsed 99%CPU (0avgtext+0avgdata 337688maxresident)k
4416inputs+0outputs (1major+147633minor)pagefaults 0swaps

But renaming sqlite3.c to sqlite3.cpp and fixing several errors pointed out by g++ then this project still doesn't show any warning/error while g++ needs -fpermissive to parse it and show several warnings:

./sqlite3.cpp: In function ‘void* sqlite3MemMalloc(int)’:
./sqlite3.cpp:26707:44: warning: invalid conversion from ‘void*’ to ‘sqlite3_int64*’ {aka ‘long long int*’} [-fpermissive]
26707 | #define SQLITE_MALLOC(x)             malloc(x)
      |                                      ~~~~~~^~~
      |                                            |
      |                                            void*
./sqlite3.cpp:26773:7: note: in expansion of macro ‘SQLITE_MALLOC’
26773 |   p = SQLITE_MALLOC( nByte+8 );
      |       ^~~~~~~~~~~~~
./sqlite3.cpp: In function ‘void* sqlite3MemRealloc(void*, int)’:
./sqlite3.cpp:26709:45: warning: invalid conversion from ‘void*’ to ‘sqlite3_int64*’ {aka ‘long long int*’} [-fpermissive]
26709 | #define SQLITE_REALLOC(x,y)          realloc((x),(y))
      |                                      ~~~~~~~^~~~~~~~~
      |                                             |
      |                                             void*
...
./sqlite3.cpp: In function ‘int jsonEachNext(sqlite3_vtab_cursor*)’:
./sqlite3.cpp:209852:32: warning: invalid conversion from ‘void*’ to ‘JsonParent*’ [-fpermissive]
209852 |         pNew = sqlite3DbRealloc(p->db, p->aParent, sizeof(JsonParent)*nNew);
       |                ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       |                                |
       |                                void*
./sqlite3.cpp: In function ‘int jsonEachFilter(sqlite3_vtab_cursor*, int, const char*, int, sqlite3_value**)’:
./sqlite3.cpp:210179:37: warning: invalid conversion from ‘void*’ to ‘JsonParent*’ [-fpermissive]
210179 |     p->aParent = sqlite3DbMallocZero(p->db, sizeof(JsonParent));
       |                  ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
       |                                     |
       |                                     void*

I'm attaching the files used by this test: sqlite3-c-cpp.zip

mingodad commented 1 month ago

Here is the output of valgrind with clang-17 (1,000,000 allocations):

/usr/bin/time myvalgrind --trace-children=yes clang-17-env clang ./sqlite3.c -c -I.
==18426== Memcheck, a memory error detector
==18426== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==18426== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==18426== Command: /home/mingo/bin/clang-17-env clang ./sqlite3.c -c -I.
==18426== 
==18427== Memcheck, a memory error detector
==18427== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==18427== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==18427== Command: /home/mingo/bin/clang-3-env-base clang ./sqlite3.c -c -I.
==18427== 
==18428== Memcheck, a memory error detector
==18428== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==18428== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==18428== Command: /home/mingo/local/clang-17/bin/clang ./sqlite3.c -c -I.
==18428== 
==18428== 
==18428== HEAP SUMMARY:
==18428==     in use at exit: 44,223,475 bytes in 324,547 blocks
==18428==   total heap usage: 978,232 allocs, 653,685 frees, 536,676,783 bytes allocated
==18428== 
==18428== LEAK SUMMARY:
==18428==    definitely lost: 0 bytes in 0 blocks
==18428==    indirectly lost: 0 bytes in 0 blocks
==18428==      possibly lost: 34,882,928 bytes in 301,772 blocks
==18428==    still reachable: 9,340,547 bytes in 22,775 blocks
==18428==         suppressed: 0 bytes in 0 blocks
==18428== Rerun with --leak-check=full to see details of leaked memory
==18428== 
==18428== For lists of detected and suppressed errors, rerun with: -s
==18428== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==18427== 
==18427== HEAP SUMMARY:
==18427==     in use at exit: 48,333 bytes in 775 blocks
==18427==   total heap usage: 1,491 allocs, 716 frees, 83,250 bytes allocated
==18427== 
==18427== LEAK SUMMARY:
==18427==    definitely lost: 281 bytes in 1 blocks
==18427==    indirectly lost: 0 bytes in 0 blocks
==18427==      possibly lost: 0 bytes in 0 blocks
==18427==    still reachable: 48,052 bytes in 774 blocks
==18427==         suppressed: 0 bytes in 0 blocks
==18427== Rerun with --leak-check=full to see details of leaked memory
==18427== 
==18427== For lists of detected and suppressed errors, rerun with: -s
==18427== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==18426== 
==18426== HEAP SUMMARY:
==18426==     in use at exit: 44,798 bytes in 692 blocks
==18426==   total heap usage: 974 allocs, 282 frees, 64,546 bytes allocated
==18426== 
==18426== LEAK SUMMARY:
==18426==    definitely lost: 281 bytes in 1 blocks
==18426==    indirectly lost: 0 bytes in 0 blocks
==18426==      possibly lost: 0 bytes in 0 blocks
==18426==    still reachable: 44,517 bytes in 691 blocks
==18426==         suppressed: 0 bytes in 0 blocks
==18426== Rerun with --leak-check=full to see details of leaked memory
==18426== 
==18426== For lists of detected and suppressed errors, rerun with: -s
==18426== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
96.97user 0.55system 1:37.67elapsed 99%CPU (0avgtext+0avgdata 507316maxresident)k
85352inputs+3016outputs (2major+181810minor)pagefaults 0swaps
mingodad commented 1 month ago

Notice that I like that this project can parse C source files but when they are renamed from .c to .cpp I would expect it to be a bit more C++ strict.

robertoraggi commented 1 month ago

@mingodad Most memory is used for temporary tokens and data structures during preprocessing, along with symbols in the scope chain. The AST is stored in an arena, which should be fine. I'm working on code to reduce the preprocessor memory usage, then we should see numbers similar to other C++ frontends.

mingodad commented 1 month ago

Thank you for reply ! Attached is the dirty changes I made to instrument the code to show timing and allocation while parsing sqlite3.c:

fprintf(stderr, "%s: Time taken %d seconds %d milliseconds, %'zd : %'zd : %'zd : %'zd : %'zd : %'zd\n", title, msec/1000, msec%1000, getCurrentRSS(), getPeakRSS(),
            malloc_count, free_count, malloc_count-malloc_prev_count, free_count-free_prev_count);
/usr/bin/time ./cxx -toolchain linux ./sqlite3.c -c -I.
start main: Time taken 0 seconds 1 milliseconds, 2150400 : 2150400 : 247 : 38 : 247 : 38
before readAll: Time taken 0 seconds 1 milliseconds, 4812800 : 4812800 : 4970 : 3211 : 4723 : 3173
after unit.setSource: Time taken 0 seconds 817 milliseconds, 146423808 : 146423808 : 188992 : 144030 : 184022 : 140819
before preprocesor->squeeze: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 188992 : 144031 : 0 : 1
before unit.parse: Time taken 0 seconds 1 milliseconds, 137555968 : 146423808 : 188992 : 150101 : 0 : 6070
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189020 : 150101 : 28 : 0
/usr/lib/gcc/x86_64-linux-gnu/11/include/stdarg.h:40:1: 1 : 2 : 1266
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189026 : 150104 : 6 : 3
/usr/lib/gcc/x86_64-linux-gnu/11/include/stdarg.h:99:1: 2 : 2 : 3405
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189038 : 150106 : 12 : 2
./sqlite3.c:500:12: 3 : 2 : 19459
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189056 : 150108 : 18 : 2
./sqlite3.c:501:12: 4 : 2 : 19517
...
declaration: Time taken 0 seconds 0 milliseconds, 150380544 : 150380544 : 3002651 : 2453607 : 23 : 15
./sqlite3.c:257674:12: 4589 : 11 : 9088707
after unit.parse: Time taken 0 seconds 0 milliseconds, 150380544 : 150380544 : 3002651 : 2455111 : 0 : 1504
before diagnosticsClient.verifyExpectedDiagnostics: Time taken 0 seconds 0 milliseconds, 150380544 : 150380544 : 3002651 : 2455111 : 0 : 0
end main: Time taken 0 seconds 44 milliseconds, 129077248 : 150380544 : 3002651 : 3002439 : 0 : 547328
Total tokens = 766945
Total declarations = 4589
1.73user 0.17system 0:01.93elapsed 98%CPU (0avgtext+0avgdata 146856maxresident)k
0inputs+0outputs (0major+41904minor)pagefaults 0swaps

debugmem.zip

mingodad commented 1 month ago

Doing the above experiment I've noticed that using the code shown bellow to show where the declaration start doesn't work in all cases because the Token is reused and only register the initial location, in sqlite3.c most of the functions start with SQLITE_PRIVATE and its definition is at the top of the file.

const Token& tk = unit->tokenAt(declaration->firstSourceLocation());
      unsigned line, column;
      std::string_view fileName;
      unit->preprocessor()->getTokenStartPosition(tk, &line, &column, &fileName);
      printf("%.*s:%u:%u: %zd : %d : %u\n", (int)fileName.size(), fileName.data(), line, column, decl_count, (int)declaration->kind(), tk.offset());fflush(stdout);

There is a way to get the token usage location ?

robertoraggi commented 1 month ago

@mingodad Thanks for sharing your changes, I will have a look at them this weekend.

Getting the ranges of the declarations using first/last SourceLocation should work, that is unless the declarations are generated from macro expansions, in such cases the preprocessor may assign token positions from the macro definitions, this is bug, I should probably fix this.

I wrote a simple Web app to test the tokens and AST locations. Unfortunately the current JavaScript API does not have support to resolve #include directives (I'm working on it), so you will need to run the preprocessor before using it.

cxx -E -P sqlite3.c -toolchain linux -o sqlite3.i

then paste the content of sqlite3.i to https://robertoraggi.github.io/cplusplus/?path=/story/cxxfrontend-syntaxtree--basic move the text cursor (or edit the C++ code), the AST browser will sync with the text editor.