Open mingodad opened 6 months ago
Here is the output of valgrind
with clang-17
(1,000,000 allocations):
/usr/bin/time myvalgrind --trace-children=yes clang-17-env clang ./sqlite3.c -c -I.
==18426== Memcheck, a memory error detector
==18426== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==18426== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==18426== Command: /home/mingo/bin/clang-17-env clang ./sqlite3.c -c -I.
==18426==
==18427== Memcheck, a memory error detector
==18427== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==18427== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==18427== Command: /home/mingo/bin/clang-3-env-base clang ./sqlite3.c -c -I.
==18427==
==18428== Memcheck, a memory error detector
==18428== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==18428== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==18428== Command: /home/mingo/local/clang-17/bin/clang ./sqlite3.c -c -I.
==18428==
==18428==
==18428== HEAP SUMMARY:
==18428== in use at exit: 44,223,475 bytes in 324,547 blocks
==18428== total heap usage: 978,232 allocs, 653,685 frees, 536,676,783 bytes allocated
==18428==
==18428== LEAK SUMMARY:
==18428== definitely lost: 0 bytes in 0 blocks
==18428== indirectly lost: 0 bytes in 0 blocks
==18428== possibly lost: 34,882,928 bytes in 301,772 blocks
==18428== still reachable: 9,340,547 bytes in 22,775 blocks
==18428== suppressed: 0 bytes in 0 blocks
==18428== Rerun with --leak-check=full to see details of leaked memory
==18428==
==18428== For lists of detected and suppressed errors, rerun with: -s
==18428== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==18427==
==18427== HEAP SUMMARY:
==18427== in use at exit: 48,333 bytes in 775 blocks
==18427== total heap usage: 1,491 allocs, 716 frees, 83,250 bytes allocated
==18427==
==18427== LEAK SUMMARY:
==18427== definitely lost: 281 bytes in 1 blocks
==18427== indirectly lost: 0 bytes in 0 blocks
==18427== possibly lost: 0 bytes in 0 blocks
==18427== still reachable: 48,052 bytes in 774 blocks
==18427== suppressed: 0 bytes in 0 blocks
==18427== Rerun with --leak-check=full to see details of leaked memory
==18427==
==18427== For lists of detected and suppressed errors, rerun with: -s
==18427== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==18426==
==18426== HEAP SUMMARY:
==18426== in use at exit: 44,798 bytes in 692 blocks
==18426== total heap usage: 974 allocs, 282 frees, 64,546 bytes allocated
==18426==
==18426== LEAK SUMMARY:
==18426== definitely lost: 281 bytes in 1 blocks
==18426== indirectly lost: 0 bytes in 0 blocks
==18426== possibly lost: 0 bytes in 0 blocks
==18426== still reachable: 44,517 bytes in 691 blocks
==18426== suppressed: 0 bytes in 0 blocks
==18426== Rerun with --leak-check=full to see details of leaked memory
==18426==
==18426== For lists of detected and suppressed errors, rerun with: -s
==18426== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
96.97user 0.55system 1:37.67elapsed 99%CPU (0avgtext+0avgdata 507316maxresident)k
85352inputs+3016outputs (2major+181810minor)pagefaults 0swaps
Notice that I like that this project can parse C
source files but when they are renamed from .c
to .cpp
I would expect it to be a bit more C++
strict.
@mingodad Most memory is used for temporary tokens and data structures during preprocessing, along with symbols in the scope chain. The AST is stored in an arena, which should be fine. I'm working on code to reduce the preprocessor memory usage, then we should see numbers similar to other C++ frontends.
Thank you for reply !
Attached is the dirty changes I made to instrument the code to show timing and allocation while parsing sqlite3.c
:
fprintf(stderr, "%s: Time taken %d seconds %d milliseconds, %'zd : %'zd : %'zd : %'zd : %'zd : %'zd\n", title, msec/1000, msec%1000, getCurrentRSS(), getPeakRSS(),
malloc_count, free_count, malloc_count-malloc_prev_count, free_count-free_prev_count);
/usr/bin/time ./cxx -toolchain linux ./sqlite3.c -c -I.
start main: Time taken 0 seconds 1 milliseconds, 2150400 : 2150400 : 247 : 38 : 247 : 38
before readAll: Time taken 0 seconds 1 milliseconds, 4812800 : 4812800 : 4970 : 3211 : 4723 : 3173
after unit.setSource: Time taken 0 seconds 817 milliseconds, 146423808 : 146423808 : 188992 : 144030 : 184022 : 140819
before preprocesor->squeeze: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 188992 : 144031 : 0 : 1
before unit.parse: Time taken 0 seconds 1 milliseconds, 137555968 : 146423808 : 188992 : 150101 : 0 : 6070
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189020 : 150101 : 28 : 0
/usr/lib/gcc/x86_64-linux-gnu/11/include/stdarg.h:40:1: 1 : 2 : 1266
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189026 : 150104 : 6 : 3
/usr/lib/gcc/x86_64-linux-gnu/11/include/stdarg.h:99:1: 2 : 2 : 3405
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189038 : 150106 : 12 : 2
./sqlite3.c:500:12: 3 : 2 : 19459
declaration: Time taken 0 seconds 0 milliseconds, 137555968 : 146423808 : 189056 : 150108 : 18 : 2
./sqlite3.c:501:12: 4 : 2 : 19517
...
declaration: Time taken 0 seconds 0 milliseconds, 150380544 : 150380544 : 3002651 : 2453607 : 23 : 15
./sqlite3.c:257674:12: 4589 : 11 : 9088707
after unit.parse: Time taken 0 seconds 0 milliseconds, 150380544 : 150380544 : 3002651 : 2455111 : 0 : 1504
before diagnosticsClient.verifyExpectedDiagnostics: Time taken 0 seconds 0 milliseconds, 150380544 : 150380544 : 3002651 : 2455111 : 0 : 0
end main: Time taken 0 seconds 44 milliseconds, 129077248 : 150380544 : 3002651 : 3002439 : 0 : 547328
Total tokens = 766945
Total declarations = 4589
1.73user 0.17system 0:01.93elapsed 98%CPU (0avgtext+0avgdata 146856maxresident)k
0inputs+0outputs (0major+41904minor)pagefaults 0swaps
Doing the above experiment I've noticed that using the code shown bellow to show where the declaration start doesn't work in all cases because the Token
is reused and only register the initial location, in sqlite3.c
most of the functions start with SQLITE_PRIVATE
and its definition is at the top of the file.
const Token& tk = unit->tokenAt(declaration->firstSourceLocation());
unsigned line, column;
std::string_view fileName;
unit->preprocessor()->getTokenStartPosition(tk, &line, &column, &fileName);
printf("%.*s:%u:%u: %zd : %d : %u\n", (int)fileName.size(), fileName.data(), line, column, decl_count, (int)declaration->kind(), tk.offset());fflush(stdout);
There is a way to get the token usage location ?
@mingodad Thanks for sharing your changes, I will have a look at them this weekend.
Getting the ranges of the declarations using first/last SourceLocation should work, that is unless the declarations are generated from macro expansions, in such cases the preprocessor may assign token positions from the macro definitions, this is bug, I should probably fix this.
I wrote a simple Web app to test the tokens and AST locations. Unfortunately the current JavaScript API does not have support to resolve #include
directives (I'm working on it), so you will need to run the preprocessor before using it.
cxx -E -P sqlite3.c -toolchain linux -o sqlite3.i
then paste the content of sqlite3.i to https://robertoraggi.github.io/cplusplus/?path=/story/cxxfrontend-syntaxtree--basic move the text cursor (or edit the C++ code), the AST browser will sync with the text editor.
With the latest changes I'm getting errors when trying to parse sqlite3.c
:
/usr/bin/time mycxx -toolchain linux -fsyntax-only sqlite3.c > /dev/null
/usr/include/c++/11/type_traits:3391:23: expected a declaration
requires __is_enum(_Tp)
^
/usr/include/c++/11/type_traits:3391:27: expected a declaration
requires __is_enum(_Tp)
^
/usr/include/c++/11/type_traits:3392:5: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3392:8: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3392:16: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3392:24: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3392:26: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3392:32: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3392:37: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3400:23: expected a declaration
requires __is_enum(_Tp)
^
/usr/include/c++/11/type_traits:3400:27: expected a declaration
requires __is_enum(_Tp)
^
/usr/include/c++/11/type_traits:3401:5: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3401:8: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3401:16: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3401:24: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3401:26: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3401:32: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3401:37: expected a declaration
&& requires(_Tp __t) { __t = __t; } // fails if incomplete
^
/usr/include/c++/11/type_traits:3599:1: expected a declaration
} // namespace std
^
/usr/include/c++/11/cmath:1938:1: expected a declaration
} // extern "C++"
^
Which version of sqlite are you using? I downloaded a random version, and it appears to parse correctly with gcc 11 and 14 headers, as long as I pass the -Dregister=“”
to cxx.
I isolated the problem with this test: test-cmath.cpp
#include <cmath>
Testing with cxx -toolchain linux -fsyntax-only test-cmath.cpp
gives the same output as reported here https://github.com/robertoraggi/cplusplus/issues/363#issuecomment-2470320209 .
I'm attaching (see bellow) the output of:
g++9 -E test-cmath.cpp > cmath-g++9.i
g++14 -E test-cmath.cpp > cmath-g++14.i
clang++19 -E test-cmath.cpp > cmath-clang++19.i
cxx -toolchain linux -E test-cmath.cpp > cmath-cxx.i
Also when trying to parse the preprocessed files cxx
can't parse the one generated by itself:
cxx -toolchain linux -fsyntax-only cmath-cxx.i
cmath-cxx.i:3579:5: expected a name
[ [ deprecated ( "use is_standard_layout && is_trivial instead") ] ]
^
...
-------
cxx -toolchain linux -fsyntax-only cmath-g++9.i
#parse ok
-------
cxx -toolchain linux -fsyntax-only cmath-g++14.i
cmath-g++14.i:9121:39: expected ';'
{ using type = __type_pack_element<_Np, _Types...>; };
^
...
-------
cxx -toolchain linux -fsyntax-only cmath-clang++19.i
#parse ok
cxx -toolchain linux -fsyntax-only cmath-cxx.i
cmath-cxx.i:3579:5: expected a name
[ [ deprecated ( "use is_standard_layout && is_trivial instead") ] ]
This is the parser being too strict and not allowing spaces between brackets in C++ attributes. There are instances where the preprocessor introduces spaces, which can break the parsing of the preprocessor’s output. It should be fixed in https://github.com/robertoraggi/cplusplus/pull/416
You can’t parse the output preprocessed by gcc with cxx because gcc has a few built-in functions that I haven’t implemented yet (for example, __reference_constructs_from_temporary
and __type_pack_element
).
(https://github.com/robertoraggi/cplusplus/blob/main/.github/workflows/ci.yml#L54)
I have a test in the CI pipeline for this. I’m sure that cmath
is included when parsing the repository.
On the other hand, parsing the output of cxx's preprocessor with gcc might work, but it all depends on which built-in (tested with __has_builtin) gcc is expecting. You may need to pass -w to gcc to avoid reporting warning from system headers.
echo '#include <iostream>' | ./build/src/frontend/cxx -toolchain linux -E - | g++ -fsyntax-only -xc++ -std=c++26 -w -
Trying to parse
sqlite3.c
with this project I found that it can parse it without showing any error/warnings but uses a bit more memory thangcc/g++
(cxx=3,000,000 allocations, g++=500,000 allocations):But renaming
sqlite3.c
tosqlite3.cpp
and fixing several errors pointed out byg++
then this project still doesn't show any warning/error whileg++
needs-fpermissive
to parse it and show several warnings:I'm attaching the files used by this test: sqlite3-c-cpp.zip