stardot / beebasm

A portable 6502 assembler with BBC Micro style syntax
http://www.retrosoftware.co.uk/wiki/index.php/BeebAsm
GNU General Public License v3.0
83 stars 26 forks

Long variable names significantly slow down assembly #100

Open dp111 opened 1 month ago

dp111 commented 1 month ago

I've been making variable names a little more descriptive in MMFS, so for example MA+&1107 -> channeldata_directory_locked. Assembly has slowed down significantly as a result.

mungre commented 1 month ago

Can you provide a link to before and after commits for comparison?

mungre commented 1 month ago

beebasm memoizes every symbol that gets assigned. When a symbol is assigned in a for loop the name is mangled so that the symbol gets a new name each time round the loop. A similar scheme mangles names defined in macro invocations and nested scopes.

In the current release the mangled name is like a file system path. It contains scope IDs for the current scope and the parent scopes. Every symbol lookup reconstructs this path name before doing the lookup. It's not quick.

However, the current scope has an ID that is unique across scopes and invocations. Coupled with the FOR counter this provides a unique ID that can be used instead of the path.
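
In rough terms, the difference looks something like this C++ sketch (illustrative only; the function names and separator characters below are made up, not beebasm's actual code):

```cpp
#include <string>
#include <vector>

// Old scheme: rebuild a path of every enclosing scope ID on each lookup.
std::string MangleWithPath(const std::vector<int>& scopeIds, const std::string& symbol)
{
    std::string mangled;
    for (int id : scopeIds)                     // current scope plus all parents
    {
        mangled += "@" + std::to_string(id);    // e.g. "@3@17@42"
    }
    return mangled + "." + symbol;              // "@3@17@42.channeldata_directory_locked"
}

// New scheme: the innermost scope ID is already unique across invocations,
// so a single suffix (plus the FOR counter) is enough.
std::string MangleWithScopeId(int scopeId, int forCount, const std::string& symbol)
{
    return symbol + "@" + std::to_string(scopeId) + "@" + std::to_string(forCount);
}
```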

I've implemented this in mungre/simplesuffix (https://github.com/stardot/beebasm/tree/mungre/simplesuffix). Can you try it out?

This will be quicker, but I don't know if it accounts for a significant part of your slowdown.

dp111 commented 1 month ago

I've just given it a go. Unfortunately, I can't detect a difference. I'll get my GitHub repo into a position where I can show the difference.

mungre commented 1 month ago

Cheers. It was a bit of a punt. People always say that you don't know anything until you measure performance problems.

mungre commented 1 month ago

I used Windows Performance Analyzer to dig into it and I timed the assembly of bogomandel (https://github.com/davidgiven/bogomandel) to test the improvements. For each test I did a couple of runs as a warm-up and then averaged three more:

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line

So it's considerably faster now. These changes are also in mungre/simplesuffix.

There is still a lot of headroom. I naively assumed that reading the file pointer on a buffered stream would just return a cached value. In fact it queries the system's file pointer and subtracts the remaining buffer space (with a fudge factor for newline conversions that I couldn't be bothered to make sense of). This takes fully 20% of the run time! This may just be a feature of the C-runtime on Windows rather than a cross-platform problem though. A further 20% of the time is spent reading lines from the source file. It's slow because the lines are re-read every time round a for loop, for example. And it's quite complicated keeping track of the file pointers. I might change it to read the whole source file into a vector once.
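
For the line-reading part, the idea would be something like the sketch below: read the file once into memory and serve lines from the in-memory copy, so a FOR loop can re-read a line without touching the file pointer at all (illustrative only; LoadSourceLines is a made-up name, not the actual beebasm code):

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: load the whole source file once, then serve lines
// from memory so no per-line file-pointer queries are needed.
std::vector<std::string> LoadSourceLines(const std::string& filename)
{
    std::ifstream in(filename, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();                     // single read of the whole file
    std::istringstream text(buffer.str());

    std::vector<std::string> lines;
    std::string line;
    while (std::getline(text, line))
    {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();                  // handle CRLF newlines explicitly
        lines.push_back(line);
    }
    return lines;                             // a FOR loop can re-read lines[i] for free
}
```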

dp111 commented 1 month ago

Wow! Amazing work! I'll have a play later.

dp111 commented 1 month ago

Thanks for your work. MMFS now builds all versions in 8.5 seconds under WSL2:

real 0m8.576s
user 0m4.949s
sys 0m1.229s

mungre commented 1 month ago

Nice one. I guess that's quicker! Thanks for trying it out.

I did another simple change to read the whole source file in one go. It knocked another third off the run-time.

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load whole source file into a string

It also means that MacroInstance and SourceFile are almost identical now because they both pre-load their content. So I'm somewhat tempted to get rid of them and just use the base class SourceCode. This would be simpler, and easier to optimise further. Maybe another time.

dp111 commented 1 month ago

Just tested the latest version and I can't detect any significant change in run time now. So maybe WSL2 is better at caching file access.

mungre commented 1 month ago

I tried bogomandel on WSL2. It had a much more modest overall improvement from 4.1s to 2.7s. Faster than Windows to start with but not as fast after the work.

I also tried MMFS (the latest master branch from your repo) and the overall improvement was quite small, 52s down to 47s, but I suppose it is doing a lot of non-beebasm stuff. Your computer must be quick though!

I haven't directly addressed the original problem. If you notice that adding long symbol names is still making it significantly slower, please post an update.

mungre commented 1 month ago

This is a bit addictive. I did some profiling on Linux. It was spending a surprising amount of time comparing strings and calculating the length of tokens (assembly language instructions, directives, etc.).

Symbols are now looked up in SymbolNameCache and represented by SymbolName, which is just an integer index. This was another big speed up and the lengths of variable names shouldn't have much of an effect on performance now.
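
A minimal sketch of that kind of interning (the class below is illustrative; it is not the actual SymbolNameCache implementation):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative interning cache: each distinct symbol name maps to a small
// integer, so later comparisons and map keys are integer-sized regardless
// of how long the original name is.
class SymbolNameCache
{
public:
    int Intern(const std::string& name)
    {
        auto it = m_indexByName.find(name);
        if (it != m_indexByName.end())
            return it->second;                           // already known
        int index = static_cast<int>(m_names.size());
        m_names.push_back(name);
        m_indexByName.emplace(name, index);
        return index;
    }

    const std::string& NameOf(int index) const { return m_names[index]; }

private:
    std::unordered_map<std::string, int> m_indexByName;
    std::vector<std::string> m_names;
};
```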

The lengths of tokens are now pre-calculated.

Replacing std::map with std::unordered_map in SymbolTable was a bigger win on Linux but made the Windows version ten times slower! I think this is possibly because I'm using a 32-bit compiler on Windows and my hash function wasn't great with a 32-bit hash value. So I haven't included this for now.
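
For reference, a custom hasher can be plugged into std::unordered_map as in the sketch below. FNV-1a is just one well-known example, not necessarily the hash that was tried here, so treat it as illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Example custom hasher (64-bit FNV-1a, folded down to size_t).
// Illustrative only: a well-known function that behaves reasonably
// even when size_t (and so the map's hash value) is only 32 bits wide.
struct Fnv1aHash
{
    std::size_t operator()(const std::string& s) const
    {
        std::uint64_t h = 0xcbf29ce484222325ull;   // FNV offset basis
        for (unsigned char c : s)
        {
            h ^= c;
            h *= 0x100000001b3ull;                 // FNV prime
        }
        return static_cast<std::size_t>(h ^ (h >> 32));
    }
};

// Hypothetical use as the symbol table's container:
using SymbolMap = std::unordered_map<std::string, int, Fnv1aHash>;
```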

More bogomandel times on Windows:

6260ms - Original
5650ms - With simplified name mangling
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load file into a string
1480ms - Use SymbolName and SymbolNameCache (new)
1370ms - Calculate token lengths at compile time (new)

dp111 commented 1 month ago

Sorry for sending you down this path :). I only get about a 15%-20% improvement, but an improvement is an improvement!

Thanks

mungre commented 4 weeks ago

(Note that changes are now in mungre/perf: https://github.com/stardot/beebasm/tree/mungre/perf.)

I came up with a hash function that worked on Windows, so it's now using std::unordered_map rather than std::map. This made the improvement from SymbolName and SymbolNameCache very slight, and since that was quite an extensive change, I've removed it.

bogomandel on Windows:

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load file into a string
1480ms - Use SymbolName and SymbolNameCache
1370ms - Calculate token lengths at compile time
1350ms - Calculate all token lengths at compile time
1255ms - Using unordered_map and a better hash function
1225ms - Improved ReadFile
1210ms - Removed redundant whitespace from EatWhitespace
1290ms - Removed SymbolName and SymbolNameCache
1200ms - Replaced isdigit, isalpha, toupper
885ms - New GetLine

Then I went back to Linux and it still seemed ridiculously slow. I profiled the code and the results didn't make any sense. It was like it wasn't being optimised. And then the penny dropped.

bogomandel on Linux:

3.2s originally
1.5s with latest fixes
0.5s with -O3 !!!!!

MMFS on WSL2:

52s originally
40s original and -O3
36s latest and -O3

The latest commit in mungre/perf adds the -O3 flag to the compiler options in CMakeLists.txt. And it's much quicker.

dp111 commented 4 weeks ago

It turns out I have already been using -O3 with the Makefile in the src directory.

mungre commented 4 weeks ago

That explains it! I'd been using the cmake stuff, thinking "I don't remember it being this complicated."

I've rolled back the cmake fix because, apparently, cmake will automatically set the -O3 flag if you hold it right, but I'm yet to figure out how.

More usefully, I've altered src/Makefile to set the NDEBUG flag, which disables asserts. This made bogomandel about 20% quicker because I'd left a bunch of asserts in tight loops.
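
For context, assert expands to nothing when NDEBUG is defined, so an assert left inside a hot loop only costs time in builds without -DNDEBUG. A small illustrative example (not taken from the beebasm source):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative only: the assert inside this hot loop costs a check per
// iteration, unless the file is compiled with -DNDEBUG, in which case
// the assert expands to nothing.
int SumBytes(const unsigned char* data, std::size_t length)
{
    int total = 0;
    for (std::size_t i = 0; i < length; ++i)
    {
        assert(data != nullptr);   // removed entirely when NDEBUG is defined
        total += data[i];
    }
    return total;
}
```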

dp111 commented 4 weeks ago

No measurable difference for me, but I'm using gcc 14.1, which may have better optimisation. In case you are interested, -fanalyzer gives some potential warnings which I haven't checked. "cppcheck --enable=all ." has some performance suggestions, mainly parameters being passed by const reference. Again, I don't know if these are real issues.

mungre commented 3 weeks ago

Thanks for the suggestions. -fanalyzer found a missing check for a failed malloc (now fixed) but the complaints about use-after-free and leaking in expressions.cpp seem to be bogus. To double-check I ran the tests with msvc's debug allocator on Windows and I used valgrind's memcheck on Linux while assembling bogomandel. Neither tool grumbled.

I've fixed nearly all of the performance warnings from cppcheck. The remaining four are templated types which are numeric in practice, so they are fine passed by value. It made a noticeable improvement to bogomandel on both Windows and Linux (about 15% and 20% respectively).
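
For illustration, this is the shape of the cppcheck suggestion; the function below is hypothetical rather than one of the routines that actually changed:

```cpp
#include <string>

// Hypothetical example of cppcheck's "passedByValue" performance warning.
// Passing by value copies the string on every call:
bool EndsWithAsmByValue(std::string filename)
{
    return filename.size() >= 4 && filename.compare(filename.size() - 4, 4, ".asm") == 0;
}

// Passing by const reference avoids the copy and reads the same at the call site:
bool EndsWithAsm(const std::string& filename)
{
    return filename.size() >= 4 && filename.compare(filename.size() - 4, 4, ".asm") == 0;
}
```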

These changes are in mungre/perf again.

I upgraded to gcc-13 (which is the most recent available on my Linux install) from gcc-11 and it was a bit quicker.

On Linux using gcc-13 and src/Makefile (i.e. with -O3 throughout), bogomandel has improved from 1.34s originally to 0.39s now.

Similarly, a full build of MMFS (just the beebasm parts) has improved from 12.7s to 9.2s.

I think the improvement in MMFS is much less marked than the improvement in bogomandel simply because each run of beebasm for MMFS is doing much less but each run has the same constant overhead.

You would probably get a bigger improvement by converting build.sh to a Makefile and running the build in parallel. beebasm can accept string symbols on the command-line now (with -S) so you don't need DEVICE.asm to pass in the device and system name.

dp111 commented 3 weeks ago

Just had a chance to test the latest changes for MMFS: this takes the total build time down from 8.5 seconds to 7 seconds (NB not all of this time is beebasm), so another useful improvement!
