Several questions about TLO

Hi!

Thank you a lot for open-sourcing the project! I am a fan of different ways to optimize applications (LTO, PGO, PLO, ML-based stuff, etc.) and do corresponding benchmarks for these technologies.

After reading the Wiki page (btw, nice explanation), I have several questions and discussion points about the tool:

Which hardware architectures (like x86_64 or AArch64) are supported for running thin-layout-optimizer?
Which operating systems are supported for running thin-layout-optimizer? Linux, Windows, macOS, *BSD, anything else?
Are there any limitations on the target hardware architecture/OS of the binary being optimized? As far as I understand, no, since it's just a linker script, but I want to confirm.
What are Intel's plans for TLO? Do you plan to invest resources seriously into the project, or is it just a semi-research tool? Is it a tool designed specifically for Clear Linux use cases? Discussing this topic is very important since it can influence decisions about TLO integration in different companies.
Is there any kind of roadmap for TLO? If yes, is it possible to make it public? Like when CDSort can be expected to be enabled in TLO.
According to the README, only ld and gold are supported for now. Do you have plans to support other linkers like lld and mold?
Do you plan to implement an instrumentation mode for the tool like BOLT did? There are many cases when LBR is not available (AMD added support only 3-4 years ago, ARM only in the last year, RISC-V doesn't support it at all yet, and virtualization limits LBR usage). In these cases, TLO usage is impossible right now.
A question about the save states. Since it's a file format, what about forward/backward compatibility guarantees for them? Is it guaranteed that I can use save states from an older TLO version with a newer one? What about using save states from a newer TLO with an older TLO (a case when we downgraded TLO for some reason)?
Do you plan to support other profilers like Intel VTune?
I suggest enabling LTO + PGO + TLO optimization for TLO itself. It can help speed up processing of large Linux perf profiles.
Do you have more TLO benchmarks showing its efficiency in practice for different applications in different domains? Having only one example with Clang is not enough (IMHO). E.g. for BOLT I have this list. It is not huge either, but hopefully it can help with choosing other projects to test. Do you have more benchmark results from Clear Linux?
The -h option returns exit code 1 instead of 0 - https://github.com/intel/thin-layout-optimizer/blob/main/src/main/main.cc#L337 . Is this intended behavior? According to this link, 1 should be used in error cases.
Having prebuilt binaries would let users just download and run the tool instead of compiling it locally. I think it can be done via GitHub Actions or any similar service. Nice side outcome: if you optimize TLO itself with LTO + PGO + TLO, it will be a great example of how TLO can be integrated in practice ;)

I created the issue since Discussions are disabled for the repo for now.

Thank you in advance for answers.

Which hardware architectures (like x86_64 or AArch64) are supported for running thin-layout-optimizer?

At the moment only x86_64. This is mostly because our profiles rely on LBR. That being said, if any other arch implemented something similar, we could support that as well.

Which operating systems are supported for running thin-layout-optimizer? Linux, Windows, macOS, *BSD, anything else?

Linux for running. Profiles can be from any OS that supports perf (likely only Linux).

Are there any limitations on the target hardware architecture/OS of the binary being optimized? As far as I understand, no, since it's just a linker script, but I want to confirm.

You can run TLO on any arch. It's only been tested on x86_64/Linux, however. The profiles require perf + LBR, which at the moment means only Linux + x86_64.

What are Intel's plans for TLO? Do you plan to invest resources seriously into the project, or is it just a semi-research tool? Is it a tool designed specifically for Clear Linux use cases? Discussing this topic is very important since it can influence decisions about TLO integration in different companies.

We plan to maintain it (I know taking a week to get to this comment doesn't really help that case). I didn't have issue notifications on! Will be more responsive in the future.

The goal is to make this tool usable for all Linux distros. We currently use it on Clear Linux but are working to get it into RHEL and hopefully more distributions. We put a heavy emphasis on making adoption easy for a reason... so it will be adopted.

Is there any kind of roadmap for TLO? If yes, is it possible to make it public? Like when CDSort can be expected to be enabled in TLO.

The unofficial roadmap is: 1) Update the binutils patches. After talking with RHEL they have some additional reqs. 2) CDSort.

According to the README, only ld and gold are supported for now. Do you have plans to support other linkers like lld and mold?

Yes and no. It's not on the roadmap now, but if we have users that require lld/mold, we absolutely can.

Do you plan to implement an instrumentation mode for the tool like BOLT did? There are many cases when LBR is not available (AMD added support only 3-4 years ago, ARM only in the last year, RISC-V doesn't support it at all yet, and virtualization limits LBR usage). In these cases, TLO usage is impossible right now.

Uncertain. At the moment no plans, but that may change.

A question about the save states. Since it's a file format, what about forward/backward compatibility guarantees for them? Is it guaranteed that I can use save states from an older TLO version with a newer one? What about using save states from a newer TLO with an older TLO (a case when we downgraded TLO for some reason)?

We will do our best to be backwards compatible. If we ever break it, it will be a major version change to TLO. I would expect we will be able to be 100% backward compatible.
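
One way to read that policy is sketched below. This is only an illustration under assumptions: the `Version` struct, the `Compat` enum, and `check_save_state` are hypothetical names, and nothing here describes TLO's real save-state format or loader. The sketch assumes a save state records the version that wrote it, that a change of major version marks a break, and that a file written by a newer tool is not guaranteed to load in an older one.

```cpp
#include <cstdint>

// Hypothetical version stamp stored in a save state (not TLO's real format).
struct Version {
    std::uint32_t major_ver;
    std::uint32_t minor_ver;
};

enum class Compat { Ok, TooOld, TooNew };

// Decide whether a tool at version `tool` may load a file written by `file`.
Compat check_save_state(Version file, Version tool) {
    // A different major version is assumed to mean a breaking format change.
    if (file.major_ver < tool.major_ver) return Compat::TooOld;
    if (file.major_ver > tool.major_ver) return Compat::TooNew;
    // Same major: older or equal minor versions load (backward compatible);
    // a newer minor may carry fields this tool does not know about.
    if (file.minor_ver > tool.minor_ver) return Compat::TooNew;
    return Compat::Ok;
}
```

Under these assumptions, a save state written by 1.0 loads in 1.2, while one written by 1.3 is rejected by 1.2, which corresponds to the downgrade case asked about above.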

Do you plan to support other profilers like Intel VTune?

Uncertain. Not on our roadmap now. Really depends on user demand.

I suggest enabling LTO + PGO + TLO optimization for TLO itself. It can help speed up processing of large Linux perf profiles.

When handling large profiles, most of the time is spent doing IO in perf itself.

Do you have more TLO benchmarks showing its efficiency in practice for different applications in different domains? Having only one example with Clang is not enough (IMHO). E.g. for BOLT I have this list. It is not huge either, but hopefully it can help with choosing other projects to test. Do you have more benchmark results from Clear Linux?

Let me get back to you on that.

The -h option returns exit code 1 instead of 0 - https://github.com/intel/thin-layout-optimizer/blob/main/src/main/main.cc#L337 . Is this intended behavior? According to this link, 1 should be used in error cases.

Will fix.

Having prebuilt binaries would let users just download and run the tool instead of compiling it locally. I think it can be done via GitHub Actions or any similar service. Nice side outcome: if you optimize TLO itself with LTO + PGO + TLO, it will be a great example of how TLO can be integrated in practice ;)

Will do.

I created the issue since Discussions are disabled for the repo for now.

Thank you in advance for answers.

Also, thank you for taking the time to read up on the project and ask these questions.

I pinned this issue as it asks a lot of very good questions, and I will update the answers going forward.

The -h option returns exit code 1 instead of 0 - https://github.com/intel/thin-layout-optimizer/blob/main/src/main/main.cc#L337 . Is this intended behavior? According to this link, 1 should be used in error cases.

Will fix.

Fixed in: https://github.com/intel/thin-layout-optimizer/commit/0b90310e10c2db3bc3ccd60d52db2ede67de9295
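
For readers unfamiliar with the convention in question, here is a minimal sketch of it. It is not the code from the linked commit: `run` and `print_usage` are hypothetical names and the usage text is made up. The idea is that explicitly requested help is a successful run (exit code 0, usage on stdout), while usage printed because of a bad option is an error (non-zero exit code, usage on stderr).

```cpp
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: writes a placeholder usage message to the given stream.
static void print_usage(std::ostream& out) {
    out << "usage: tool [-h] <args>\n";
}

// Returns the exit code for the given arguments (argv[1..] in a real main).
int run(const std::vector<std::string>& args) {
    for (const std::string& arg : args) {
        if (arg == "-h" || arg == "--help") {
            print_usage(std::cout);  // help was asked for: normal output
            return EXIT_SUCCESS;
        }
        if (!arg.empty() && arg[0] == '-') {
            std::cerr << "unknown option: " << arg << "\n";
            print_usage(std::cerr);  // usage shown because of an error
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;  // nothing to complain about in this sketch
}
```

A real `main(int argc, char** argv)` would forward `argv + 1 .. argv + argc` to `run` and return its result, so scripts that call the tool with `-h` do not see a spurious failure.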

Thank you for your answers!

That being said, if any other arch implemented something similar, we could support that as well.

Other architectures have similar technologies in different implementation stages:

ARM64 - it's called BRBE (Branch Record Buffer Extension): since ARMv9.2-A (2023), Linux 6.7-rc1 (2024). Related LWN article is here.
PowerPC - it's called BHRB (Branch History Rolling Buffer): since Power8 (~2013), Linux perf support is also in place.
RISC-V - it's called Control Transfer Records (CTR). Current status - under development. More information can be found here: GitHub, JIRA. There is no estimate for when it will be implemented.
e2k (Elbrus) - an analog to LBR is supported. Unfortunately, there is no support for that in Linux perf but there are plans to implement this feature in the future. Anyway - I don't think that you are interested in e2k architecture :)

I haven't checked the MIPS architecture yet - maybe something similar to LBR is available there too. However, I'm not sure whether having such technologies is enough for TLO.

So I would say at the moment there is still a fair bit of work that takes priority over supporting other architectures, but further support is definitely an eventual to-do.

Thank you for these resources.