differentiate MAX/MIN[U] by register arguments.

David-Horner commented 3 years ago

Forgive me if this was addressed previously [in which case please give me location of discussion].

In a similar manner to SGE[U], defined in terms of SLT[U] with the arguments exchanged, so too MAX[U] can be defined as MIN[U] with the arguments exchanged, where MIN[U] requires rs1 to be a higher register number than rs2.

Similarly, MIN[U] with rs1 and rs2 the same should be reserved.

I suspect there are other BIT operators that have similar unity operations and should be reserved, and similar reflexive operations that could be implemented by numerically ordering the operands.

We rightly simplified RVI to allow some redundancies, this was and is very prudent for the base operations. However, the designers did not make glaringly wasteful op map decisions such as this one for MAX[U]. This was partly because of a judicious avoidance of these more esoteric instructions.

However, as we move to "optimizations" we should also ensure optimizations of the opcode space and if that necessitates increased implementation cost, so be it. All optimizations come with a cost. We are at a place now where we can control and limit the impacts. [better than waiting for the unintended consequences because we failed to discriminate and limit those effects].

I will link here the issue in code-size-reduction issue : Towards quantifying Optimization: guidelines and principles. - None without disruption https://github.com/riscv/riscv-code-size-reduction/issues/24

kdockser commented 3 years ago

Hi David,

I have interspersed my questions and comments below.

Thanks, Ken

From: David-Horner notifications@github.com Sent: Tuesday, March 2, 2021 1:30 PM To: riscv/riscv-bitmanip riscv-bitmanip@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [riscv/riscv-bitmanip] differentiate MAX/MIN[U] by register arguments. (#121)

CAUTION: This email originated from outside of the organization.

Forgive me if this was addressed previously [in which case please give me location of discussion].

In a similar manner to SGE[U], defined in terms of SLT[U] with the arguments exchanged, so too MAX[U] can be defined as MIN[U] with the arguments exchanged, where MIN[U] requires rs1 to be a higher register number than rs2.

KAD: Are you suggesting that the semantics of MAX[U] and MIN[U] be changed based on the architected register’s ordinal numbers? For example, MIN X5, X6, X7 would return the minimum of X6 and X7 whereas MIX X5, X7, X6 would return the maximum? While that would reduce the overall opcode space, it would also remove the orthogonality of the registers and cause no end to confusion.

Similarly, MIN[U] with rs1 and rs2 the same should be reserved.

KAD: We don’t do this with and or or, why should we do this with MIN[U]?

I suspect there are other BIT operators that have similar unity operations and should be reserved, and similar reflexive operations that could be implemented by numerically ordering the operands.

We rightly simplified RVI to allow some redundancies, this was and is very prudent for the base operations. However, the designers did not make glaringly wasteful op map decisions such as this one for MAX[U]. This was partly because of a judicious avoidance of these more esoteric instructions.

However, as we move to "optimizations" we should also ensure optimizations of the opcode space and if that necessitates increased implementation cost, so be it. All optimizations come with a cost. We are at a place now where we can control and limit the impacts. [better than waiting for the unintended consequences because we failed to discriminate and limit those effects].

I will link here the issue in code-size-reduction issue : Towards quantifying Optimization: guidelines and principles. - None without disruption riscv/riscv-code-size-reduction#24https://github.com/riscv/riscv-code-size-reduction/issues/24

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/riscv/riscv-bitmanip/issues/121, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AI6QBRSR7XGBNO7WP3UR5Z3TBU4DZANCNFSM4YPTHVOQ.

David-Horner commented 3 years ago

On 2021-03-02 3:08 p.m., kdockser wrote:

Hi David,

I have interspersed my questions and comments below.

Thanks, Ken

From: David-Horner notifications@github.com Sent: Tuesday, March 2, 2021 1:30 PM To: riscv/riscv-bitmanip riscv-bitmanip@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [riscv/riscv-bitmanip] differentiate MAX/MIN[U] by register arguments. (#121)

CAUTION: This email originated from outside of the organization.

Forgive me if this was addressed previously [in which case please give me location of discussion].

In a similar manner to SGE[U], defined in terms of SLT[U] with the arguments exchanged, so too MAX[U] can be defined as MIN[U] with the arguments exchanged, where MIN[U] requires rs1 to be a higher register number than rs2.

KAD: Are you suggesting that the semantics of MAX[U] and MIN[U] be changed based on the architected register’s ordinal numbers? yes. For example, MIN X5, X6, X7 would return the minimum of X6 and X7 whereas MIX X5, X7, X6 would return the maximum? In effect yes, but MIN X5, X7, X6 would not actually be valid. While that would reduce the overall opcode space, it would also remove the orthogonality of the registers true. and cause no end to confusion. I disagree with this prognostication. I beilve the confusion [and debate] although initially prevalent will be constrained, and we can actively constrain the confusion. [ perhaps never the debate].

Similarly, MIN[U] with rs1 and rs2 the same should be reserved. ditto.

KAD: We don’t do this with and or or, why should we do this with MIN[U]?

Shortsightedness, expedience and practicality.

We could at least have reserved the OR where rs1 >= rs2. [reserved for AND and XOR as well, for use as NOR and NAND and EQU for example.]

And certainly, OR [and AND] with rs1 = rs2 is even now reasonably at least partially recoverable. They become moves,

And even more so when rd = rs1 = rs2 and they are noop, and can at least be used as hints.

But for RVI, the base definition, simplicity is a substantial asset.

The redundant formulations could reasonably have been reserved [shortsightedness].

Circuity is simpler with the redundancies - Expedience.

Limiting debate on the architecture by being uncontroversial is very practical.

I suspect there are other BIT operators that have similar unity operations and should be reserved, and similar reflexive operations that could be implemented by numerically ordering the operands.

We rightly simplified RVI to allow some redundancies, this was and is very prudent for the base operations. However, the designers did not make glaringly wasteful op map decisions such as this one for MAX[U]. This was partly because of a judicious avoidance of these more esoteric instructions.

However, as we move to "optimizations" we should also ensure optimizations of the opcode space and if that necessitates increased implementation cost, so be it. All optimizations come with a cost. We are at a place now where we can control and limit the impacts. [better than waiting for the unintended consequences because we failed to discriminate and limit those effects].

I will link here the issue in code-size-reduction issue : Towards quantifying Optimization: guidelines and principles. - None without disruption riscv/riscv-code-size-reduction#24https://github.com/riscv/riscv-code-size-reduction/issues/24

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/riscv/riscv-bitmanip/issues/121, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AI6QBRSR7XGBNO7WP3UR5Z3TBU4DZANCNFSM4YPTHVOQ.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/riscv/riscv-bitmanip/issues/121#issuecomment-789180045, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFAWIKPDAC6CBZ5IX3AUURDTBVAUHANCNFSM4YPTHVOQ.

rdolbeau commented 3 years ago

@David-Horner @kdockser

Currently with B to compute the absolute of difference [1]:

MIN x3,x1,x2
MAX x4,x1,x2
SUB x4,x4,x3

What happens if you need to order the registers ? Might need to move all the data around:

MIN x3,x1,x2
MV x6,x1
MV x5,x2
MAX x4,x6,x5
SUB x4,x4,x3

of course you can hope the compiler will be smart enough to not use x1 so it can use only one move:

MIN x3,x10,x11
MV x5,x11
MAX x4,x10,x5
SUB x4,x4,x3

In all case, it makes for worse code if you need both MIN and MAX, and it puts a lot of extra burden on the register allocator. Which is not a good idea, from a compiler's write point of view. Or from an implementer's point of view, as it adds another weird case to the decoding.

Please don't over-optimize the encoding to the point the ISA is no longer regular/orthogonal enough for compilers and simple code patterns.

[1] Someone cares about the 32-bits version: https://github.com/riscv/riscv-p-spec/issues/38

Edit: got the order wrong in MAX (I think - hopefully I got it right now...)

David-Horner commented 3 years ago

I would appreciate having these discussions in the context of the larger issue:

I started an "issue" related to optimizations and would appreciate any help fleshing out or refining that issue:

Towards quantifying Optimization: guidelines and principles. - None without disruption https://github.com/riscv/riscv-code-size-reduction/issues/24

https://github.com/riscv/riscv-code-size-reduction/issues/24

On 2021-03-03 4:20 a.m., Romain Dolbeau wrote:

@David-Horner https://github.com/David-Horner @kdockser https://github.com/kdockser

Currently with |B| to compute the absolute of difference [1]:

|MIN x3,x1,x2 MAX x4,x1,x2 SUB x4,x4,x3 |

What happens if you need to order the registers ?

It is a given that the compiler will definitely need to work harder.

Might need to move all the data around:

Emphasis on might. What of the typical use case?

There we can leverage the redundancy already built into the RVI base.

|MIN x3,x1,x2 MV x6,x1 MV x5,x2 MAX x4,x5,x6 SUB x4,x4,x3 |

of course you can hope the compiler will be smart enough to not use |x1| so it can use only one move:

The compiler should be smart enough not to use x1. Partially because of its fortuitous use as ra [return address]

[good point that range of potential candidates is restricted, but with a 31 X register set, there is still considerable freedom.]

|MIN x3,x10,x11 MV x5,x11 MAX x4,x5,x10 SUB x4,x4,x3 |

In all case, it makes for worse code if you need both |MIN| and |MAX|,

I appreciate the reference to the worst case analysis. It is important to examine all cases including edge/fringe.

But for this situation we do not need both |MIN| and |MAX|:

|SUB x3,x1,x2 SUB x4,x2,x1 MAX x4,x3,x4 |

will do the required operation.

Superoptimization would find this and many other alternate ways of rearranging registers to locally optimize the code.

Note: Unlike ADD which is commutative, SUB is not , and so both versions of SUB would be reasonable for RVI, even if we were to have applied these "rules" back in the primordial days.

and it puts a lot of extra burden on the register allocator.

Yes. No optimization is free.

Which is not a good idea, from a compiler's write point of view.

True. Always tradeoffs.

However, tool-chain can continue to evolve even when hardware is years old and invariant.

I don't want to call it a good idea, but at least its time has come.

Or from an implementer's point of view, as it adds another weird case to the decoding.

Not necessarily. Reserved only means your simplified implementation is non-conformant for potential future extensions.

Please don't over-optimize the encoding to the point the ISA is no longer regular/orthogonal enough for compilers

That isn't likely to happen. A brief look at x86 demonstrates that the "regular/orthogonal enough" bar is extremely low.

Compilers can tolerate, as you pointed out, by register shuffle.

Superoptimization will help eliminate the register shuffle.

I have no special influence, and there will be those, such as yourself, that will push back when the tipping point is approached.

and simple code patterns.

I don't see the importance of this objective.

Simple code patterns are being abandoned without fanfare nor alarm by many optimizations compilers are implementing.

In addition, RVC is the obvious optimization example. Yes, there have been nay-sayers, but the benefits counter the lack of regular/orthogonal definition.

[1] Someone cares about the 32-bits version: riscv/riscv-p-spec#38 https://github.com/riscv/riscv-p-spec/issues/38

OK. I will concede that was your motivation for you comments there.

However, when I reviewed it before your disclosure, and even now in rereading, I could not ascertain your mind-view.

There were at least two distinct concerns/situations in that single issue. Which one characterizes your concern about wasting 32bit encoding space?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/riscv/riscv-bitmanip/issues/121#issuecomment-789566607, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFAWIKLZABON5U2OXAWW3P3TBX5MVANCNFSM4YPTHVOQ.

rdolbeau commented 3 years ago

I appreciate the reference to the worst case analysis. It is important to examine all cases including edge/fringe. But for this situation we do not need both |MIN| and |MAX|: |SUB x3,x1,x2 SUB x4,x2,x1 MAX x4,x3,x4 | will do the required operation.

Sorry, should have use MINU/MAXU in there to make it more obvious. Using sub/sub/maxu is not a solution for unsigned (0x1-0x0->0x1, 0x0-0x1->0xFFFFFFFF, maxu(0x1,0xFFFFFFFF)->0xFFFFFFFF which is not the answer, 0x1 is).

There is also some interesting corner cases for signed (i.e. (0x80000000-0x0->0x80000000, 0x0-0x80000000 -> 0x80000000, max will be 0x80000000 which is negative so not an absolute value).

David-Horner commented 3 years ago

On 2021-03-03 9:03 a.m., Romain Dolbeau wrote:

I appreciate the reference to the worst case analysis. It is
important to examine all cases including edge/fringe. But for this
situation we do not need both |MIN| and |MAX|: |SUB x3,x1,x2 SUB
x4,x2,x1 MAX x4,x3,x4 | will do the required operation.
and be shorter in RVC code size : 64bits ( 16, 16 and 32) vs 80 (32, 32 and 16).

Sorry, should have use MINU/MAXU in there to make it more obvious. Using sub/sub/maxu is not a solution for unsigned (0x1-0x0->0x1, 0x0-0x1->0xFFFFFFFF, maxu(0x1,0xFFFFFFFF)->0xFFFFFFFF which is not the answer, 0x1 is).

True. So adding the single move for this use case it will work. Still a low overhead, as C.MV is 16bit instruction in addition to an already 80bit sequence .

Implementations can and will fuse such C.MV MAXU [C.MV MINU] sequences if they are frequent, so we can get performance back as well.

The arguments you are making are worth raising, but my responses are exactly the same as those used by Krste and Andrew to justify the RVI when similarly challenged for inefficiency.

But don't forget that this "local MINU/MAXU optimization" use is in the context of a larger program flow that the compilers will advance to improve, undoubtedly optimizing out many of those moves.

RVI already empowered the tool-chain by providing the flexibility/redundancy there. It is present [for better or worse] so we should leverage it for the greater good in all new extensions.

There is also some interesting corner cases for signed (i.e. (0x80000000-0x0->0x80000000, 0x0-0x80000000 -> 0x80000000, max will be 0x80000000 which is negative so not an absolute value).

agreed.

But is this relevant to the discussion at hand?

The simplest approach is to provide a branch detecting the exception case, as explained in Vol I to detect overflow. This works for both formulations.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/riscv/riscv-bitmanip/issues/121#issuecomment-789736078, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFAWIKIDBWD2ZDNPWWDCV2DTBY6S7ANCNFSM4YPTHVOQ.

topperc commented 3 years ago

Maybe I'm missing something, but I understood the proposal was to merge the MIN[U] and MAX[U] opcodes and use the register ordinals to distinquish the operation

So wouldn't

MINU x3,x1,x2
MAXU x4,x1,x2
SUB x4,x4,x3

Just need to be written as

CombinedMINUMAXUOpcode x3,x2,x1 // rs1 > rs2 -> min operation
CombinedMINUMAXUOpcode x4,x1,x2 // rs1 < rs2 -> max operation
SUB x4,x4,x3

Obviously we'd keep the MINU/MAXU mnemonics for readablity. I only merged the opcode name for illustration.

kdockser commented 3 years ago

Hi Craig,

There was a proposal to change the semantics of this and other bitmanip instructions to vary based on the ordinal relationship of the registers. While that would reduce the already small opcode footprint into something smaller, it would also make the code much harder to read, write, and debug. Many have gone down this path before, and they have learned from it.

It is better for us to work on providing new standard instructions that reduce the code footprint while increasing performance rather than to try to reduce these small opcode footprints.

The Zb[abcs] instructions, encodings, and mnemonics have already been approved and will be heading for public review as soon as we iron out some issues with the auto-generation of the AsciiDoc spec for public review.

Thanks, Ken

From: Craig Topper @.> Sent: Friday, April 2, 2021 3:02 PM To: riscv/riscv-bitmanip @.> Cc: Ken Dockser @.>; Mention @.> Subject: Re: [riscv/riscv-bitmanip] differentiate MAX/MIN[U] by register arguments. (#121)

CAUTION: This email originated from outside of the organization.

Maybe I'm missing something, but I understood the proposal was to merge the MIN[U] and MAX[U] opcodes and use the register ordinals to distinquish the operation

So wouldn't

MINU x3,x1,x2

MAXU x4,x1,x2

SUB x4,x4,x3

Just need to be written as

CombinedMINUMAXUOpcode x3,x2,x1 // rs1 > rs2 -> min operation

CombinedMINUMAXUOpcode x4,x1,x2 // rs1 < rs2 -> max operation

SUB x4,x4,x3

Obviously we'd keep the MINU/MAXU mnemonics for readablity. I only merged the opcode name for illustration.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/riscv/riscv-bitmanip/issues/121#issuecomment-812688716, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AI6QBRUQKIDZHYTGRBG3O3DTGYPDZANCNFSM4YPTHVOQ.

topperc commented 3 years ago

Hi Ken,

I was just trying to understand the proposal and how a compiler would implement it if the extra moves in the discussion where indeed required. I did not mean to imply that I thought the proposal should be implemented.

David-Horner commented 3 years ago

On 2021-04-02 4:34 p.m., Craig Topper wrote:

Hi Ken,

I was just legitimately trying to understand the proposal and how a compiler would implement it if the extra moves in the discussion where indeed required.

I certainly appreciate your concern to understand the issue before deciding yeah or nay.

On 2021-04-02 4:20 p.m., kdockser wrote:

There was a proposal to change the semantics of this and other bitmanip instructions to vary based on the ordinal relationship of the registers.

There is still this proposal open.

Please direct me to the previous discussions on this issue. That was my request in the opening line of this issue:

Forgive me if this was addressed previously [in which case please give me location of discussion].

While that would reduce the already small opcode footprint into something smaller, it would also make the code much harder to read, write, and debug.

Can you quantify the degree of "much harder"?

x86 , x186, x286, 386, 486 and 586 all survived well with quirky "special" format instructions.

Granted there was an evolution to increasingly orthogonal instructions, but this "waste of opcode space" was in a context of an already complex multibyte variable length instruction decoding. RISCV can substantially benefit from miserly allocation by RISCV International reserved opcode space. As custodians the organization must consider not only current pressures for instruction expansion, but also future, even as yet unseen, extensions which should not be penalized because others crossed the line first.

Many have gone down this path before, and they have learned from it.

Can you give examples? Any papers of lessons learned?

Are they immediately applicable to RISCV in our current "enlightened" environment.

It is better for us to work on providing new standard instructions that reduce the code footprint while increasing performance rather than to try to reduce these small opcode footprints. These goals are not mutually exclusive, nor are the related activities.

The Zb[abcs] instructions, encodings, and mnemonics have already been approved

I was unaware that formal approval was issued.

When? And where is the announcement?

Was it prior to me opening this issue?

and will be heading for public review as soon as we iron out some issues with the auto-generation of the AsciiDoc spec for public review.

Public Review allows and anticipates questions and concerns will be raised, and there is a mandate to address those questions, concerns and issues before ratification can proceed.

It is better that the issues be addressed here, on the merits with substantiated evidence and reason, rather than push to review with open issues.

Thanks, Ken

kdockser commented 3 years ago

Hi David,

While it is true that X86 has survived with quirky encodings – it was the IBM PC and the need for backward compatibility that allowed it to survive despite its warts. However, you don’t see X86 in cell phones – RISC (and more specifically ARM) was the winner in that domain.

There are no new viable X86-like ISAs. The decoding is too complex which unnecessarily burns power and adds delay.

We were working towards the freezing of Zb[abcs] late last year. We froze them in January of this year. This was discussed on this mail list. We are in the process of creating a document of these instructions for public review – this is taking longer than expected due to the conversion to AsciiDoc. We hope to have these SNAFUs resolved soon and then we will release the document and announce the beginning of public review.

Cheers, Ken

From: David-Horner @.> Sent: Saturday, April 3, 2021 12:12 AM To: riscv/riscv-bitmanip @.> Cc: Ken Dockser @.>; Mention @.> Subject: Re: [riscv/riscv-bitmanip] differentiate MAX/MIN[U] by register arguments. (#121)

CAUTION: This email originated from outside of the organization.

On 2021-04-02 4:34 p.m., Craig Topper wrote:

Hi Ken,

I was just legitimately trying to understand the proposal and how a compiler would implement it if the extra moves in the discussion where indeed required.

I certainly appreciate your concern to understand the issue before deciding yeah or nay.

On 2021-04-02 4:20 p.m., kdockser wrote:

There was a proposal to change the semantics of this and other bitmanip instructions to vary based on the ordinal relationship of the registers.

There is still this proposal open.

Please direct me to the previous discussions on this issue. That was my request in the opening line of this issue:

Forgive me if this was addressed previously [in which case please give me location of discussion].

While that would reduce the already small opcode footprint into something smaller, it would also make the code much harder to read, write, and debug.

Can you quantify the degree of "much harder"?

x86 , x186, x286, 386, 486 and 586 all survived well with quirky "special" format instructions.

Granted there was an evolution to increasingly orthogonal instructions, but this "waste of opcode space" was in a context of an already complex multibyte variable length instruction decoding. RISCV can substantially benefit from miserly allocation by RISCV International reserved opcode space. As custodians the organization must consider not only current pressures for instruction expansion, but also future, even as yet unseen, extensions which should not be penalized because others crossed the line first.

Many have gone down this path before, and they have learned from it.

Can you give examples? Any papers of lessons learned?

Are they immediately applicable to RISCV in our current "enlightened" environment.

It is better for us to work on providing new standard instructions that reduce the code footprint while increasing performance rather than to try to reduce these small opcode footprints. These goals are not mutually exclusive, nor are the related activities.

The Zb[abcs] instructions, encodings, and mnemonics have already been approved

I was unaware that formal approval was issued.

When? And where is the announcement?

Was it prior to me opening this issue?

and will be heading for public review as soon as we iron out some issues with the auto-generation of the AsciiDoc spec for public review.

Public Review allows and anticipates questions and concerns will be raised, and there is a mandate to address those questions, concerns and issues before ratification can proceed.

It is better that the issues be addressed here, on the merits with substantiated evidence and reason, rather than push to review with open issues.

Thanks, Ken

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/riscv/riscv-bitmanip/issues/121#issuecomment-812813175, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AI6QBRSYKTF7EBC5VLTJKBDTG2PTHANCNFSM4YPTHVOQ.

kdockser commented 3 years ago

While all optimizations come at some cost, we also need to look at the benefit and especially the benefit/cost efficiency. 16-bit opcode space is at a premium and needs to be allocated very very carefully. 32-bit opcode space is relatively plentiful, but still needs to be allocated carefully. The space savings that can be had from changing instruction semantics based on the ordinal values of the registers used doesn't appear to outweigh the cost of decode complexity. Being more selective about which instructions get added to the architecture will yield better dividends. I am closing this issue, but we can continue to discuss it elsewhere.

riscv / riscv-bitmanip

differentiate MAX/MIN[U] by register arguments. #121