Feedback for v0.1.3 - Githubissues

snehasish commented 1 year ago

Chapter 1. Introduction

"in-target use": Can we clarify what this means? My understanding is that it tries to distinguish from trace where the consumers are "external"?
"internal chip storage": Since we debated the options and decided on CSRs should we update this text too? We do mention CSRs when describing the CTR array.

Chapter 2. CSRs

"when V=0, ..": I believe this refers to virtualization? Can we clarify this since later on in 3.1 we also introduce a valid bit V with the same notation. Though I do see most of the uses of the latter also include the source register , eg. ctrsource.V
"the CSR number is 0x381": What is the implication of this? I think this means that software such as perf will expect all implementations to use CSR 0x381 to refer to e.g. mctrcontrol. Can we call this out explicitly? Also a table which holds all the relevant CSRs in one place might be a good reference.
Can we add a footnote which links to the definitions of WARL (Write Any, Read Legal), WLRL (Write Legal, Read Legal). Searching online brings up discussions and requests for clarification.
[nitpick] Can we keep the EN bits together and the INH bits grouped together? Is there a rationale for the ordering of bits for transfer type filtering today? I wonder if it makes software easier to interpret.
typo missing word in 2.4 "but must sufficient to ..."

Chapter 3. Entry Registers

Transfer type field - What is the intention for reserving 0110 and 0111?
Transfer type field - Can we clarify "other indirect jump" and "other direct jump". How do these differ from indirect jump and direct jump? (I see that these line up with the E-Trace itype encodings, it would be good to add more details here so this is somewhat self-contained).
Should we add separate filtering for jumps that modify the RAS? Chapter 2.5 in the unprivileged RISC-V spec states "A JAL instruction should push the return address onto a return-address stack (RAS) only when rd=x1/x5. JALR instructions should push/pop a RAS as shown in the Table 2.1". In particular I'm interested in understanding whether we can filter tail calls (tail offset pseudo instruction Table 25.3 in the unprivileged RISC-V spec).

Chapter 4. State enable access control Can we merge this with Chapter 6 Discovery? They seem related to me.

Chapter 5. Behaviour

Cycle Counting: Instead of highlighting just (2^12 + CCM) << CCE-1 can we put the entire heuristic including the direct interpretation when CC count < 64K. I like how ARM specifies it in their documentation here. It's easy to miss the latter case spelled out in the text.
RAS Emulation Mode: As mentioned previously, can we support pseudo-instructions such as tail calls to subroutines? From a profiling tools perspective this would be a major improvement compared to existing transfer recording mechanisms.

bcstrongx commented 1 year ago

Chapter 1. Introduction

"in-target use": Can we clarify what this means? My understanding is that it tries to distinguish from trace where the consumers are "external"?

I just deleted the reference to "in-target use". It was meant to distinguish between self-hosted use vs probe debugger use, but I think it's unnecessary.

"internal chip storage": Since we debated the options and decided on CSRs should we update this text too? We do mention CSRs when describing the CTR array.

I changed it to say "register-accessible internal chip storage". We use S*csrind to access the state, but the storage could be an array, SRAM, whatever.

Chapter 2. CSRs

"when V=0, ..": I believe this refers to virtualization? Can we clarify this since later on in 3.1 we also introduce a valid bit V with the same notation. Though I do see most of the uses of the latter also include the source register , eg. ctrsource.V

I added clarification with the first reference to V=0 and V=1. For background, this terminology comes from the priv spec, in the definition of the Hypervisor (H) extension:

When V=1, the hart is either in virtual S-mode (VS-mode), or in virtual U-mode (VU-mode) atop a guest OS running in VS-mode. When V=0, the hart is either in M-mode, in HS-mode, or in U-mode atop an OS running in HS-mode.

"the CSR number is 0x381": What is the implication of this? I think this means that software such as perf will expect all implementations to use CSR 0x381 to refer to e.g. mctrcontrol. Can we call this out explicitly? Also a table which holds all the relevant CSRs in one place might be a good reference.

The table I can add, but I'm not sure what could be more explicit than what is stated there.

Can we add a footnote which links to the definitions of WARL (Write Any, Read Legal), WLRL (Write Legal, Read Legal). Searching online brings up discussions and requests for clarification.

Added a non-normative statement atop the CSR chapter.

[nitpick] Can we keep the EN bits together and the INH bits grouped together? Is there a rationale for the ordering of bits for transfer type filtering today? I wonder if it makes software easier to interpret.

All the *INH bit offsets (from bit 32) are based on the E-trace itype encoding. Not-taken branches are encoding 4, hence NTBREN is bit 36. The adjacent encodings are defined as other things. For ETEN, which has no equivalent in the E-trace encodings, I picked an offset that is reserved. So I don't see a good way to put them together.

typo missing word in 2.4 "but must sufficient to ..."

Fixed.

Chapter 3. Entry Registers

Transfer type field - What is the intention for reserving 0110 and 0111?

Those are unused by E-trace (at least, when using a 4-bit itype encoding)

Transfer type field - Can we clarify "other indirect jump" and "other direct jump". How do these differ from indirect jump and direct jump? (I see that these line up with the E-Trace itype encodings, it would be good to add more details here so this is somewhat self-contained).

I toyed with this, but can't come up with a way to summarize without essentially duplicating what's in the E-trace spec. I would be nice if we had html specs, and I could more easily reference the same source info. It is weird that, once this is integrated into the priv spec, it will have a dependency on a non-ISA spec. I'll ask ARC how to deal with this.

Should we add separate filtering for jumps that modify the RAS? Chapter 2.5 in the unprivileged RISC-V spec states "A JAL instruction should push the return address onto a return-address stack (RAS) only when rd=x1/x5. JALR instructions should push/pop a RAS as shown in the Table 2.1". In particular I'm interested in understanding whether we can filter tail calls (tail offset pseudo instruction Table 25.3 in the unprivileged RISC-V spec).

We already can filter on instructions that modify the RAS. A user can inhibit all types other than calls, returns, and co-routine swaps to get that behavior.

Unfortunately, I don't know of any calling convention for tail-calls. That table has been removed from more recent revisions of the unpriv spec.

Chapter 4. State enable access control Can we merge this with Chapter 6 Discovery? They seem related to me.

State enable is certainly relevant to discovery, but the Smstateen extension was actually intended more as a way for privileged SW to hide capabilities of which it is unaware, and thus won't know to context switch. Imagine a CTR-unaware OS running on newer HW that supports CTR.

Chapter 5. Behaviour

Cycle Counting: Instead of highlighting just (2^12 + CCM) << CCE-1 can we put the entire heuristic including the direct interpretation when CC count < 64K. I like how ARM specifies it in their documentation here. It's easy to miss the latter case spelled out in the text.

Switched to: CCE==0? CCM : ((2^12^ + CCM) << CCE-1)

RAS Emulation Mode: As mentioned previously, can we support pseudo-instructions such as tail calls to subroutines? From a profiling tools perspective this would be a major improvement compared to existing transfer recording mechanisms.

As mentioned above, there doesn't seem to be a convention that would allow us to uniquely identify tail-calls. But how are you thinking we would use them? Treat them like a call and record them, but track the WRPTR at the time of the last true call so that the subsequent RET pops the original callee plus all the tail-call entries?

bcstrongx commented 1 year ago

I don't seem to be able to add you as a reviewer for the PR, but please have a look

bcstrongx commented 1 year ago

Added another commit to the PR

snehasish commented 1 year ago

there doesn't seem to be a convention that would allow us to uniquely identify tail-calls

I see the text and the table on page 27 of the latest draft. Can we leverage the implicit hinting to update the CTR just like a RAS implementation?

But how are you thinking we would use them? Treat them like a call and record them, but track the WRPTR at the time of the last true call so that the subsequent RET pops the original callee plus all the tail-call entries?

Yes and handle overflow where we need to invalidate all CTR records.

bcstrongx commented 1 year ago

I see the text and the table on page 27 of the latest draft. Can we leverage the implicit hinting to update the CTR just like a RAS implementation?

Yes, that calling convention is how we bucket JAL/JALR instructions into the transfer types. Pop=Return, Push=Call, and "Pop, then Push"=Co-routine Swap. And the actions taken on the CTR stack when RASEMU=1 match the RAS actions listed in that table.

Yes and handle overflow where we need to invalidate all CTR records.

Unfortunately it doesn't seem that RISC-V adopted a convention to identify tail-calls the way they did for calls and returns. But even if they did, there would be implementation complexity in making the return pop all the tail-call entries. Today each update touches only one entry.

I think tail calls are usually direct and unconditional, so could the tools not follow them even without help from CTR? For instance, if the last call in CTR is to function foo(), but we get a sample in function bar(), could the tools not follow foo's tail-call (and any nested tail calls) until it reaches bar()?

snehasish commented 1 year ago

Ah, I get it now. I missed that you were referring to the table which contained the tail offset pseudo-instruction.

Unfortunately it doesn't seem that RISC-V adopted a convention to identify tail-calls the way they did

Do you think this is feasible in the future? Any idea why the recommended lower for the pseudo-instruction was omitted in the recent update to the unprivileged spec?

I think tail calls are usually direct and unconditional,

Yes

so could the tools not follow them even without help from CTR? For instance, if the last call in CTR is to function foo(), but we get a sample in function bar(), could the tools not follow foo's tail-call (and any nested tail calls) until it reaches bar()?

There may be intermediate tail jumps from the parent function foo() to other functions before it reaches bar(). Exploring the nested tail calls may lead to multiple potential paths to bar(). Consider this example from a protobuf parser. A sample in _upb_FastDecoder_TagDispatch could have come directly via the tail call (enforced by the UPB_MUSTTAIL attribute) at L100 or indirectly via another tail call at L95.

On a separate note: In the current version of the spec for cycle counting we note that the CCV may be invalid on power state transitions. I'm wondering if we want to make a broader statement about the CTR itself. Clarifying that we leave it to implementations could be an option too. As far as I know Intel does not preserve LBRs C-state transitions.

bcstrongx commented 1 year ago

Do you think this is feasible in the future? Any idea why the recommended lower for the pseudo-instruction was omitted in the recent update to the unprivileged spec?

This is probably a better question for a software person, but we'd have to get a new convention defined, get compilers to use it, then recompile all the existing code. I don't know if it's "too late", but sure seems like it would have been easier to do earlier. Andrew Waterman would probably be a good person to ask about why it was removed.

Thanks for the example, with 2 separate tail-calls in a function then I agree, we'd need CTR (or something) to resolve it.

On a separate note: In the current version of the spec for cycle counting we note that the CCV may be invalid on power state transitions. I'm wondering if we want to make a broader statement about the CTR itself. Clarifying that we leave it to implementations could be an option too. As far as I know Intel does not preserve LBRs C-state transitions.

I agree. LBRs are preserved in C1, but not the deeper C6. I'll add something about this.

riscv / riscv-control-transfer-records

Feedback for v0.1.3 #3