
PolarFire SoC Documentation

Accessing CSRs / Hardware Performance Monitor Info #18

Closed dawsfox closed 2 years ago

dawsfox commented 2 years ago

Hi, I am not sure if this is the correct repo to leave this issue on, so let me know if I should put it elsewhere.

I am trying to test the bandwidth of the LSRAM module included in the release, which can be accessed from the Linux OS through the /opt/microchip/fpga-fabric-interfaces directory (incidentally, I'm also having a hard time finding where this code is included in a repo...). Timing with clock() has shown itself to be insufficient: after adding some clock() calls, the example reports fewer than 100 clock ticks to write 1024 elements to the LSRAM. For more detailed benchmarking, and to test the practical bandwidth between the MSS and the LSRAM module in the FPGA, I would like to access the CSRs that are mentioned in the MSS Technical Reference Manual as part of the Hardware Performance Monitor. However, I can't find any information about how those CSRs, or that information, might be exposed from the perspective of user code.
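
For reference, a minimal sketch of the kind of coarse clock()-based timing described above; time_writes, lsram, and count are illustrative names rather than part of the shipped example, and lsram is assumed to be a pointer obtained by mmap()'ing the UIO device:

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <time.h>

/* Illustrative sketch: wrap the LSRAM write loop with clock() calls.
 * clock() has coarse resolution, which is why fewer than 100 ticks
 * are reported for 1024 writes in the example above. */
static void time_writes(volatile uint32_t *lsram, size_t count)
{
    clock_t start = clock();
    for (size_t i = 0; i < count; i++)
        lsram[i] = (uint32_t)i;          /* fill LSRAM with a test pattern */
    clock_t end = clock();

    printf("wrote %zu words in %ld clock ticks\n", count, (long)(end - start));
}
```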

hughbreslin commented 2 years ago

Hey @dawsfox 🙂

The example code is available in the Linux Examples repo here.

In terms of the CSRs and accessing them from user code, this isn't something we support yet, but we are having a look into it. We'll hopefully be able to provide a driver to allow access to and support for the performance counters. Would you be able to use one of the other timers in the MSS, or a counter in the fabric, to achieve this goal in the meantime?
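
As a hedged interim alternative that avoids the CSRs entirely, elapsed wall-clock time can also be measured from user space on Linux with clock_gettime(); the helper name below is illustrative:

```c
#include <stdint.h>
#include <time.h>

/* Nanosecond-resolution wall-clock timestamp from user space.
 * Does not touch any CSRs, so it works regardless of counter
 * delegation settings. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* usage:
 *   uint64_t t0 = now_ns();
 *   ... code under test ...
 *   uint64_t elapsed_ns = now_ns() - t0;
 */
```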

dawsfox commented 2 years ago

Thanks for your response @hughbreslin ! While looking around for the timers I realized the RISC-V ISA includes pseudo-instructions to read the CSRs, and I have exposed their values through inline assembly like so (this one is for the number of instructions retired): static inline uint32_t rdret() { uint32_t val; asm volatile ("rdinstret %0" : "=r" (val)); return val; }

Then using them like: uint32_t ret_beg = rdret(); /* whatever code you're testing */ uint32_t ret_end = rdret(); uint32_t inst_retired = ret_end - ret_beg;
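
For reference, a self-contained, compilable version of the helper above might look like the following; uint64_t is used here on the assumption that the PolarFire SoC application harts are RV64, so the full 64-bit counter can be read directly:

```c
#include <stdint.h>

/* Instructions retired since reset; rdinstret is a standard RISC-V
 * pseudo-instruction for reading the instret CSR. */
static inline uint64_t rdret(void)
{
    uint64_t val;
    asm volatile ("rdinstret %0" : "=r"(val));
    return val;
}

/* usage, as in the snippet above:
 *   uint64_t ret_beg = rdret();
 *   ... code under test ...
 *   uint64_t ret_end = rdret();
 *   uint64_t inst_retired = ret_end - ret_beg;
 */
```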

Is there any reason I should assume this result is undefined/unreliable on the PolarFire SoC?

dawsfox commented 2 years ago

Also, as a follow-up question: one of the reasons I wanted a better method of timing is that the LSRAM example seems to show that writing 1024 elements to LSRAM is about 4-5x faster than reading 1024 elements from it. Assuming that the LSRAM isn't a cacheable region, I thought maybe the writes were benefiting from the write-combining buffer, so I implemented a version that does the same writes and reads with a stride of 32 elements (I also tried 64). I also thought that perhaps, at the point the timing was recorded post-writes, the writes had all been issued but not necessarily completed, so I tried another inline assembly function, this time using the RISC-V FENCE instruction after all writes and before all reads. Neither of these showed any noticeable change in how long the writes/reads took. I was just wondering if you had any information or intuition about why this discrepancy in speed between reads and writes is showing up.
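
For reference, a minimal sketch of how the FENCE instruction described above can be issued from C inline assembly; full_fence is an illustrative name, not part of the example code:

```c
/* Full RISC-V fence: orders all earlier device and memory accesses
 * before all later ones. This is the conservative choice when trying
 * to ensure writes have completed before timing is recorded. */
static inline void full_fence(void)
{
    asm volatile ("fence iorw, iorw" ::: "memory");
}
```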

hughbreslin commented 2 years ago

Oh cool! Nice idea :)

Em, I don't think the values returned would be inaccurate, but it might be safer to use rdcycle for this; if there was a stall or a wfi it would affect the retired counter.
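
For reference, a cycle-counter counterpart to the rdret() helper above, using the rdcycle pseudo-instruction as suggested here, might look like this (rdcyc is an illustrative name):

```c
#include <stdint.h>

/* Clock cycles elapsed since reset, via the standard RISC-V rdcycle
 * pseudo-instruction; counts cycles rather than retired instructions,
 * so stalls are reflected in the measurement. */
static inline uint64_t rdcyc(void)
{
    uint64_t val;
    asm volatile ("rdcycle %0" : "=r"(val));
    return val;
}
```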

To be honest I'm not sure why you're seeing the discrepancy; can you let me know whether you're doing this using the PDMA or reading/writing directly from one of the harts?

dawsfox commented 2 years ago

I have now seen different results by putting the FENCE instruction after each individual write; now writes show ~130,000 cycles compared to ~100,000 cycles for reads. I am not sure how much of this is due to the overhead of the various fences, or whether I'm misunderstanding the effect of the FENCE instructions. I'm assuming this means multiple write instructions to different elements can be in flight at the same time in the absence of FENCE instructions. I know the RISC-V ISA specification indicates a relaxed memory model, but I am entirely ignorant of the actual memory-consistency model that's present on the PolarFire SoC.

As to your question, I'm not sure if it's PDMA or directly from the hart. I've tried to follow the code back as far as I can, and I've found that the fpga_lsram UIO entry seems to use the uio_pdrv_genirq driver (I identified it as uio0; the info can be found in the /sys/class/uio/uio0 directory, and the driver file is a link to uio_pdrv_genirq). To me, nothing in that file indicates that PDMA is being used, but I'm very inexperienced with Linux driver code.

hughbreslin commented 2 years ago

Hey @dawsfox I'm still having a look into this so I don't have much of an update. My one main question at the moment is: do you need to do a fence after each individual write, or would it be OK in your use case to do a fence after the whole write sequence is complete? The data could be cached in the L1 cache of a hart depending on how the transfer takes place, and doing a fence after each write would definitely have a bigger overhead than doing one fence at the end.
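
For reference, a hedged sketch of the two fencing strategies being compared here; lsram and count are illustrative names and not part of the LSRAM example code:

```c
#include <stdint.h>
#include <stddef.h>

/* One fence after the whole write sequence, as suggested above. */
static void write_then_fence(volatile uint32_t *lsram, size_t count)
{
    for (size_t i = 0; i < count; i++)
        lsram[i] = (uint32_t)i;
    asm volatile ("fence iorw, iorw" ::: "memory");   /* single fence at the end */
}

/* Fence after every single write: much higher overhead, used here only
 * to probe how many writes can be in flight at once. */
static void write_fence_each(volatile uint32_t *lsram, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        lsram[i] = (uint32_t)i;
        asm volatile ("fence iorw, iorw" ::: "memory");   /* fence per write */
    }
}
```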

dawsfox commented 2 years ago

For my purposes I think doing a fence after the whole write sequence would be sufficient; I only tried the individual fences to try to understand better the discrepancy between the time for writing and the time for reading. At the moment I'm looking into the PolarFire SRAM (AHBLite and AXI) IP core a little more because I plan to use it in conjunction with some math blocks on the FPGA and I found this in the user documentation for the core: "The core does not support outstanding write and read transactions. It de-asserts the ready to the AXI4 Master and re-asserts only when the current transaction is complete." This seems to indicate to me that if there is a mechanism for having more than one outstanding write, it is somewhere in/between the AXI master and the CPU. I hadn't really considered the possibility that the data would be cached on a hart; I really feel I have very little information about the inner workings between the user code causing the write sequence and the data being passed to the fabric interface controller.

hughbreslin commented 2 years ago

Hey @dawsfox sorry for the delay! The COREAXI4INTERCONNECT IP core supports outstanding reads and writes. The interconnect to the LSRAM is configured in performance mode which allows 8 outstanding transactions (configurator screenshot omitted here).

Have you seen our TRM document? Is there any information I could provide here that would be of help? You can also post questions on the RISC-V exchange forum :)

dawsfox commented 2 years ago

Thanks for your responses @hughbreslin ! I have also come across those COREAXI4INTERCONNECT IP core settings and noticed that. As for the question of the difference between write time and read time, it would seem to me that there could be just as many outstanding reads as writes, so I'm still not sure what causes the difference in timing. I have certainly looked through the TRM quite a bit and it has been very helpful. I think for now at least I can close this issue (especially since I've seen that RISC-V assembly can expose the CSRs in any case). I appreciate your correspondence!