riscvarchive / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

Implications of widely distributed draft 0.7.1 implementation(s) #667

Closed brucehoult closed 3 years ago

brucehoult commented 3 years ago

We all would love to see only implementations compatible with the soon to be ratified 1.0 standard. This would ease the software burden and require Linux (etc) distros to have to deal with only RV64GC and RV64GCV variants.

However this is not the world we find ourselves in.

Allwinner is now in full production of their D1 SoC using the Alibaba/T-Head C906 core. I've heard the initial run is 5 million chips. Probably most are destined for some embedded application, but Sipeed has announced they will be selling Linux SBCs using this chip starting at $12.50, and Pine64 has said they'll have one for "under $10". Possibly there will be others too.

At those kinds of prices these may quickly become the most used RISC-V Linux boards.

I don't think we can pretend they don't exist.

About a week ago an engineer at Sipeed offered to run on Allwinner's reference board any test programs I sent him. I sent him my primes benchmark http://hoult.org/primes.txt and we found that at 1.0 GHz it runs 8.7% faster than on the HiFive Unleashed at 1.0 GHz. I then sent a test that simply probed vsetvli and found that the returned value for e8,m1 is min(avl,16).
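
As a rough illustration of what that probe implies, here is a minimal Python model of the simplest vsetvli rule, where the granted length is the requested length capped at VLMAX = VLEN/SEW*LMUL. The function names are made up for this sketch, and the 128-bit VLEN is an inference from the observed cap of 16 elements at e8,m1, not a figure taken from vendor documentation. The 1.0 spec also permits some other return values when avl lies between VLMAX and 2*VLMAX; that freedom is not modelled here.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Largest element count one vector instruction can process."""
    return vlen_bits // sew_bits * lmul


def vsetvli_model(avl: int, vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Simplest vsetvli behaviour: grant avl elements, capped at VLMAX."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))


# Assumption: a cap of 16 elements at e8,m1 corresponds to VLEN = 128 bits.
ASSUMED_VLEN = 128

for avl in (0, 5, 16, 17, 1000):
    print(avl, vsetvli_model(avl, ASSUMED_VLEN, 8, 1))
```

With these assumptions the model returns avl unchanged up to 16 and 16 for anything larger, which is the pattern the probe showed.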

This SoC has a vector unit.

Up until this point Sipeed didn't know whether Allwinner had bought the optional vector unit or not. Sipeed has C906 manuals from T-Head, and had a few reference boards from Allwinner, but didn't have documentation from Allwinner yet.

I then got ssh access to a board in Beijing via an engineer at RVboards.org in Shenzhen.

I've run tests on memcpy() and strcpy() using the standard glibc in Debian 11 and using RVV versions basically from the 0.7.1 spec. The results are here:

http://hoult.org/d1_memcpy.txt http://hoult.org/d1_strcpy.txt

RVV memcpy() is 60% faster for a zero-length memcpy, rising to 3x faster between 8 bytes and 128 bytes, close to 2x at 4K, then gradually converging to both doing 1.1 GB/s once out of L1 cache.

Incidentally, I ran the same test on HiFive Unleashed. It peaked at 1.9 GB/s at 2K size (about the same as the Allwinner using standard glibc) then dropped gradually to 30 MB/s at 64 MB size. The Allwinner maintains 1.1 GB/s at 64 MB size.

RVV strcpy() is 50% faster at 0 string length, 3x faster at 32 bytes and maintains more than 2x all the way to main memory.

It is definitely desirable for owners of these boards that glibc in standard Linux distributions gets versions of these and similar library functions that are RVV optimised for their boards, alongside versions for 1.0-compliant implementations and plain RV64GC systems.

I would say it is desirable for the distribution maintainers also.

It would be nice if at least this kind of simple function could have identical binaries across 0.7.1 and 1.0. Failing that, we need a reliable and easy way to detect which version of V we have.

In general, these functions are going to be fine using just e8,m1 to e8,m8, which have identical encodings in vsetvli across all versions. (this is not true of e16 and e32).
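
To make the encoding claim concrete, the sketch below builds the vtype value for integer LMUL under two layouts. It assumes the 0.7.1 layout is vlmul in bits [1:0] and vsew in bits [4:2], and that the current draft layout is vlmul in bits [2:0], vsew in bits [5:3], vta in bit 6 and vma in bit 7. Those bit positions are my reading of the two specs and the helper names are hypothetical, so check them against the documents before relying on this.

```python
# Field codes shared by both layouts for integer LMUL and standard SEW.
SEW_CODE = {8: 0, 16: 1, 32: 2, 64: 3}
LMUL_CODE = {1: 0, 2: 1, 4: 2, 8: 3}


def vtype_071(sew: int, lmul: int) -> int:
    """Assumed 0.7.1 layout: vsew at bits [4:2], vlmul at bits [1:0]."""
    return (SEW_CODE[sew] << 2) | LMUL_CODE[lmul]


def vtype_10(sew: int, lmul: int, vta: int = 0, vma: int = 0) -> int:
    """Assumed current-draft layout: vma bit 7, vta bit 6, vsew [5:3], vlmul [2:0]."""
    return (vma << 7) | (vta << 6) | (SEW_CODE[sew] << 3) | LMUL_CODE[lmul]


for sew in (8, 16, 32):
    for lmul in (1, 2, 4, 8):
        old, new = vtype_071(sew, lmul), vtype_10(sew, lmul)
        verdict = "same" if old == new else "differ"
        print(f"e{sew},m{lmul}: 0.7.1={old:#04x} draft={new:#04x} {verdict}")
```

Because the vsew code for e8 is zero, moving the vsew field does not change the value for e8; for e16 and e32 the nonzero code lands one bit higher and the values diverge.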

When they match the element size set in VTYPE, 0.7.1's vlb.v/vlbu.v and vsb.v have the same semantics as 1.0's vle8.v and vse8.v. It also seems the store and the old unsigned load have the same encoding, if I haven't had any finger slips.

vlb.v v0,(a1): 12058007

vlbu.v v0,(a1): 02058007 vle8.v v0,(a1): 02058007

vsb.v v0,(a1): 02058027 vse8.v v0,(a1): 02058027
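
For anyone who wants to check those words by hand, here is a small Python decoder. It slices each 32-bit word using what I understand to be the 1.0 vector load/store field layout (nf, mew, mop, vm, lumop, rs1, width, vd, opcode); the field boundaries are an assumption of this sketch, as is the remark that 0.7.1 used a 3-bit mop field whose top bit sits where 1.0 puts mew. For stores the lumop and vd slots hold sumop and vs3.

```python
def fields(word: int) -> dict:
    """Split a vector load/store word using the assumed 1.0 field layout."""
    return {
        "opcode": word & 0x7F,          # 0x07 = LOAD-FP, 0x27 = STORE-FP
        "vd": (word >> 7) & 0x1F,       # vs3 for stores
        "width": (word >> 12) & 0x7,
        "rs1": (word >> 15) & 0x1F,     # base address register (a1 = x11)
        "lumop": (word >> 20) & 0x1F,   # sumop for stores
        "vm": (word >> 25) & 0x1,       # 1 = unmasked
        "mop": (word >> 26) & 0x3,
        "mew": (word >> 28) & 0x1,      # top bit of the 3-bit mop in 0.7.1
        "nf": (word >> 29) & 0x7,
    }


WORDS = {
    "vlb.v": 0x12058007,
    "vlbu.v": 0x02058007,
    "vle8.v": 0x02058007,
    "vsb.v": 0x02058027,
    "vse8.v": 0x02058027,
}

for name, word in WORDS.items():
    print(f"{name:7s} {word:08x} {fields(word)}")

# The signed and unsigned 0.7.1 byte loads differ only in bit 28.
print(f"vlb.v ^ vlbu.v = {WORDS['vlb.v'] ^ WORDS['vlbu.v']:#010x}")
```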

Obviously there are many many things which have zero chance of being compatible between 0.7.1 and the ratified Vector extension. Anything using mixed element sizes without changing VTYPE in between, for a start.

But this one ... unit stride loads and stores of 8 bit elements ... seems to be doable. And it's enough for memcpy.
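
To show why those few instructions suffice, here is a Python model of the usual strip-mined copy loop. It is only a behavioural sketch: the VLMAX of 128 bytes assumes VLEN = 128 bits with e8,m8, the function name is made up, and the comments name the instructions each step stands in for rather than reproducing any particular assembly listing.

```python
# Assumption: VLEN = 128 bits, so e8,m8 gives 128 byte elements per pass.
VLMAX_E8_M8 = 128


def rvv_memcpy_model(dst: bytearray, src: bytes, n: int) -> None:
    """Copy n bytes from src to dst in chunks of at most VLMAX elements."""
    offset = 0
    while n > 0:
        vl = min(n, VLMAX_E8_M8)           # vsetvli with e8,m8
        chunk = src[offset:offset + vl]    # unit-stride byte load (vlbu.v / vle8.v)
        dst[offset:offset + vl] = chunk    # unit-stride byte store (vsb.v / vse8.v)
        offset += vl                       # bump both pointers
        n -= vl                            # remaining byte count


source = bytes(range(256)) * 2
target = bytearray(600)
rvv_memcpy_model(target, source, 300)
print(target[:300] == source[:300])
```

The loop uses one vtype setting and one element width throughout, which is the condition under which the 0.7.1 and 1.0 byte loads and stores behave alike.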

I haven't checked the instructions needed for strcpy/strlen yet (vlbuff.v, vmseq.vi, vmfirst.m, vmsif.m) but I will shortly.

I just thought I'd get the discussion going sooner rather than later.

David-Horner commented 3 years ago

I am top posting because the initial points I want to make are not specific to the details you mention.

I agree that mechanisms to support existing hardware should be advocated by, and supported by, RISCV.org. I want to point out that this example is not an outlier but rather a direct and natural consequence of the revolution in hardware design RISCV has birthed. How to support "in development" specification hardware releases will be challenging, but it is only one aspect of the need to support a diverse production of chips/hardware designs.

The existing typical way does not serve the RISCV community well. Incumbents benefit from a system that has been geared to their needs over decades. For RISCV to realize its potential, bold and innovative approaches are needed.

Consider errata. Ad hoc patches and fixes to software and hardware across the full gamut of the industry are the current approach. Large manufacturers can persuade industry partners and even competitors to accommodate their design/production/manufacturing flaws. Small operators are not afforded the same expenditure of effort by OS, mid-layer, compiler/toolchain and integrators [hardware and software, debuggers/optimizers/IDEs]. A single nominal flaw can put a producer out of business. ARM, Intel/AMD and IBM may not care about such players, but I understand that we do.

Similarly, for too long, Custom ISA extension support has been almost non-existent. This is our big advantage and a major differentiator that we have not leveraged. On another thread someone mentioned RISCY and problems/complications there. [I still hope to hear the specifics]. These concerns, Bruce's well articulated issue and, in general, support for "imperfect or experimental" designs and implementations are the make or break of RISCV world domination; it is key for acceptance across all hardware levels.

Having said all that, how do we do it? I have various ideas, but I don't think any of mine are needed as a catalyst to develop the motivation and then the infrastructure to address the unique and mundane challenges raised by RISCV ISA and RISCV.org.

On Sat, Apr 24, 2021 at 2:42 PM Bruce Hoult @.***> wrote:

We all would love to see only implementations compatible with the soon to be ratified 1.0 standard. This would ease the software burden and require Linux (etc) distros to have to deal with only RV64GC and RV64GCV variants.

However this is not the world we find ourselves in.

Allwinner is now in full production of their D1 SoC using the Alibaba/T-Head C906 core. I've heard the initial run is 5 million chips. Probably most are destined for some embedded application, but Sipeed has announced they will be selling Linux SBCs using this chip starting at $12.50, and Pine64 has said they'll have one for "under $10". Possibly there will be others too.

At those kinds of prices these may quickly become the most used RISC-V Linux boards.

I don't think we can pretend they don't exist.

About a week ago an engineer at Sipeed offered to run on Allwinner's reference board any test programs I sent him. I sent him my primes benchmark http://hoult.org/primes.txt and we found that at 1.0 GHz it runs 8.7% faster than on the HiFive Unleashed at 1.0 GHz. I then sent a test that simply probed vsetvli and found that the returned value for e8,m1 is min(avl,16).

This SoC has a vector unit.

Up until this point Sipeed didn't know whether Allwinner had bought the optional vector unit or not. Sipeed has C906 manuals from T-Head, and had a few reference boards from Allwinner, but didn't have documentation from Allwinner yet.

I then got ssh access to a board in Beijing via an engineer at RVboards.org in Shenzhen.

I've run tests on memcpy() and strcpy() using the standard glibc in Debian 11 and using RVV versions basically from the 0.7.1 spec. The results are here:

http://hoult.org/d1_memcpy.txt http://hoult.org/d1_strcpy.txt

RVV memcpy() is 60% faster for a zero-length memcpy, rising to 3x faster between 8 bytes and 128 bytes, close to 2x at 4K, then gradually converging to both doing 1.1 GB/s once out of L1 cache.

Incidentally, I ran the same test on HiFive Unleashed. It peaked at 1.9 GB/s at 2K size (the same as the Allwinner using standard glibc) then dropped gradually to 30 MB/s at 64 MB size. The Allwinner maintains 1.1 GB/s at 64 MB size.

RVV strcpy() is 50% faster at 0 string length, 3x faster at 32 bytes and maintains more than 2x all the way to main memory.

It is definitely desirable for owners of these boards that glibc in standard Linux distributions gets versions of these and similar library functions that are RVV optimised for their boards, alongside versions for 1.0-compliant implementations and plain RV64GC systems.

I would say it is desirable for the distribution maintainers also.

It would be nice if at least this kind of simple function could have identical binaries across 0.7.1 and 1.0. Failing that, we need a reliable and easy way to detect which version of V we have.

In general, these functions are going to be fine using just e8,m1 to e8,m8, which have identical encodings in vsetvli across all versions. (this is not true of e16 and e32).

When they match the element size set in VTYPE, 0.7.1's vlb.v/vlbu.v and vsb.v have the same semantics as 1.0's vle8.v and vse8.v. However they don't have the same encoding.

vlb.v v0,(a1): 12058007

vlbu.v v0,(a1): 02058007 vle8.v v0,(a1): 00058007

vsb.v v0,(a1): 02058027 vse8.v v0,(a1): 00058027

vlbu.v and vle8.v differ in only one bit! And the same with vsb.v and vse8.v This is MEW in the 1.0 encoding.

If the sense of MEW could be reversed, the encodings would be compatible.

Obviously there are many many things which have zero chance of being compatible between 0.7.1 and the ratified Vector extension.

But this one ... unit stride loads and stores of 8 bit elements ... could be made to match. And it's enough for memcpy.

I haven't checked the instructions needed for strcpy/strlen yet (vlbuff.v, vmseq.vi, vmfirst.m, vmsif.m).


brucehoult commented 3 years ago

I expect to have an SBC using the Allwinner D1 with T-Head Xuantie 906 core within the next week and will be able to conduct more extensive investigations of it.

The $99 "Nezha" board is currently available to the public on indiegogo (from Sipeed) and aliexpress (from RVboards). Production of the SoC is currently ramping up. Sipeed are still promising a ~$10 board based on this SoC later in the year, which should be a very serious competitor to the Raspberry Pi Zero and possibly sell in relatively large numbers.

I will have more analysis here later but for now I have three concrete proposals to increase compatibility between these chips implementing 0.7.1 and those implementing 1.0:

Revert b8cd98bc946 "Make vlmul bits contiguous in vtype."

This will restore vtype compatibility between 0.7.1 and 1.0 for integral LMUL by putting the vsew field in the same place, allowing code using any of e8,e16,e32 to work on both the Alibaba cores and RVV 1.0 (assuming certain other conditions are met).

Because of changes in loads and stores, code can only be compatible between 0.7.1 and 1.0 if the element width of the load/store is the same as vsew in the current vtype.

Once I have the board I will check its behaviour with respect to masked off and tail elements.

In 0.7.1 masked off elements are required to be unchanged, so the definition of vma=0 as undisturbed is good.

Draft 0.7.1 required vector operations to zero vector tail elements. This is not an option in the current spec, which allows only "undisturbed" and "agnostic" (destination elements can be either undisturbed or all 1s, and differ from element to element). Clearly code that depends on 0.7.1 zeroing the tail elements can not work on 1.0. Similarly 1.0 code that depends on tail elements being undisturbed can not work on 0.7.1.

Code using 1.0 "tail agnostic" will work on 0.7.1 if it is truly agnostic and does not depend on the elements being either undisturbed or all 1s.

The current spec notes: "The value of all 1s instead of all 0s was chosen for the overwrite value to discourage software developers from depending on the value written."

This is a reasonable point of view, but not a very strong one.

If the overwrite value was all 0s then the 0.7.1 tail zeroing behaviour would be a conforming implementation of Tail Agnostic.

In this case, it would be better if the encoding of vta was 0 for agnostic and 1 for undisturbed, not the reverse as at present.
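
The tail argument above can be spelled out as a small check. The sketch models tail elements as Python integers and tests a result against the current draft's tail agnostic rule as quoted here (each tail element is either undisturbed or all 1s, independently per element). The function names and the sample values are invented for illustration.

```python
def tail_ok_under_10_agnostic(old_tail, new_tail, sew_bits):
    """True if every tail element is either unchanged or all 1s."""
    all_ones = (1 << sew_bits) - 1
    return all(new == old or new == all_ones
               for old, new in zip(old_tail, new_tail))


def tail_is_zeroed(new_tail):
    """True if the tail looks like draft 0.7.1 behaviour (all zero)."""
    return all(value == 0 for value in new_tail)


old_tail = [0x07, 0x00, 0x09]            # hypothetical previous register contents

undisturbed = list(old_tail)
mixed = [0x07, 0xFF, 0x09]               # one element overwritten with all 1s
zeroed = [0x00, 0x00, 0x00]              # what a 0.7.1 implementation writes

print(tail_ok_under_10_agnostic(old_tail, undisturbed, 8))
print(tail_ok_under_10_agnostic(old_tail, mixed, 8))
print(tail_ok_under_10_agnostic(old_tail, zeroed, 8))
print(tail_is_zeroed(zeroed))
```

Under this reading the zeroed result fails the check unless every tail element happened to be zero already, which is exactly why allowing all 0s as an overwrite value would make 0.7.1 hardware conforming.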

Summary:

The following changes to the current draft spec would very significantly increase binary compatibility between draft 0.7.1 and the current draft, enabling carefully-written code to work on both.

As a concrete goal, I would like to see a situation in which all standard library routines that benefit from the V extension can be written to be binary compatible between 0.7.1 and 1.0. This would enable these very common functions to have to choose only between scalar and vectorized versions, not between scalar and multiple vectorized versions.

nick-knight commented 3 years ago

change at least Tail Agnostic to allow all 0s as a valid result, either in place of or in addition to all 1s

I proposed a form of this over at #650 and ran into strong resistance. Good luck!

aswaterman commented 3 years ago

We are not going to change the architecture to cater to an implementation of a draft standard.

In the long run, these 5M incompatible chips will represent a tiny fraction of the RVV ecosystem. Hobbling either the software or the architecture to support them will look foolish in a decade.

brucehoult commented 3 years ago

I contest that there is any "hobbling" involved.

The exact layout of vtype or whether the sense of vta is inverted is invisible to and of no consequence to assembly language programs or programmers or even to compilers that generate assembly language. It is also of no consequence to hardware and hardware designers, other than of course that they have to follow the spec and grab the right bits.

These are essentially arbitrary choices, with no effect on functionality or performance.

The cost is extremely low, and the benefits in the next year or two of increased compatibility with the only RISC-V hardware with a vector ISA (not the RISC-V Vector extension, but very very similar) actually on the market are significant.

I believe this should be given wider consideration, not arbitrarily rejected.

David-Horner commented 3 years ago

@brucehoult - I am glad you continue to post even though the "issue" was closed. I agree that the pronouncements/reports of this issue's death have been greatly exaggerated [appreciations to Mark Twain].

The discussion in the TG eventually sided on the restructuring because it was cleaner to have the bits contiguous. The polarity was a secondary concern that more readily resolved once the contiguous bits was "resolved".

The argument that the specification has not been ratified works both ways. Public review could favour the 0.7.1 format for the preexisting bits.

This discussion is by no means finalized in the Vector TG. We move along with consensus, but always are open to revisiting the "tentative decisions", especially when the matters are "bit twiddling".

@all However, the software currently being developed can readily accommodate mixed implementation access [via discovery or flag setting]. We must not penalize early adopters unnecessarily. We should be as accommodating as we can be to those who in good faith support our efforts. We definitely should not be draconian nor Machiavellian.

Similarly, providing zero fill for tail or mask agnostic is in keeping with the spirit of the concept. Ones fill was not stipulated for any other reason than acknowledging that allowing anything was not verifiable. Certification with this zero fill as an exception is eminently reasonable.

The discussion in #650 centred on revising the standard. At this point standardizing with zero fill included would be counter to the agnostic message, diluting it and potentially thwarting portable code before the feature and concept becomes entrenched/instinctive in the community.

We can be flexible and allow choice. We can have our cake and eat it too, when such is an implementation choice. Software and conventions can give us substantial flexibility, to be used when such choice has a definite benefit and low long term maintenance cost.

Navigating this will be challenging, but I believe we collectively are up to that challenge. [Please don't prove me wrong].

aswaterman commented 3 years ago

After these walls of text, I stand by my decision to close this ticket. You can propose a new ISA extension that's compatible with this non-RVV chip if you so choose. This is not the place to do so.

David-Horner commented 3 years ago

Certainly you can call me out for rhetoric; Bruce called you out for yours. Wall or prose path to enlightenment, we can let those following the discussion decide. I am, after all, pro-choice.

But I fully disagree with your conclusion that this is not the correct forum for this discussion.

I mentioned, these issues were discussed within the TG meetings. No one there and then said that these issues were off-topic. No one said that they were uncontentious and that there was only one path forward.

It is precisely because of those TG discussions that we must conclude that this is the appropriate forum. Further, to the extent that there are various remedies possible, presupposing the "a new ISA extension" is the only mechanism for resolution is narrow minded and repressive.

I will raise this issue at the next Vector TG even if its status remains closed, indeed, especially because it was closed prematurely.

brucehoult commented 3 years ago

I'm unsure what to think about "walls of text". I try to keep my posts succinct, but at the same time an argument requires exposition and evidence.

I don't feel an extension is the appropriate thing here. I suppose one could propose a complete "RVV draft 0.7.1 compatibility mode" extension. In a way that would of course be ideal, but supporting 0.7.1 and 1.0 as modes on the same CPU is far more hardware implementation work than I propose, and I doubt anyone would do it.

At most I'm proposing a "decode vtype differently" mode, along with something like a profile -- a list of opcodes and certain coding restrictions which if followed give binary code that works the same on both 0.7.1 and 1.0.

But as 1.0 is not yet ratified, and no one has shipped hardware using the current draft, by far the simpler course is to simply have 1.0 use a vtype format that is backwardly compatible with 0.7.1 -- as it still had less than a year ago.

One relevant point here is that Alibaba have not implemented 0.7.1 EDIV, and thus the chips expect those two vtype bits to be 0. There is no need to try to map EDIV to fractional LMUL. The chips that are shipping support the standard LMUL 1,2,4,8 and SEW 8,16,32 for integer and 32 for FP. They have 24 valid vtype values.

brucehoult commented 7 months ago

Revisiting this almost three years later, I believe even more strongly that the wrong decision was made here and gratuitous incompatibility was introduced for merely aesthetic reasons.

The following is just a survey of where we are currently at. I'm not calling for any action now, just hoping to prevent such mistakes in future in situations where what is going to happen is so clearly obvious.

There are of course genuine improvements in RVV 1.0 and fundamental incompatibilities. Code using various 1.0 features is hard to port back to 0.7, but 0.7 code is easy to port to 1.0 except for the issues raised here, chiefly tail-zeroing not being technically compliant with 1.0 tail agnostic (though it would take perverse code to not actually work).

In the long run, these 5M incompatible chips will represent a tiny fraction of the RVV ecosystem. Hobbling either the software or the architecture to support them will look foolish in a decade.

According to Alibaba: "Since the launch of the XuanTie C910, shipments of the XuanTie series of chips have exceeded 4 billion units".

So that's three orders of magnitude off, and though 1.0 is finally starting to emerge, 0.7 is not going away any time in the next five years.

https://www.scmp.com/tech/big-tech/article/3255830/alibabas-damo-academy-plans-launch-latest-version-its-xuantie-risc-v-processor-year

C906 with RVV 0.7.1 is everywhere. Half a dozen companies have chips and boards in the $5-$10 range with them, running embedded Linux in 64 MB RAM. It's in my car's media/nav player (Allwinner F133). C910 with RVV 0.7.1 (sometimes called C920) is in the highest performing RISC-V machine available, the 64 core 2.0 GHz 128 GB RAM Milk-V Pioneer.

Many of these boards are being GIVEN to developers with suitable project proposals by RISC-V International.

The world's only publicly available RISC-V hosting, Scaleway's “Elastic Metal RV1”, uses Sipeed compute module clusters with quad core C910 with RVV 0.7.1 and 16 GB RAM, at a price of 16 euro a month or 4 euro-cents per hour.

Meanwhile, there is right now still only one RVV 1.0 SoC available, the Kendryte K230, with a single 1.6 GHz C908 CPU. It's been available for a couple of months on the CanMV-K230 SBC with 0.5 GB RAM, making it not a great software development machine compared to the 8 GB or 16 GB quad core U74 and C910 boards.

"Horse Creek", if it ever hits the market, doesn't have RVV at all.

More serious RVV 1.0 things are coming this year. Finally.

Banana Pi have pre-announced an 8 core 1.6 GHz board (BPI-F3) with RVA22+V. No price or availability date yet.

Sophgo are updating the 64 core SG2042 to an RVA22+RVV 1.0 version. Also hopefully supporting FP exceptions and not trapping on unknown fence instructions (e.g. fence.tso) :-)

Sophgo also have the upcoming 16 core P670 (and some X280s) SG2380 with Milk-V and Sipeed promising boards this year. That's the most exciting general purpose machine announced as the cores are twice the speed of C910 and U74.

That's great stuff, but it's going to take many years for them to overtake the RVV 0.7.1 installed base, which won't be standing still.

To be fair, I don't know what is happening in embedded space. Maybe RVV 1.0 is already shipping bigtime there. If so, no one is talking about it.

Anyway that's not relevant to questions such as what ISA(s) shrink-wrapped Linux distros should support.

Speaking of which, the best recent software news is that not only is RVV 0.7.1 support (under the name "xtheadvector") now upstreamed into the upcoming gcc 14 (and matching binutils), the standard C intrinsics for RVV can generate code for either RVV 0.7.1 or 1.0.

Someone recently published on Reddit a project using fairly extensive RVV intrinsics. They didn't have any 1.0 hardware and could only test it on qemu. I was able to very quickly and easily (just changing compiler options in the makefile, no source code changes needed) use gcc 14 to build it and run it as RVV 0.7.1 code on my C910-based LicheePi 4A.

https://reddit.com/r/RISCV/comments/1b57gib/comment/kt4t428/

That now clears the way for libraries to use the ifunc (or another) mechanism to provide RV64GC, RVV 1.0, and RVV 0.7.1 implementations, either by hand writing two (very similar) RVV assembly language versions or by writing one version using intrinsics and compiling it twice.
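
The selection logic such a library would need is small. The Python sketch below models it only in outline: a resolver runs once, looks at a set of extension names, and returns one of three implementations. Everything here is hypothetical, including the function names and the way the extension set is supplied; a real ifunc resolver is C code run by the dynamic loader, and how it discovers the vector flavour on a given kernel is not modelled.

```python
from typing import Callable, FrozenSet

CopyFn = Callable[[bytearray, bytes, int], None]


def memcpy_scalar(dst: bytearray, src: bytes, n: int) -> None:
    """Stand-in for the plain RV64GC version."""
    dst[:n] = src[:n]


def memcpy_rvv10(dst: bytearray, src: bytes, n: int) -> None:
    """Stand-in for the build targeting the ratified V extension."""
    dst[:n] = src[:n]


def memcpy_xtheadvector(dst: bytearray, src: bytes, n: int) -> None:
    """Stand-in for the build targeting RVV 0.7.1 (xtheadvector)."""
    dst[:n] = src[:n]


def select_memcpy(extensions: FrozenSet[str]) -> CopyFn:
    """Resolver: pick the best implementation for the detected extensions."""
    if "v" in extensions:
        return memcpy_rvv10
    if "xtheadvector" in extensions:
        return memcpy_xtheadvector
    return memcpy_scalar


# Resolved once, then every call goes straight to the chosen function.
memcpy = select_memcpy(frozenset({"xtheadvector"}))
buffer = bytearray(8)
memcpy(buffer, b"RVV0.7.1", 8)
print(memcpy.__name__, bytes(buffer))
```

If the two vector versions come from one intrinsics source compiled twice, only the resolver needs to know that there are two of them.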

nick-knight commented 7 months ago

I acknowledge Bruce's comments. While I don't have strong opinions myself about the current state of affairs, I do feel more strongly about a certain terminological issue:

RVV 0.7.1 support (under the name "xtheadvector")

I prefer to restrict "RVV" to refer to the ratified standard V-extension, and "XTheadVector" to refer to Alibaba's (nonconforming, nonstandard) extension.

I acknowledge that there is some inertia referring to XTheadVector as "RVV 0.7.1", so I'll leave it up to Alibaba to fight a marketing battle.