Open howjmay opened 1 year ago
Thank you! Looking forward to your PR. Could you also share how you ran the tests?
Hi @howjmay, I opened PR https://github.com/howjmay/neon2rvv/pull/309. I've been running the tests like so:
CROSS_COMPILE=/path/to/toolchain/riscv64-unknown-linux-gnu- make CC=/path/to/toolchain/clang CXX=/path/to/toolchain/clang++ SIMULATOR_TYPE=qemu ENABLE_TEST_ALL=1 test
Thank you for sharing!
No problem. So I've got a couple of questions, first regarding the number of Neon intrinsics. According to this website there are 2185 intrinsics for v7, 2754 intrinsics for A32, and 4344 for A64. I presume A32 contains all of v7, and A64 contains all of A32? Looking at this GCC header file for A32? there are around 2700 intrinsics, then the header file for aarch64 has around 3800 plus a couple for f16/bf16 separately, but still falling short of 4344.
Maybe you know where the complete list of 4344 is defined? And what is this project going to cover? Do you eventually plan to include Zvfh/Zvfbfwma? Vector crypto, when available, would also help with vclz/vcpop/carry-less multiplication.
According to this website there are 2185 intrinsics for v7, 2754 intrinsics for A32, and 4344 for A64. I presume A32 contains all of v7, and A64 contains all of A32?
Not sure whether I misunderstood you, but I think some intrinsics are A64-only.
Looking at this GCC header file for A32? there are around 2700 intrinsics, then the header file for aarch64 has around 3800 plus a couple for f16/bf16 separately, but still falling short of 4344.
This is my fault. At the beginning of this project I directly copied the local arm_neon.h file from my M1 machine. I noticed that it lacked some intrinsics, but I didn't have time to add them, and I am not sure whether I accidentally deleted those intrinsics at the beginning.
Do you eventually plan to include Zvfh/Zvfbfwma
However, I am not sure the f16/bf16 parts are necessary. What do you think? If they are necessary, I am happy to work on them. I don't think the poly part will be a first priority either.
Not sure whether I misunderstood you, but I think some intrinsics are A64-only.
No, I meant the other way around: I hope there isn't anything that is not included in A64.
However, I am not sure the f16/bf16 parts are necessary. What do you think? If they are necessary, I am happy to work on them. I don't think the poly part will be a first priority either.
I wouldn't say they are a priority, and I don't think you can do poly without vector crypto anyway. My concern is that at the moment the neon2rvv header contains ~1700 intrinsics; even if we add 278 for f16, 81 for bf16, and 115 for poly, we're still a long way from 4344. It would be good to understand what exactly isn't covered.
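To put rough numbers on that gap, here is a quick sketch that uses only the approximate counts quoted in this comment. The ~1700 figure is an estimate, so the result is approximate too.

```python
# Approximate counts quoted in this thread; none of them are exact.
total_a64 = 4344   # intrinsics listed for A64
in_header = 1700   # roughly what neon2rvv.h contains at this point
extra = {"f16": 278, "bf16": 81, "poly": 115}

# What the header would cover if every f16/bf16/poly intrinsic were added.
covered_after = in_header + sum(extra.values())
unaccounted = total_a64 - covered_after

print("covered after adding f16/bf16/poly:", covered_after)
print("still unaccounted for:", unaccounted)
```

So even with those three groups added, roughly half of the A64 list would still be missing.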
ok I will check what exactly I missed this weekend. Thank you!
I roughly checked it. It is my fault that I didn't copy all the functions to neon2rvv.h. I need to implement a proper parser to do it.
The crawler is in this PR https://github.com/howjmay/neon2rvv/pull/310
Nice, I was thinking of doing the same. But apparently the website displays the wrong results when the requested offset is too high: https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&first=4000 should have shown results 4001-4020, but instead shows 4201-4220.
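For reference, the expected paging arithmetic can be sketched as below. The page size of 20 is an assumption inferred from the "4001-4020" range, and the helper names are made up for illustration.

```python
import math

PAGE_SIZE = 20  # assumption: inferred from the "4001-4020" range above


def expected_range(first, page_size=PAGE_SIZE):
    """Return the 1-based (start, end) result numbers for a zero-based offset."""
    return first + 1, first + page_size


def page_count(total, page_size=PAGE_SIZE):
    """Number of pages needed to list `total` results."""
    return math.ceil(total / page_size)


print(expected_range(4000))  # what first=4000 should display
print(page_count(4344))      # pages needed to walk the whole list
```

Walking the table page by page avoids depending on the `first=` offset at all.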
I made a scraper that clicks through to the next page of the table instead, which seems to have worked.
CSV file with all intrinsics: neon_intrinsics.csv
And a breakdown of what is currently covered:
Neon2RVV coverage:
Total 1643 / 4344
Bit manipulation 39 / 74
Bit manipulation / Bitwise clear 16 / 16
Bit manipulation / Bitwise select 18 / 28
Bit manipulation / Count leading sign bits 0 / 12
Bit manipulation / Count leading zeros 1 / 12
Bit manipulation / Population count 4 / 6
Compare 90 / 300
Compare / Absolute greater than 2 / 9
Compare / Absolute greater than or equal to 2 / 9
Compare / Absolute less than 2 / 9
Compare / Absolute less than or equal to 2 / 9
Compare / Bitwise equal 14 / 28
Compare / Bitwise equal to zero 0 / 31
Compare / Bitwise not equal to zero 12 / 22
Compare / Equal to 0 / 3
Compare / Greater than 14 / 42
Compare / Greater than or equal to 14 / 42
Compare / Greater than or equal to zero 0 / 3
Compare / Greater than zero 0 / 3
Compare / Less than 14 / 42
Compare / Less than or equal to 14 / 42
Compare / Less than or equal to zero 0 / 3
Compare / Less than zero 0 / 3
Complex arithmetic 0 / 62
Complex arithmetic / Complex addition 0 / 10
Complex arithmetic / Complex multiply-accumulate 0 / 20
Complex arithmetic / Complex multiply-accumulate by scalar 0 / 32
Cryptography 0 / 35
Cryptography / AES 0 / 4
Cryptography / CRC32 0 / 8
Cryptography / SHA1 0 / 6
Cryptography / SHA256 0 / 4
Cryptography / SHA512 0 / 4
Cryptography / SM3 0 / 7
Cryptography / SM4 0 / 2
Data type conversion 153 / 635
Data type conversion / Conversions 9 / 195
Data type conversion / Reinterpret casts 144 / 440
Load 165 / 451
Load / Load 0 / 1
Load / Stride 165 / 450
Logical 90 / 124
Logical / AND 16 / 16
Logical / Bit clear and exclusive OR 0 / 8
Logical / Bitwise NOT 12 / 14
Logical / Exclusive OR 16 / 24
Logical / Exclusive OR and rotate 0 / 1
Logical / Negate 8 / 16
Logical / OR 16 / 16
Logical / OR-NOT 16 / 16
Logical / Rotate and exclusive OR 0 / 1
Logical / Saturating Negate 6 / 12
Move 21 / 53
Move / Narrow 6 / 12
Move / Saturating narrow 9 / 27
Move / Vector move 0 / 2
Move / Widen 6 / 12
Scalar arithmetic 84 / 184
Scalar arithmetic / Fused multiply-accumulate by scalar 0 / 8
Scalar arithmetic / Vector multiply by scalar 20 / 40
Scalar arithmetic / Vector multiply by scalar and widen 8 / 24
Scalar arithmetic / Vector multiply-accumulate by scalar 24 / 50
Scalar arithmetic / Vector multiply-accumulate by scalar and widen 18 / 26
Scalar arithmetic / Vector multiply-subtract by scalar 14 / 36
Shift 232 / 348
Shift / Left / Vector rounding shift left 16 / 18
Shift / Left / Vector saturating rounding shift left 12 / 24
Shift / Left / Vector saturating shift left 40 / 60
Shift / Left / Vector shift left 32 / 36
Shift / Left / Vector shift left and insert 16 / 24
Shift / Left / Vector shift left and widen 6 / 12
Shift / Right / Vector rounding shift right 16 / 18
Shift / Right / Vector rounding shift right and accumulate 16 / 18
Shift / Right / Vector rounding shift right and narrow 6 / 12
Shift / Right / Vector saturating rounding shift right and narrow 9 / 27
Shift / Right / Vector saturating shift right and narrow 9 / 27
Shift / Right / Vector shift right 16 / 18
Shift / Right / Vector shift right and accumulate 16 / 18
Shift / Right / Vector shift right and insert 16 / 24
Shift / Right / Vector shift right and narrow 6 / 12
Store 120 / 331
Store / Store 0 / 1
Store / Stride 120 / 330
Table lookup 16 / 72
Table lookup / Extended table lookup 6 / 33
Table lookup / Table lookup 10 / 39
Vector arithmetic 421 / 1081
Vector arithmetic / Absolute / Absolute difference 14 / 21
Vector arithmetic / Absolute / Absolute difference and accumulate 12 / 12
Vector arithmetic / Absolute / Absolute value 8 / 16
Vector arithmetic / Absolute / Saturating absolute value 6 / 12
Vector arithmetic / Absolute / Widening absolute difference 6 / 12
Vector arithmetic / Absolute / Widening absolute difference and accumulate 6 / 12
Vector arithmetic / Across vector arithmetic / Addition across vector 0 / 17
Vector arithmetic / Across vector arithmetic / Addition across vector widening 0 / 12
Vector arithmetic / Across vector arithmetic / Maximum across vector 0 / 15
Vector arithmetic / Across vector arithmetic / Maximum across vector (IEEE754) 0 / 3
Vector arithmetic / Across vector arithmetic / Minimum across vector 0 / 15
Vector arithmetic / Across vector arithmetic / Minimum across vector (IEEE754) 0 / 3
Vector arithmetic / Add / Addition 18 / 25
Vector arithmetic / Add / Narrowing addition 36 / 48
Vector arithmetic / Add / Saturating addition 16 / 48
Vector arithmetic / Add / Widening addition 12 / 24
Vector arithmetic / Division 0 / 7
Vector arithmetic / Dot product 0 / 28
Vector arithmetic / Matrix multiply 0 / 4
Vector arithmetic / Maximum 14 / 26
Vector arithmetic / Minimum 18 / 34
Vector arithmetic / Multiply / Fused multiply-accumulate 4 / 78
Vector arithmetic / Multiply / Multiplication 14 / 28
Vector arithmetic / Multiply / Multiply extended 0 / 29
Vector arithmetic / Multiply / Multiply-accumulate 28 / 34
Vector arithmetic / Multiply / Multiply-accumulate and widen 12 / 24
Vector arithmetic / Multiply / Saturating multiply 10 / 18
Vector arithmetic / Multiply / Saturating multiply by scalar and widen 20 / 48
Vector arithmetic / Multiply / Saturating multiply-accumulate 16 / 48
Vector arithmetic / Multiply / Saturating multiply-accumulate by element 8 / 24
Vector arithmetic / Multiply / Saturating multiply-accumulate by scalar and widen 4 / 8
Vector arithmetic / Multiply / Widening multiplication 6 / 11
Vector arithmetic / Pairwise arithmetic / Pairwise addition 7 / 23
Vector arithmetic / Pairwise arithmetic / Pairwise addition and widen 24 / 24
Vector arithmetic / Pairwise arithmetic / Pairwise maximum 7 / 23
Vector arithmetic / Pairwise arithmetic / Pairwise maximum (IEEE754) 0 / 3
Vector arithmetic / Pairwise arithmetic / Pairwise minimum 7 / 20
Vector arithmetic / Pairwise arithmetic / Pairwise minimum (IEEE754) 0 / 6
Vector arithmetic / Polynomial / Polynomial addition 0 / 7
Vector arithmetic / Polynomial / Polynomial multiply 0 / 6
Vector arithmetic / Reciprocal / Reciprocal estimate 4 / 18
Vector arithmetic / Reciprocal / Reciprocal exponent 0 / 2
Vector arithmetic / Reciprocal / Reciprocal square-root estimate 4 / 20
Vector arithmetic / Reciprocal / Reciprocal step 0 / 3
Vector arithmetic / Rounding 10 / 66
Vector arithmetic / Square root 0 / 7
Vector arithmetic / Subtract / Narrowing subtraction 24 / 36
Vector arithmetic / Subtract / Saturating subtract 16 / 24
Vector arithmetic / Subtract / Subtraction 18 / 25
Vector arithmetic / Subtract / Widening subtraction 12 / 24
Vector manipulation 212 / 594
Vector manipulation / Combine vectors 9 / 15
Vector manipulation / Copy vector lane 0 / 56
Vector manipulation / Create vector 9 / 15
Vector manipulation / Extract one element from vector 18 / 52
Vector manipulation / Extract vector from a pair of vectors 18 / 28
Vector manipulation / Reverse bits within elements 0 / 6
Vector manipulation / Reverse elements 26 / 38
Vector manipulation / Set all lanes to the same value 54 / 118
Vector manipulation / Set vector lane 18 / 30
Vector manipulation / Split vectors 18 / 32
Vector manipulation / Transpose elements 14 / 68
Vector manipulation / Unzip elements 14 / 68
Vector manipulation / Zip elements 14 / 68
That is super useful!!! Thank you!! Are you going to add them to neon2rvv?
Here are the scripts. I don't know Go, so I used Python. Scraper:
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

with open("neon_intrinsics.csv", 'a') as file:
    file.write("ReturnType,Name,Arguments,Group\n")

    # Run Chrome headless so the scraper also works without a display.
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get("https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]")
    driver.maximize_window()

    # Dismiss the cookie banner, then wait for the results table to appear.
    driver.find_element(By.XPATH, "//button[text()='Accept and hide this message ']").click()
    wait = WebDriverWait(driver, 5)
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'c-table')))

    total = 0  # running count of scraped rows (renamed from `sum` to avoid shadowing the builtin)
    # 4344 intrinsics at 20 per page is 218 pages.
    for i in range(0, 218):
        data = driver.page_source
        soup = BeautifulSoup(data, 'html.parser')
        table = soup.find_all(lambda tag: tag.name == "table" and tag.has_attr("class") and ("c-table" in tag.get("class")))[0]
        all_tr = table.find('tbody').find_all('tr')
        total += len(all_tr)
        print(i, total)
        for tr in all_tr:
            td = tr.find_all('td')
            # Arguments contain commas, so that field is quoted.
            file.write(f"{td[2].string},{td[3].string},\"{td[4].string}\",{td[5].string}\n")

        # The "next" button lives inside a shadow DOM; click it through JavaScript.
        element = driver.find_element(By.TAG_NAME, "ads-pagination").shadow_root.find_element(By.CLASS_NAME, "c-pagination-action--next")
        # element.click()
        driver.execute_script("arguments[0].click();", element)
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'c-table')))

    driver.close()
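A possible refinement, not part of the script above: the standard-library `csv` module can take care of quoting, so a field containing a comma or a quote character stays intact. This is a minimal sketch; `write_rows` is a hypothetical helper and the sample row is for illustration only.

```python
import csv
import io


def write_rows(out, rows):
    """Write the header plus rows; csv.writer quotes fields only when needed."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ReturnType", "Name", "Arguments", "Group"])
    writer.writerows(rows)


# Illustrative sample row, not scraped data.
rows = [
    ("int8x8_t", "vadd_s8", "int8x8_t a, int8x8_t b",
     "Vector arithmetic / Add / Addition"),
]

buf = io.StringIO()
write_rows(buf, rows)
print(buf.getvalue())

# Reading it back yields the original fields, commas included.
parsed = list(csv.reader(io.StringIO(buf.getvalue())))
```

In the scraper, the same writer could wrap the open file instead of the `io.StringIO` buffer.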
Coverage:
import pandas as pd
import re

# Collect the names of all intrinsics defined in neon2rvv.h.
with open("neon2rvv.h", 'r') as file:
    data = file.read()
result = re.findall(r"^FORCE_INLINE .+? (v.+?)\(.*\)", data, flags=re.MULTILINE)
intrinsics = set(result)

df = pd.read_csv("neon_intrinsics.csv")

# for data_type in ["float16_t", "float16x4_t", "float16x8_t", "poly8_t", "poly8x8_t", "poly8x16_t", "poly16_t", "poly16x4_t", "poly16x8_t", "poly64_t", "poly64x1_t", "poly64x2_t", "poly128_t"]:
#     df = df[~df["ReturnType"].str.contains(data_type)]
#     df = df[~df["Arguments"].str.contains(data_type)]
# df.reset_index()
# df.to_csv("neon_filtered.csv", index=False)
# df_unimplemented = df[~df["Name"].isin(intrinsics)]
# df_unimplemented.to_csv("neon_unimplemented.csv", index=False)

# The primary group is the part of "Group" before the first " / ".
primary_group_list = []
secondary_group_list = sorted(list(set(df["Group"].to_list())))
for group in secondary_group_list:
    primary_group_list.append(group.split(" / ")[0])
primary_group_list = sorted(list(set(primary_group_list)))

print("Neon2RVV coverage:")
print("Total", len(intrinsics), "/", len(set(df["Name"].to_list())))
for primary_group in primary_group_list:
    df_primary = df[df["Group"].str.contains(primary_group)]
    primary_set = set(df_primary["Name"].to_list())
    intrinsics_count = len(primary_set)
    intersection = len(intrinsics.intersection(primary_set))
    print(primary_group, "\t", intersection, "/", intrinsics_count)
    for secondary_group in [group for group in secondary_group_list if primary_group in group]:
        df_secondary = df_primary[df_primary["Group"] == secondary_group]
        secondary_set = set(df_secondary["Name"].to_list())
        intrinsics_count = len(secondary_set)
        intersection = len(intrinsics.intersection(secondary_set))
        print("\t", secondary_group, "\t", intersection, "/", intrinsics_count)
    print()
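To show what the `FORCE_INLINE` regex actually picks up, here is a small self-contained sketch. The header excerpt and the reference set are made up for illustration; the real neon2rvv.h and the scraped CSV are much larger.

```python
import re

# Hypothetical header excerpt, for illustration only.
sample_header = """\
FORCE_INLINE int8x8_t vadd_s8(int8x8_t a, int8x8_t b) { /* ... */ }
FORCE_INLINE void vst1_s8(int8_t *ptr, int8x8_t val) { /* ... */ }
// FORCE_INLINE int8x8_t vsub_s8(int8x8_t a, int8x8_t b);
static int helper(int x) { return x; }
"""

# Same pattern as in the coverage script: the line must start with
# FORCE_INLINE, and the captured name must start with "v".
PATTERN = re.compile(r"^FORCE_INLINE .+? (v.+?)\(.*\)", flags=re.MULTILINE)

implemented = set(PATTERN.findall(sample_header))

# Hypothetical reference list standing in for the "Name" column of the CSV.
reference = {"vadd_s8", "vsub_s8", "vst1_s8", "vmul_s8"}
missing = sorted(reference - implemented)

print(sorted(implemented))
print(missing)
```

Note that the commented-out `vsub_s8` line is not counted, because the pattern is anchored at the start of the line. Any intrinsic declared without a leading `FORCE_INLINE` would be missed in the same way.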
I have been busy recently. I will add the missing intrinsics in the coming week.
all added
Hello @howjmay, nice work with this project! I've built the tests (on an x86 host) and got the following results:
Using GCC 14.0.1 (g7af0f1e107a):
Using Clang 19.0 (4cf458c696047d6d2991c121da7a5c165ff747ce):
Running on QEMU v8.1.1. I've also seen some additional failures when building with different optimization levels. I've identified some of the issues and can provide fixes in a couple of days.