howjmay / neon2rvv

A translator from ARM NEON intrinsics to a RISC-V Vector Extension implementation
MIT License

Support verifying test implementation on both ARM and x86 #31

Open howjmay opened 1 year ago

OMaghiarIMG commented 8 months ago

Hello @howjmay, nice work on this project! I've built the tests (on an x86 host) and got the following results.

Using GCC 14.0.1 (g7af0f1e107a):

NEON2RVV_TEST Complete!
Passed:  1481
Failed:  1
Ignored: 209
Coverage rate: 87.58%

Using Clang 19.0 (4cf458c696047d6d2991c121da7a5c165ff747ce):

NEON2RVV_TEST Complete!
Passed:  1276
Failed:  206
Ignored: 209
Coverage rate: 75.46%

Both runs used QEMU v8.1.1. I've also seen some additional failures when building with different optimization levels. I've identified some of the issues and can provide fixes in a couple of days.
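The two coverage rates above are consistent with Passed / (Passed + Failed + Ignored). A minimal sketch of that arithmetic follows. The formula is inferred from the reported numbers, not taken from the test harness, and the `coverage_rate` helper name is hypothetical.

```python
def coverage_rate(passed: int, failed: int, ignored: int) -> float:
    """Return passed tests as a percentage of all tests, rounded to 2 places.

    Assumption: the harness divides by passed + failed + ignored. This is
    inferred from the reported figures, not from the neon2rvv sources.
    """
    total = passed + failed + ignored
    return round(100.0 * passed / total, 2)


# Figures copied from the two test summaries above.
gcc = coverage_rate(1481, 1, 209)
clang = coverage_rate(1276, 206, 209)
print(f"GCC:   {gcc:.2f}%")
print(f"Clang: {clang:.2f}%")
```

Both summaries add up to the same 1691 tests, so the difference in rate comes entirely from the 205 extra failures under Clang.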

howjmay commented 8 months ago

Thank you! Looking forward to your PR. Would it be possible to share how you ran the tests?

OMaghiarIMG commented 8 months ago

> Thank you! Looking forward to your PR. Would it be possible to share how you ran the tests?

Hi @howjmay, I've opened PR https://github.com/howjmay/neon2rvv/pull/309. I've been running the tests like so:

CROSS_COMPILE=/path/to/toolchain/riscv64-unknown-linux-gnu- make CC=/path/to/toolchain/clang CXX=/path/to/toolchain/clang++ SIMULATOR_TYPE=qemu ENABLE_TEST_ALL=1 test

howjmay commented 8 months ago

Thank you for sharing!

OMaghiarIMG commented 8 months ago

> Thank you for sharing!

No problem. I've got a couple of questions, the first regarding the number of Neon intrinsics. According to this website there are 2185 intrinsics for v7, 2754 intrinsics for A32, and 4344 for A64. I presume A32 contains all of v7, and A64 contains all of A32? Looking at this GCC header file (for A32?), there are around 2700 intrinsics; the header file for aarch64 has around 3800, plus a couple for f16/bf16 defined separately, but that still falls short of 4344.

Maybe you know where the complete list of 4344 is defined? And what is this project going to cover? Do you eventually plan to include Zvfh/Zvfbfwma? Vector crypto would also help, when available, with vclz/vcpop/carryless multiplication.

howjmay commented 8 months ago

> According to this website there are 2185 intrinsics for v7, 2754 intrinsics for A32, and 4344 for A64. I presume A32 contains all of v7, and A64 contains all of A32?

Not sure whether I misunderstood you, but I think some intrinsics are A64-only.

> Looking at this GCC header file (for A32?), there are around 2700 intrinsics; the header file for aarch64 has around 3800, plus a couple for f16/bf16 defined separately, but that still falls short of 4344.

This is my fault. At the beginning of this project I directly copied my local arm_neon.h file from an M1 machine. I noticed it lacked some intrinsics, but I didn't have time to add them, and I am not sure whether I accidentally deleted those intrinsics at the beginning.

> Do you eventually plan to include Zvfh/Zvfbfwma?

However, I am not sure the f16/bf16 parts are necessary. What do you think? If they are necessary, I am happy to work on them. I also don't think the poly part will be a top priority.

OMaghiarIMG commented 8 months ago

> Not sure whether I misunderstood you, but I think some intrinsics are A64-only.

No, I meant the other way around: I hope there isn't anything that is not included in A64.

> However, I am not sure the f16/bf16 parts are necessary. What do you think? If they are necessary, I am happy to work on them. I also don't think the poly part will be a top priority.

I wouldn't say they are a priority, and I don't think you can do poly without vector crypto anyway. My concern is that at the moment the neon2rvv header contains ~1700 intrinsics; even if we add 278 for f16, 81 for bf16, and 115 for poly, we're still a long way from 4344. It would be good to understand what exactly isn't covered.
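To put a number on "a long way", here is the arithmetic as a short sketch. The counts are the ones quoted in this comment (with ~1700 taken as exactly 1700); the variable names are illustrative only.

```python
TOTAL_A64 = 4344  # Neon intrinsics listed for A64
covered = 1700    # approximate count currently in neon2rvv.h (quoted as ~1700)
planned = {"f16": 278, "bf16": 81, "poly": 115}

# Coverage if every f16, bf16, and poly intrinsic were added on top.
after_planned = covered + sum(planned.values())
remaining = TOTAL_A64 - after_planned

print(after_planned)  # 2174
print(remaining)      # 2170
```

So even with those three groups added, roughly half of the A64 list would still be unaccounted for.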

howjmay commented 8 months ago

OK, I will check what exactly I missed this weekend. Thank you!

howjmay commented 8 months ago

I roughly checked it. It is my fault that I didn't copy all the functions to neon2rvv.h. I need to implement a proper parser to do it.

howjmay commented 8 months ago

The crawler is in this PR https://github.com/howjmay/neon2rvv/pull/310

OMaghiarIMG commented 8 months ago

Nice, I was thinking of doing the same. But apparently the website displays the wrong results when the requested offset is too high: https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&first=4000 should have shown results 4001-4020, but instead shows 4201-4220.

I made a scraper that clicks through to the next page of the table instead, which seems to have worked. CSV file with all intrinsics: neon_intrinsics.csv

And a breakdown of what is currently covered:
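For reference, the expected paging arithmetic can be sketched as below. It assumes `first` is a 0-based offset and that each page holds 20 rows, both of which are read off the "4001-4020" example rather than from any documentation of the site; the helper names are hypothetical.

```python
import math

PAGE_SIZE = 20  # rows per page, as implied by "results 4001-4020"


def expected_range(first: int, page_size: int = PAGE_SIZE) -> tuple[int, int]:
    """1-based result range a page should show for a 0-based `first` offset."""
    return first + 1, first + page_size


def page_count(total: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages needed to list `total` results."""
    return math.ceil(total / page_size)


print(expected_range(4000))  # (4001, 4020)
print(page_count(4344))      # 218
```

Under these assumptions the full A64 list spans 218 pages, the last one only partly filled.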

Neon2RVV coverage:
Total 1643 / 4344
Bit manipulation     39 / 74
     Bit manipulation / Bitwise clear    16 / 16
     Bit manipulation / Bitwise select   18 / 28
     Bit manipulation / Count leading sign bits      0 / 12
     Bit manipulation / Count leading zeros      1 / 12
     Bit manipulation / Population count     4 / 6

Compare      90 / 300
     Compare / Absolute greater than     2 / 9
     Compare / Absolute greater than or equal to     2 / 9
     Compare / Absolute less than    2 / 9
     Compare / Absolute less than or equal to    2 / 9
     Compare / Bitwise equal     14 / 28
     Compare / Bitwise equal to zero     0 / 31
     Compare / Bitwise not equal to zero     12 / 22
     Compare / Equal to      0 / 3
     Compare / Greater than      14 / 42
     Compare / Greater than or equal to      14 / 42
     Compare / Greater than or equal to zero     0 / 3
     Compare / Greater than zero     0 / 3
     Compare / Less than     14 / 42
     Compare / Less than or equal to     14 / 42
     Compare / Less than or equal to zero    0 / 3
     Compare / Less than zero    0 / 3

Complex arithmetic   0 / 62
     Complex arithmetic / Complex addition   0 / 10
     Complex arithmetic / Complex multiply-accumulate    0 / 20
     Complex arithmetic / Complex multiply-accumulate by scalar      0 / 32

Cryptography     0 / 35
     Cryptography / AES      0 / 4
     Cryptography / CRC32    0 / 8
     Cryptography / SHA1     0 / 6
     Cryptography / SHA256   0 / 4
     Cryptography / SHA512   0 / 4
     Cryptography / SM3      0 / 7
     Cryptography / SM4      0 / 2

Data type conversion     153 / 635
     Data type conversion / Conversions      9 / 195
     Data type conversion / Reinterpret casts    144 / 440

Load     165 / 451
     Load / Load     0 / 1
     Load / Stride   165 / 450

Logical      90 / 124
     Logical / AND   16 / 16
     Logical / Bit clear and exclusive OR    0 / 8
     Logical / Bitwise NOT   12 / 14
     Logical / Exclusive OR      16 / 24
     Logical / Exclusive OR and rotate   0 / 1
     Logical / Negate    8 / 16
     Logical / OR    16 / 16
     Logical / OR-NOT    16 / 16
     Logical / Rotate and exclusive OR   0 / 1
     Logical / Saturating Negate     6 / 12

Move     21 / 53
     Move / Narrow   6 / 12
     Move / Saturating narrow    9 / 27
     Move / Vector move      0 / 2
     Move / Widen    6 / 12

Scalar arithmetic    84 / 184
     Scalar arithmetic / Fused multiply-accumulate by scalar     0 / 8
     Scalar arithmetic / Vector multiply by scalar   20 / 40
     Scalar arithmetic / Vector multiply by scalar and widen     8 / 24
     Scalar arithmetic / Vector multiply-accumulate by scalar    24 / 50
     Scalar arithmetic / Vector multiply-accumulate by scalar and widen      18 / 26
     Scalar arithmetic / Vector multiply-subtract by scalar      14 / 36

Shift    232 / 348
     Shift / Left / Vector rounding shift left   16 / 18
     Shift / Left / Vector saturating rounding shift left    12 / 24
     Shift / Left / Vector saturating shift left     40 / 60
     Shift / Left / Vector shift left    32 / 36
     Shift / Left / Vector shift left and insert     16 / 24
     Shift / Left / Vector shift left and widen      6 / 12
     Shift / Right / Vector rounding shift right     16 / 18
     Shift / Right / Vector rounding shift right and accumulate      16 / 18
     Shift / Right / Vector rounding shift right and narrow      6 / 12
     Shift / Right / Vector saturating rounding shift right and narrow   9 / 27
     Shift / Right / Vector saturating shift right and narrow    9 / 27
     Shift / Right / Vector shift right      16 / 18
     Shift / Right / Vector shift right and accumulate   16 / 18
     Shift / Right / Vector shift right and insert   16 / 24
     Shift / Right / Vector shift right and narrow   6 / 12

Store    120 / 331
     Store / Store   0 / 1
     Store / Stride      120 / 330

Table lookup     16 / 72
     Table lookup / Extended table lookup    6 / 33
     Table lookup / Table lookup     10 / 39

Vector arithmetic    421 / 1081
     Vector arithmetic / Absolute / Absolute difference      14 / 21
     Vector arithmetic / Absolute / Absolute difference and accumulate   12 / 12
     Vector arithmetic / Absolute / Absolute value   8 / 16
     Vector arithmetic / Absolute / Saturating absolute value    6 / 12
     Vector arithmetic / Absolute / Widening absolute difference     6 / 12
     Vector arithmetic / Absolute / Widening absolute difference and accumulate      6 / 12
     Vector arithmetic / Across vector arithmetic / Addition across vector   0 / 17
     Vector arithmetic / Across vector arithmetic / Addition across vector widening      0 / 12
     Vector arithmetic / Across vector arithmetic / Maximum across vector    0 / 15
     Vector arithmetic / Across vector arithmetic / Maximum across vector (IEEE754)      0 / 3
     Vector arithmetic / Across vector arithmetic / Minimum across vector    0 / 15
     Vector arithmetic / Across vector arithmetic / Minimum across vector (IEEE754)      0 / 3
     Vector arithmetic / Add / Addition      18 / 25
     Vector arithmetic / Add / Narrowing addition    36 / 48
     Vector arithmetic / Add / Saturating addition   16 / 48
     Vector arithmetic / Add / Widening addition     12 / 24
     Vector arithmetic / Division    0 / 7
     Vector arithmetic / Dot product     0 / 28
     Vector arithmetic / Matrix multiply     0 / 4
     Vector arithmetic / Maximum     14 / 26
     Vector arithmetic / Minimum     18 / 34
     Vector arithmetic / Multiply / Fused multiply-accumulate    4 / 78
     Vector arithmetic / Multiply / Multiplication   14 / 28
     Vector arithmetic / Multiply / Multiply extended    0 / 29
     Vector arithmetic / Multiply / Multiply-accumulate      28 / 34
     Vector arithmetic / Multiply / Multiply-accumulate and widen    12 / 24
     Vector arithmetic / Multiply / Saturating multiply      10 / 18
     Vector arithmetic / Multiply / Saturating multiply by scalar and widen      20 / 48
     Vector arithmetic / Multiply / Saturating multiply-accumulate   16 / 48
     Vector arithmetic / Multiply / Saturating multiply-accumulate by element    8 / 24
     Vector arithmetic / Multiply / Saturating multiply-accumulate by scalar and widen   4 / 8
     Vector arithmetic / Multiply / Widening multiplication      6 / 11
     Vector arithmetic / Pairwise arithmetic / Pairwise addition     7 / 23
     Vector arithmetic / Pairwise arithmetic / Pairwise addition and widen   24 / 24
     Vector arithmetic / Pairwise arithmetic / Pairwise maximum      7 / 23
     Vector arithmetic / Pairwise arithmetic / Pairwise maximum (IEEE754)    0 / 3
     Vector arithmetic / Pairwise arithmetic / Pairwise minimum      7 / 20
     Vector arithmetic / Pairwise arithmetic / Pairwise minimum (IEEE754)    0 / 6
     Vector arithmetic / Polynomial / Polynomial addition    0 / 7
     Vector arithmetic / Polynomial / Polynomial multiply    0 / 6
     Vector arithmetic / Reciprocal / Reciprocal estimate    4 / 18
     Vector arithmetic / Reciprocal / Reciprocal exponent    0 / 2
     Vector arithmetic / Reciprocal / Reciprocal square-root estimate    4 / 20
     Vector arithmetic / Reciprocal / Reciprocal step    0 / 3
     Vector arithmetic / Rounding    10 / 66
     Vector arithmetic / Square root     0 / 7
     Vector arithmetic / Subtract / Narrowing subtraction    24 / 36
     Vector arithmetic / Subtract / Saturating subtract      16 / 24
     Vector arithmetic / Subtract / Subtraction      18 / 25
     Vector arithmetic / Subtract / Widening subtraction     12 / 24

Vector manipulation      212 / 594
     Vector manipulation / Combine vectors   9 / 15
     Vector manipulation / Copy vector lane      0 / 56
     Vector manipulation / Create vector     9 / 15
     Vector manipulation / Extract one element from vector   18 / 52
     Vector manipulation / Extract vector from a pair of vectors     18 / 28
     Vector manipulation / Reverse bits within elements      0 / 6
     Vector manipulation / Reverse elements      26 / 38
     Vector manipulation / Set all lanes to the same value   54 / 118
     Vector manipulation / Set vector lane   18 / 30
     Vector manipulation / Split vectors     18 / 32
     Vector manipulation / Transpose elements    14 / 68
     Vector manipulation / Unzip elements    14 / 68
     Vector manipulation / Zip elements      14 / 68

howjmay commented 8 months ago

That is super useful!!! Thank you!! Are you going to add them to neon2rvv?

OMaghiarIMG commented 8 months ago

Here are the scripts. I don't know Go, so I used Python.

Scraper:

import csv

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

URL = "https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]"
PAGES = 218  # 4344 intrinsics at 20 rows per page, rounded up

# Open with "w" rather than "a" so that a rerun does not append a second
# header and a duplicate set of rows to an existing file.
with open("neon_intrinsics.csv", "w", newline="") as file:
    # csv.writer quotes fields that contain commas (the Arguments column).
    writer = csv.writer(file)
    writer.writerow(["ReturnType", "Name", "Arguments", "Group"])

    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    try:
        driver.get(URL)
        driver.maximize_window()
        # Dismiss the cookie banner so that it does not cover the table.
        driver.find_element(By.XPATH, "//button[text()='Accept and hide this message ']").click()
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'c-table')))

        total = 0  # running row count (renamed from `sum`, which shadows the builtin)
        for page in range(PAGES):
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            table = soup.find_all(lambda tag: tag.name == "table" and tag.has_attr("class") and ("c-table" in tag.get("class")))[0]
            all_tr = table.find('tbody').find_all('tr')
            total += len(all_tr)
            print(page, total)
            for tr in all_tr:
                td = tr.find_all('td')
                # Columns 2..5 hold the return type, name, arguments, and group.
                writer.writerow(td[i].get_text(strip=True) for i in range(2, 6))

            # The "next" button lives inside the pagination element's shadow DOM;
            # click it through JavaScript instead of element.click().
            element = driver.find_element(By.TAG_NAME, "ads-pagination").shadow_root.find_element(By.CLASS_NAME, "c-pagination-action--next")
            driver.execute_script("arguments[0].click();", element)
            # Note: this only checks that a table is present; it does not by
            # itself prove that the next page has finished rendering.
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'c-table')))
    finally:
        # quit() ends the whole browser session, even if scraping failed midway.
        driver.quit()
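Since the scraper depends on page timing, a quick sanity check on the resulting CSV is worth running. The sketch below is not part of the original scripts: `summarize` is a hypothetical helper, and the inline sample rows are illustrative stand-ins in the scraper's column layout, not scraped data.

```python
import csv
import io
from collections import Counter


def summarize(csv_text: str) -> tuple[int, int, list[str]]:
    """Return (row count, unique name count, names that appear more than once)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(row["Name"] for row in rows)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    return len(rows), len(counts), duplicates


# Illustrative sample with one deliberately repeated row.
sample = '''ReturnType,Name,Arguments,Group
int8x8_t,vadd_s8,"int8x8_t a, int8x8_t b",Vector arithmetic / Add / Addition
int8x16_t,vaddq_s8,"int8x16_t a, int8x16_t b",Vector arithmetic / Add / Addition
int8x8_t,vadd_s8,"int8x8_t a, int8x8_t b",Vector arithmetic / Add / Addition
'''
print(summarize(sample))  # (3, 2, ['vadd_s8'])
```

For the real file, pass the contents of neon_intrinsics.csv; a unique name count other than 4344 would suggest that a page was skipped or read twice.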

Coverage:

import re

import pandas as pd

# Collect the name of every intrinsic defined in neon2rvv.h: any line that
# starts with FORCE_INLINE, followed by a return type and a function name
# beginning with "v".
with open("neon2rvv.h", 'r') as file:
    data = file.read()
intrinsics = set(re.findall(r"^FORCE_INLINE .+? (v.+?)\(.*\)", data, flags=re.MULTILINE))

df = pd.read_csv("neon_intrinsics.csv")

# Optional: drop f16/poly intrinsics and dump the unimplemented ones.
# for data_type in ["float16_t", "float16x4_t", "float16x8_t", "poly8_t", "poly8x8_t", "poly8x16_t", "poly16_t", "poly16x4_t", "poly16x8_t", "poly64_t", "poly64x1_t", "poly64x2_t", "poly128_t"]:
#     df = df[~df["ReturnType"].str.contains(data_type)]
#     df = df[~df["Arguments"].str.contains(data_type)]
# df = df.reset_index(drop=True)
# df.to_csv("neon_filtered.csv", index=False)
# df_unimplemented = df[~df["Name"].isin(intrinsics)]
# df_unimplemented.to_csv("neon_unimplemented.csv", index=False)

print("Neon2RVV coverage:")
# The left-hand total counts every function matched in the header,
# whether or not it also appears in the CSV.
print("Total", len(intrinsics), "/", df["Name"].nunique())

# The primary group is the text before the first " / ". Comparing it exactly
# avoids the substring/regex matching that str.contains() would perform.
primary = df["Group"].str.split(" / ").str[0]

for primary_group in sorted(primary.unique()):
    df_primary = df[primary == primary_group]
    primary_set = set(df_primary["Name"])
    print(primary_group, "\t", len(intrinsics & primary_set), "/", len(primary_set))

    for secondary_group in sorted(df_primary["Group"].unique()):
        secondary_set = set(df_primary.loc[df_primary["Group"] == secondary_group, "Name"])
        print("\t", secondary_group, "\t", len(intrinsics & secondary_set), "/", len(secondary_set))
    print()
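To see which intrinsics are missing rather than only how many, the same regex can feed a per-group listing. This is a standard-library sketch of that idea, not part of the original scripts: the function names are hypothetical, and the `header` and `listing` strings are illustrative stand-ins for neon2rvv.h and neon_intrinsics.csv.

```python
import csv
import io
import re
from collections import defaultdict


def implemented_names(header_text: str) -> set[str]:
    """Extract intrinsic names with the same regex as the coverage script."""
    pattern = r"^FORCE_INLINE .+? (v.+?)\(.*\)"
    return set(re.findall(pattern, header_text, flags=re.MULTILINE))


def missing_by_group(csv_text: str, implemented: set[str]) -> dict[str, list[str]]:
    """Map each group to the sorted intrinsic names absent from `implemented`."""
    missing = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Name"] not in implemented:
            missing[row["Group"]].add(row["Name"])
    return {group: sorted(names) for group, names in sorted(missing.items())}


# Illustrative stand-ins, not real file contents.
header = """\
FORCE_INLINE int8x8_t vadd_s8(int8x8_t a, int8x8_t b) { return a; }
FORCE_INLINE int8x8_t vclz_s8(int8x8_t a) { return a; }
"""
listing = """\
ReturnType,Name,Arguments,Group
int8x8_t,vadd_s8,"int8x8_t a, int8x8_t b",Vector arithmetic / Add / Addition
int8x8_t,vcls_s8,int8x8_t a,Bit manipulation / Count leading sign bits
int8x8_t,vclz_s8,int8x8_t a,Bit manipulation / Count leading zeros
"""
print(missing_by_group(listing, implemented_names(header)))
```

With the real files, groups that report 0 coverage in the breakdown (for example the across-vector reductions) would show up here with their full name lists.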

howjmay commented 7 months ago

I have been busy recently. I will add the missing intrinsics in the coming week.

howjmay commented 7 months ago

All added.