llir / llvm

Library for interacting with LLVM IR in pure Go.
https://llir.github.io/document/
BSD Zero Clause License

ast parse error #212

Closed pupiles closed 2 years ago

pupiles commented 2 years ago

Hi,

invoke void (%"class.std::__cxx11::basic_string"*, i32 (i8*, i64, i8*, %struct.__va_list_tag*)*, i64, i8*, ...) @_ZN9__gnu_cxx12__to_xstringINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEcEET_PFiPT0_mPKS8_P13__va_list_tagEmSB_z(%"class.std::__cxx11::basic_string"* nonnull sret align 8 %6, i32 (i8*, i64, i8*, %struct.__va_list_tag*)* nonnull @vsnprintf, i64 32, i8* getelementptr inbounds ([4 x i8], [4 x i8]* @.str.39, i64 0, i64 0), i64 %137)
          to label %138 unwind label %204, !dbg !86865

The code above fails to parse with asm.ParseFile, seemingly because the type name "class.std::__cxx11::basic_string" is not supported. Could you give me some hints on this? I would really appreciate it. Attachment: test.cpp.o.zip

mewmew commented 2 years ago

@pupiles, I tried compiling the LLVM IR example you provided using Clang (13.0.0), but got a similar parse error:

u@x1 /t/foo [1]> clang -o foo foo.ll
foo.ll:1437:287: error: expected '('
  invoke void (%"class.std::__cxx11::basic_string"*, i32 (i8*, i64, i8*, %struct.__va_list_tag*)*, i64, i8*, ...) @_ZN9__gnu_cxx12__to_xstringINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEcEET_PFiPT0_mPKS8_P13__va_list_tagEmSB_z(%"class.std::__cxx11::basic_string"* nonnull sret align 8 %6, i32 (i8*, i64, i8*, %struct.__va_list_tag*)* nonnull @vsnprintf, i64 32, i8* getelementptr inbounds ([4 x i8], [4 x i8]* @.str.39, i64 0, i64 0), i64 %137)
                                                                                                                                                                                                                                                                                              ^
1 error generated.

So, it seems the official LLVM tools are not able to parse the gdpr_handler.cpp.o LLVM IR file. Try generating a new one using Clang, version 13.0.0.

Cheers, Robin

dannypsnl commented 2 years ago

@pupiles btw, do you know what generated this file? It might use a new LLVM feature.

pupiles commented 2 years ago

@mewmew @dannypsnl, it was generated by Clang 11, and llc-11 handles the file correctly, but I can't parse it with llir at either v0.3.3 (LLVM 11) or v0.3.4 (LLVM 12).

dannypsnl commented 2 years ago

@mewmew

package main

import (
    "github.com/llir/llvm/ir"
    "github.com/llir/llvm/ir/types"
    "github.com/llir/llvm/ir/value"
)

func main() {
    m := ir.NewModule()
    // Placeholder body for the named struct type; the real layout is not needed here.
    basicStringT := m.NewTypeDef("class.std::__cxx11::basic_string", types.NewStruct(types.I8))
    // Declaration of vsnprintf, which is passed to the invokee as a function pointer.
    vsnprintf := m.NewFunc("vsnprintf", types.I32,
        ir.NewParam("", types.NewPointer(types.I8)),
        ir.NewParam("", types.I64),
        ir.NewParam("", types.NewPointer(types.I8)),
    )
    vsnprintf.Sig.Variadic = true
    // Declaration of the function being invoked.
    invokee := m.NewFunc("_ZN9__gnu_cxx12__to_xstringINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEcEET_PFiPT0_mPKS8_P13__va_list_tagEmSB_z",
        types.Void,
        ir.NewParam("", basicStringT),
        ir.NewParam("", vsnprintf.Typ),
        ir.NewParam("", types.I64),
        ir.NewParam("", types.NewPointer(types.I8)),
    )
    mF := m.NewFunc("main", types.I32)
    mB := mF.NewBlock("")
    // Unnamed globals standing in for the call arguments.
    f := m.NewGlobal("", basicStringT)
    i := m.NewGlobal("", types.I64)
    p := m.NewGlobal("", types.NewPointer(types.I8))
    // Draft: the normal and unwind targets are still nil.
    mB.NewInvoke(invokee, []value.Value{f, vsnprintf, i, p}, nil, nil)
    println(m.String())
}

This is a draft to start from.

mewmew commented 2 years ago

It seems there are two primary issues with the LLVM IR file that cause the parsing to fail.

Firstly, the sret parameter attributes (without explicit type) are valid for LLVM 11.0, but not for LLVM 13.0 (see parseRequiredTypeAttr of the official LLVM source code). In LLVM 13.0, an explicit type is needed, e.g.

sret(i8)

This was verified by trying to parse the original gdpr_handler.ll file with opt from LLVM 13.0: opt -S -o foo_13.ll gdpr_handler.ll

A work-around is simply to remove sret from the input LLVM IR file.

Secondly, there is a known issue with llir/llvm where it is unable to parse align attributes. This is due to an LR(1) shift/reduce ambiguity in the original LLVM IR grammar (as described in #40).

If we remove the align and sret attributes, then llir/llvm is able to parse the output produced by opt -S -o foo_13.ll gdpr_handler.ll using LLVM 13.0, when using the llvm13 branch of llir/llvm. Note that support for the DIFlagExportSymbols enum was added in 4653d58ae05b354c7a4743132cdbe96abbed965d.
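
The work-around could be automated with a small text pre-processing pass before the file is handed to the parser. The following is only a sketch under stated assumptions: stripAttrs and both regular expressions are hypothetical names made up for illustration, not part of llir/llvm. It assumes that an align attribute causing trouble is never directly preceded by a comma (unlike the ", align 4" operand of load and store), and it does not try to skip string constants, metadata, or function-level align.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// bareSret matches "sret" together with an optional opening parenthesis,
	// so typed occurrences such as "sret(i8)" can be recognised and kept.
	bareSret = regexp.MustCompile(`\s+sret\b\(?`)
	// attrAlign matches "align N" when the preceding token is not a comma.
	// Assumption: ", align N" on instructions such as load/store stays as-is.
	attrAlign = regexp.MustCompile(`([^,\s])\s+align\s+\d+`)
)

// stripAttrs (hypothetical helper) removes untyped sret attributes and
// attribute-position align from LLVM IR text.
func stripAttrs(src string) string {
	src = bareSret.ReplaceAllStringFunc(src, func(m string) string {
		if strings.HasSuffix(m, "(") {
			return m // already has an explicit type, leave untouched
		}
		return ""
	})
	return attrAlign.ReplaceAllString(src, "$1")
}

func main() {
	line := `call void @f(%T* nonnull sret align 8 %6, i64 32)`
	fmt.Println(stripAttrs(line))
	// prints: call void @f(%T* nonnull %6, i64 32)
}
```

The cleaned text can then be written to a temporary file and parsed as usual. Since this drops information from the IR, it is only suitable when the analysis does not depend on sret or alignment.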

Cheers, Robin

pupiles commented 2 years ago

@mewmew Thanks for your reply. The sret parameter attribute indicates the return value of the function, so I think it is important for data flow analysis, and removing it outright may not be a good decision. If I don't care about the explicit type, is there any other solution? The align \d+ attribute only specifies the alignment, so it can be removed. Some users may not care about strict LLVM IR. Would it be feasible to provide an option so that, when the parser encounters the align \d+ ambiguity, it ignores the attribute instead of reporting an error?

mewmew commented 2 years ago

> Thanks for your reply.

You are most welcome :)

> The sret parameter attribute indicates the return value of the function, so I think it is important for data flow analysis, and removing it outright may not be a good decision. If I don't care about the explicit type, is there any other solution?

The grammar of LLVM 11.0 supported implicit sret, but for LLVM 13.0, an explicit type is required. This is true also for the official LLVM distribution.
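
One possible way to keep the sret information is to rewrite the untyped form into the typed one instead of removing it. The sketch below is hypothetical and not part of llir/llvm: addSretType and its regular expression are illustrative names. It assumes the sret parameter is a pointer to a named type (%name or %"quoted name"), that only plain keyword attributes sit between the type and sret, and that the pointee type is the right argument for sret(...).

```go
package main

import (
	"fmt"
	"regexp"
)

// bareSretParam matches "<named type>* <keyword attrs> sret" plus an optional
// opening parenthesis. Group 1 is the pointee type, group 3 is "(" when the
// attribute already carries an explicit type.
var bareSretParam = regexp.MustCompile(`(%(?:"[^"]*"|[-\w.$]+))\*((?:\s+\w+)*?)\s+sret\b(\(?)`)

// addSretType (hypothetical helper) turns "T* ... sret" into "T* ... sret(T)".
func addSretType(src string) string {
	return bareSretParam.ReplaceAllStringFunc(src, func(m string) string {
		sub := bareSretParam.FindStringSubmatch(m)
		if sub == nil || sub[3] == "(" {
			return m // already typed, or not recognised: leave untouched
		}
		return m + "(" + sub[1] + ")"
	})
}

func main() {
	param := `%T* nonnull sret align 8 %6`
	fmt.Println(addSretType(param))
	// prints: %T* nonnull sret(%T) align 8 %6
}
```

Parameters whose pointee is not a named type (for example i8* or an array type) are left unchanged by this sketch and would need a real type parser.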

> Some users may not care about strict LLVM IR. Would it be feasible to provide an option so that, when the parser encounters the align \d+ ambiguity, it ignores the attribute instead of reporting an error?

That's a good idea. I'm not sure if it is possible, but definitely worth investigating.

Would you care to take a look @pupiles?

The generated lexer and parser are in llir/ll, and the grammar is at llir/grammar. The tool used to generate the lexer and parser is Textmapper. There is some documentation for Textmapper at https://textmapper.org/

Cheers, Robin

dannypsnl commented 2 years ago

Maybe off-topic, but perhaps we could take the asm parser from the LLVM source code, then compile and link it with our Go code? The problems I can see are:

  1. We would have to check the license
  2. We would need to port this back to the older versions that map to different LLVM releases
  3. We would need to convert C/C++ structs back to our Go structs

The benefits I can see are:

  1. We get the same behavior as the official parser
  2. We no longer maintain a parser ourselves
  3. We avoid the LR(1) limitation, which is not enough for LLVM IR
  4. We get more accurate errors

dannypsnl commented 2 years ago

@mewmew now that LLVM 13 is supported, is this solved?

mewmew commented 2 years ago

Given that the llvm13 branch has been merged into master, the work-around mentioned in https://github.com/llir/llvm/issues/212#issuecomment-999690598 should be enough to parse the LLVM IR example source.

The align ambiguity still remains, but it is already tracked by #40, so we can safely close this issue.

Cheers, Robin

P.S. Feel free to re-open this issue or open a new one if there is a parse error related to LLVM 13.0 or LLVM 14.0.