Schema-less decoding is too slow(32s), but python-msgpack just use 2s.

timmy21 commented 11 years ago

I have a file containing a batch of records; each record is a map with string keys and string/int/float values. I decode this file with the code below:

type MsgpackReader struct {
    fp  *os.File
    dec *codec.Decoder
}

func NewMsgpackReader(filename string) *MsgpackReader {
    fp, err := os.Open(filename)
    if err != nil {
        return nil
    }
    r := bufio.NewReader(fp)
    mh := codec.MsgpackHandle{}
    mh.MapType = reflect.TypeOf(map[string]interface{}(nil))
    mh.RawToString = true
    return &MsgpackReader{fp: fp, dec: codec.NewDecoder(r, &mh)}
}

func (reader *MsgpackReader) Read() (record TRecord, err error) {
    err = reader.dec.Decode(&record)
    return
}

func (reader *MsgpackReader) Close() {
    reader.fp.Close()
}

func main() {
    reader := NewMsgpackReader(filename)
    defer reader.Close()
    for {
        _, err := reader.Read()
        if err != nil {
            if err == io.EOF {
                break
            } else {
                continue
            }
        }
    }
}

This file contains 874589 records. Is there anything wrong? I am confused.

ugorji commented 11 years ago

I don't know. I would expect Go Msgpack decoding to be slower than python's. Python's uses the C library, which does a "dumb" decoding into some vanilla structures, while Go does a smart decoding into "typed" structs using reflection. There's definitely a cost to that. You will see a difference in time too between json decoding via C and Go.

Having said that, there may be more you can do to ensure the slowdown is not from your part.

Pass a pointer to a TRecord into your Read method, and update that pointer, as opposed to using a return parameter.
Fix your error checking. This way, you're sure that an error didn't already happen midway in your reading, and consequently mess up the decoder state (making it invalid):

    var trec TRecord
    err = reader.Read(&trec)
    if err == nil { continue }
    if err == io.EOF { break }
    return err
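The two suggestions above can be sketched as a self-contained loop. This is only an illustration under stated assumptions: it uses the standard library's encoding/json stream decoder as a stand-in so that it runs without external packages, and recordDecoder, readAll and countRecords are hypothetical names, not part of the codec library. The loop relies only on a Decode(v interface{}) error method, which is the shape the thread's reader.dec.Decode(&record) call suggests.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// TRecord mirrors the schema-less record type used in the thread.
type TRecord map[string]interface{}

// recordDecoder is a hypothetical interface covering any stream decoder
// with a Decode method of this shape (json.Decoder is used below).
type recordDecoder interface {
	Decode(v interface{}) error
}

// readAll decodes records into a caller-owned variable via a pointer.
// It stops cleanly on io.EOF and returns any other error immediately,
// instead of continuing with a decoder whose state may be invalid.
func readAll(dec recordDecoder) (int, error) {
	n := 0
	for {
		var rec TRecord
		err := dec.Decode(&rec)
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		n++
	}
}

// countRecords is a small helper for the example: it decodes a stream
// of concatenated JSON objects and reports how many were read.
func countRecords(stream string) (int, error) {
	return readAll(json.NewDecoder(strings.NewReader(stream)))
}

func main() {
	n, err := countRecords(`{"a":1} {"b":"x"} {"c":2.5}`)
	fmt.Println(n, err)
}
```

The important difference from the loop in the original post is that a non-EOF error ends the loop rather than triggering another Decode call on the same stream.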

More questions:

What does TRecord look like?
What does your python code look like?

I can't give you any pointers without these answers.

ugorji commented 11 years ago

Also, see and follow this: https://code.google.com/p/go/issues/detail?id=5683#c8

Look at the whole issue. The concerns and responses should apply here too.

Go encoding libraries take the direction of a parse-and-bind model (i.e. parse the input into a real structure), which has its cost. These used to be two different libraries in languages like Java (e.g. JAXB vs JAXP), where doing the binding had an expected cost.

timmy21 commented 11 years ago

I'm sure no error happens when decoding the file.

TRecord:

type TRecord map[string]interface{}

python code snippet:

class MessagePackRecordParser(object):

    def parse(self, stream):
        unpacker = msgpack.Unpacker(stream)
        for rec in unpacker:
            yield rec

class MessagePackRecordLoader(object):

    def __init__(self, filename):
        self.filename = filename
        self.parser = MessagePackRecordParser()

    def records(self):
        with open(self.filename, 'rb') as fp:
            for rec in self.parser.parse(fp):
                yield rec

timmy21 commented 11 years ago

The program is spending most of its time allocating memory and garbage collecting (runtime.mallocgc). I wrote a test that only decodes a []byte for each record, and it still takes 13s.

ugorji commented 11 years ago

I'm inclined to close this issue, because it's hard to test implementations across languages. It's a lot easier to test across different protocols in the same language (ie benchmark against JSON, GOB, BSON, MSGPACK, BINC) and ensure this library performance surpasses the others.

Go encoding libraries do a parsing and binding in one step, using reflection. That doesn't answer why the discrepancy is wide though, but that is something that may have to be fixed in the runtime.

As the runtime gets more performant, things should get better. There is work under way to improve the GC, and future work planned to improve allocation, stacks (from segmented to contiguous) and maps in the next release (1.3).

timmy21 commented 11 years ago

OK, thank you. How does Msgpack performance compare with Protobuf?

ugorji commented 11 years ago

I haven't tested msgpack vs protobuf. protobuf requires a schema (*.proto) file and pre-compilation, and I'm neither familiar enough with it nor motivated enough to do the work needed for a test.

Feel free to test it and let me know what your results are. If it's easy enough to integrate, I'd integrate it into the ext_dep_tests.go file.

timmy21 commented 11 years ago

I have changed my record from map[string]interface{} to a struct, and decoding is now faster.
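A rough way to see why the struct helps is to count heap allocations per decode for both target types. The sketch below is an assumption-laden stand-in, not the thread's benchmark: it uses encoding/json from the standard library instead of msgpack so that it is self-contained, and the Record type with its field names is invented for illustration. Absolute numbers will differ for the codec library, but the pattern of boxing every value into interface{} versus writing into typed fields is the same idea.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// TRecord is the schema-less type from the thread.
type TRecord map[string]interface{}

// Record is a hypothetical typed equivalent; the fields are invented.
type Record struct {
	Name  string  `json:"name"`
	Count int     `json:"count"`
	Score float64 `json:"score"`
}

// sample is one encoded record (JSON here, as a stand-in for msgpack).
var sample = []byte(`{"name":"a","count":3,"score":1.5}`)

// decodeMap decodes the sample into a map of interface values.
func decodeMap() (TRecord, error) {
	var rec TRecord
	err := json.Unmarshal(sample, &rec)
	return rec, err
}

// decodeStruct decodes the sample into typed struct fields.
func decodeStruct() (Record, error) {
	var rec Record
	err := json.Unmarshal(sample, &rec)
	return rec, err
}

// allocsPerDecode reports the average number of heap allocations
// made by one call to decode, measured over 1000 runs.
func allocsPerDecode(decode func()) float64 {
	return testing.AllocsPerRun(1000, decode)
}

func main() {
	m := allocsPerDecode(func() { decodeMap() })
	s := allocsPerDecode(func() { decodeStruct() })
	fmt.Printf("map: %.0f allocs/decode, struct: %.0f allocs/decode\n", m, s)
}
```

Fewer allocations per record means less work for runtime.mallocgc and the garbage collector, which is where the earlier profile showed the time going.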

I have tested protobuf and msgpack with the same records (40 files, 2436058 records in total, each record about 160 bytes). GOMAXPROCS is the default of 1, and I use 4 goroutines, each decoding 10 files. The result: decoding msgpack took 26.805 seconds, and decoding protobuf took 8 seconds.

These results cover only the decode phase. I haven't integrated protobuf into my program yet, so I don't know whether it may slow down other parts.

ugorji commented 11 years ago

@timmy21 thanks much for continuing to look into this.

msgpack is considerably more verbose than protobuf/binc for this use case. The reasons are given at https://github.com/ugorji/binc/blob/master/FAQ.md . The main reason is that protobuf/binc use symbols/tags to refer to strings which show up multiple times in an encoded stream (e.g. encode a field name fully once, and encode it every other time as just a 1-byte identifier/tag). The smaller size of binc may make encoding/decoding faster.
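The size effect of symbols can be illustrated with a toy encoder. This is not the wire format of binc, protobuf or msgpack; the marker bytes, the function names and the 255-key limit are all assumptions made up for the sketch. It only shows how writing a repeated field name once and then referring to it by a small id shrinks the stream.

```go
package main

import "fmt"

// encodeKeysPlain writes every key in full as a length byte followed
// by the key bytes, so repeated field names are repeated in the output.
// Toy format: keys must be shorter than 256 bytes.
func encodeKeysPlain(keys []string) []byte {
	var out []byte
	for _, k := range keys {
		out = append(out, byte(len(k)))
		out = append(out, k...)
	}
	return out
}

// encodeKeysSymbols writes a key in full the first time it is seen
// (marker 0x00, length byte, key bytes) and as a two-byte reference
// afterwards (marker 0x01, symbol id).
// Toy format: at most 256 distinct keys, each shorter than 256 bytes.
func encodeKeysSymbols(keys []string) []byte {
	ids := make(map[string]byte)
	var out []byte
	for _, k := range keys {
		if id, ok := ids[k]; ok {
			out = append(out, 0x01, id)
			continue
		}
		ids[k] = byte(len(ids))
		out = append(out, 0x00, byte(len(k)))
		out = append(out, k...)
	}
	return out
}

func main() {
	// Field names of three records that share the same two keys.
	keys := []string{"username", "score", "username", "score", "username", "score"}
	fmt.Printf("plain=%d bytes, symbols=%d bytes\n",
		len(encodeKeysPlain(keys)), len(encodeKeysSymbols(keys)))
	// prints: plain=45 bytes, symbols=25 bytes
}
```

With hundreds of thousands of records sharing the same field names, the one-time cost of the full names becomes negligible and each repeated name costs only the reference.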

To see if it's a msgpack-specific issue, can you please also try with binc (for comparison's sake)? It would be nice to see the size of the encoded message in binc/msgpack/protobuf, as well as the encoding/decoding time.

Binc and Msgpack share a lot of the underlying encoding/decoding logic, with specialized interface implementations where they differ. If Binc is faster, then we know the issue is mostly due to the msgpack encoding format. Also, protobuf/gob use "package unsafe" to gain performance advantages, so this could have a bearing too.

See https://code.google.com/p/go/issues/detail?id=5159
https://code.google.com/p/goprotobuf/source/browse/proto/pointer_unsafe.go https://code.google.com/p/goprotobuf/source/browse/proto/pointer_reflect.go

(To build msgpack without unsafe, add the build tag appengine. ie. go build/install -tags appengine ... )

ugorji commented 11 years ago

@timmy21 can you try again with change: c61e5837a8566751ab31434604dba9f23b126ce9

I added some code that may help your use case. Basically, if we see the same type more than once, we just grab a specific function that handles its en/decoding bypassing a bunch of checks we had to do before.

Since you are encoding a lot of values, this may end up helping somewhat.
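The idea described above, picking a type-specific function once and reusing it for later values of the same type, can be sketched as follows. This is not the code from the referenced commit; fnCache, encodeFn and the string output are hypothetical and exist only to make the caching pattern visible.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
	"sync"
)

// encodeFn is a hypothetical type-specific handler.
type encodeFn func(v reflect.Value) string

// fnCache remembers the handler chosen for each type, so the kind
// checks in get run only the first time a type is seen.
type fnCache struct {
	mu     sync.Mutex
	fns    map[reflect.Type]encodeFn
	builds int // number of times the slow path selected a handler
}

func newFnCache() *fnCache {
	return &fnCache{fns: make(map[reflect.Type]encodeFn)}
}

// get returns the cached handler for t, building it on first use.
func (c *fnCache) get(t reflect.Type) encodeFn {
	c.mu.Lock()
	defer c.mu.Unlock()
	if fn, ok := c.fns[t]; ok {
		return fn
	}
	c.builds++
	var fn encodeFn
	switch t.Kind() {
	case reflect.String:
		fn = func(v reflect.Value) string { return "str:" + v.String() }
	case reflect.Int, reflect.Int64:
		fn = func(v reflect.Value) string { return "int:" + strconv.FormatInt(v.Int(), 10) }
	default:
		fn = func(v reflect.Value) string { return fmt.Sprintf("other:%v", v.Interface()) }
	}
	c.fns[t] = fn
	return fn
}

// encode looks up the handler for x's dynamic type and applies it.
// A nil interface value is not handled in this sketch.
func (c *fnCache) encode(x interface{}) string {
	v := reflect.ValueOf(x)
	return c.get(v.Type())(v)
}

func main() {
	c := newFnCache()
	for _, x := range []interface{}{1, 2, "a", 3, "b"} {
		fmt.Println(c.encode(x))
	}
	fmt.Println("handlers built:", c.builds)
}
```

For a stream of many records with the same shape, the slow path runs once per distinct type and every later value goes straight to its handler.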

Also, can you please try the things I said in the previous message I sent:

try and report results with Binc (just change MsgpackHandle to BincHandle) including time and encoded size
try and report results with MsgpackHandle including time and encoded size
try and report results with appengine build for protobuf (add -tags appengine when building)

Thanks. This will help me get to a resolution on this issue faster. I hate having open issues - they mess with my zen ;)

ugorji commented 11 years ago

@timmy21 please try with the latest changes at 265e3b5, for both binc and msgpack, and report your findings, especially in contrast with protocol buffers, including encoding/decoding time and encoded size.

Thanks.

ugorji commented 11 years ago

I'm closing this issue now for lack of follow-up. Please respond with the information requested, and then we can consider re-opening it if there is anything to fix on our end.

ugorji / go

Schema-less decoding is too slow(32s), but python-msgpack just use 2s. #10