I don't know. I would expect Go msgpack decoding to be slower than Python's. Python's uses the C library, which does a "dumb" decoding into some vanilla structures, while Go does a smart decoding into "typed" structs using reflection. There's definitely a cost to that. You will see a similar time difference between JSON decoding via C and via Go.
Having said that, there may be more you can do to ensure the slowdown is not on your end.
More questions:
I can't give you any pointers without these answers.
Also, see and follow this: https://code.google.com/p/go/issues/detail?id=5683#c8
Look at the whole issue. The concerns and responses should apply here too.
Go encoding libraries take the direction of a parse-and-bind model (i.e. parse it directly into a real structure), which has its cost. These used to be two different libraries in languages like Java (e.g. JAXP for parsing vs JAXB for binding), where the binding step was expected to have a cost.
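To make the parse-and-bind distinction concrete, here is a minimal sketch using the standard library's encoding/json, whose API is analogous; the Point type and sample data are illustrative:

package main

import (
	"encoding/json"
	"fmt"
)

type Point struct {
	X, Y int
}

func main() {
	data := []byte(`{"X":1,"Y":2}`)

	// "Dumb" parse: everything lands in generic maps and interface values.
	var generic map[string]interface{}
	_ = json.Unmarshal(data, &generic)

	// Parse-and-bind: reflection matches keys to the struct's fields.
	var p Point
	_ = json.Unmarshal(data, &p)

	fmt.Println(generic, p) // map[X:1 Y:2] {1 2}
}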
I'm sure no error occurs when decoding the file.
TRecord:
type TRecord map[string]interface{}
Python code snippet:

import msgpack

class MessagePackRecordParser(object):
    def parse(self, stream):
        # Stream-decode records lazily, one at a time.
        unpacker = msgpack.Unpacker(stream)
        for rec in unpacker:
            yield rec

class MessagePackRecordLoader(object):
    def __init__(self, filename):
        self.filename = filename
        self.parser = MessagePackRecordParser()

    def records(self):
        with open(self.filename, 'rb') as fp:
            for rec in self.parser.parse(fp):
                yield rec
The program is spending most of its time allocating memory and garbage collecting (runtime.mallocgc). I wrote a test that only decodes a []byte for each record, and it still takes 13s.
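For reference, a minimal sketch of how such a CPU profile can be captured with the standard runtime/pprof package; the output filename and placement are assumptions, not the reporter's actual setup:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("decode.prof") // illustrative output path
	if err != nil {
		log.Fatal(err)
	}
	pprof.StartCPUProfile(f)
	defer pprof.StopCPUProfile()

	// ... run the decode loop here, then inspect the profile with:
	//     go tool pprof <binary> decode.prof
}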
I'm inclined to close this issue, because it's hard to test implementations across languages. It's a lot easier to test across different protocols in the same language (ie benchmark against JSON, GOB, BSON, MSGPACK, BINC) and ensure this library's performance surpasses the others.
Go encoding libraries do parsing and binding in one step, using reflection. That doesn't answer why the discrepancy is so wide, though; that may be something that has to be fixed in the runtime.
As the runtime gets more performant, things should get better. There's work underway to improve the GC, and work planned for the next release (1.3) to improve allocation, stacks (moving from segmented to contiguous), and maps.
Ok, thank you. How about performance for Msgpack vs Protobuf?
I haven't tested msgpack vs protobuf. protobuf requires a schema (*.proto) file and pre-compilation, and I'm not familiar enough with it, nor do I have the motivation to do the work needed to set up a test.
Feel free to test it and let me know what your results are. It's easy enough to integrate; I'd integrate it into the ext_dep_tests.go file.
I have changed my record from map[string]interface{} to a struct, and decoding is now faster.
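For illustration, a sketch of what that change might look like; the field names are hypothetical, assuming the library's codec struct tags and the string/int/float values described earlier:

// Hypothetical typed replacement for `type TRecord map[string]interface{}`;
// the field names and tags are illustrative only.
type TRecord struct {
	Name  string  `codec:"name"`
	Count int64   `codec:"count"`
	Score float64 `codec:"score"`
}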
I have tested protobuf and msgpack with the same records (40 files, 2436058 records in total, each record almost 160 bytes). GOMAXPROCS is the default 1, using 4 goroutines, with each goroutine decoding 10 files. The result: decoding msgpack took 26.805 seconds, decoding protobuf took 8 seconds.
These results cover only the decode phase. I haven't integrated protobuf into my program yet, so I don't know if it might slow down other parts.
@timmy21 thanks much for continuing to look into this.
msgpack is quite a bit more verbose than protobuf/binc for this use case. The reasons are at https://github.com/ugorji/binc/blob/master/FAQ.md . The main reason is that protobuf/binc use symbols/tags to refer to strings which show up multiple times in an encoded stream (e.g. encode a field name fully once, and encode it every other time as just a 1-byte identifier/tag; a toy illustration follows). The smaller size of binc may make encoding/decoding faster.
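A toy sketch of that symbol/tag idea; this is not Binc's actual wire format, just the principle:

package main

import "fmt"

// symWriter assigns each new string a 1-byte id; repeats are written
// as just the id instead of the full bytes.
type symWriter struct {
	syms map[string]byte
	out  []byte
}

func (w *symWriter) writeString(s string) {
	if id, ok := w.syms[s]; ok {
		w.out = append(w.out, id) // back-reference: 1 byte
		return
	}
	id := byte(len(w.syms))
	w.syms[s] = id
	w.out = append(w.out, id, byte(len(s)))
	w.out = append(w.out, s...) // full string, written only once
}

func main() {
	w := &symWriter{syms: map[string]byte{}}
	for i := 0; i < 3; i++ {
		w.writeString("fieldname")
	}
	fmt.Println(len(w.out)) // 13 bytes, vs 27+ if written in full each time
}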
To see if it's a msgpack-specific issue, can you please also try with binc (for comparison's sake)? It would be nice to see the size of the encoded message in binc/msgpack/protobuf, as well as encoding/decoding times.
Binc and Msgpack share a lot of the underlying encoding/decoding logic, with specialized interface implementations where they differ. If Binc is faster, then we know the issue is mostly due to the msgpack encoding format. Also, protobuf/gob use "package unsafe" to gain performance advantages, so this could have a bearing also (see the sketch after the links below).
See
https://code.google.com/p/go/issues/detail?id=5159
https://code.google.com/p/goprotobuf/source/browse/proto/pointer_unsafe.go
https://code.google.com/p/goprotobuf/source/browse/proto/pointer_reflect.go
(To build msgpack without unsafe, add the build tag appengine, i.e. go build/install -tags appengine ... )
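For context, here is a self-contained sketch of the reflect-vs-unsafe tradeoff that those two goprotobuf files embody; this is illustrative code, not goprotobuf's:

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

type rec struct{ N int64 }

func main() {
	r := rec{N: 42}

	// Reflection: safe and generic, but with dynamic checks on every access.
	v := reflect.ValueOf(&r).Elem().FieldByName("N").Int()

	// package unsafe: read the field directly at its memory offset.
	off := unsafe.Offsetof(r.N)
	u := *(*int64)(unsafe.Pointer(uintptr(unsafe.Pointer(&r)) + off))

	fmt.Println(v, u) // 42 42
}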
@timmy21 can you try again with change: c61e5837a8566751ab31434604dba9f23b126ce9
I added some code that may help your use case. Basically, if we see the same type more than once, we grab a specific function that handles its en/decoding, bypassing a bunch of checks we previously had to do each time.
Since you are encoding a lot of values, this may end up helping somewhat.
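A minimal sketch of that idea; this is illustrative, not the library's actual internals, and the decodeFn signature is an assumption:

package codecsketch

import (
	"reflect"
	"sync"
)

// decodeFn stands in for a type-specific decode step; the real
// library's function signature differs.
type decodeFn func(v reflect.Value)

var (
	mu    sync.Mutex
	cache = map[reflect.Type]decodeFn{}
)

// fnFor builds the per-type function once, so subsequent values of the
// same type skip the generic reflection checks entirely.
func fnFor(t reflect.Type, build func(reflect.Type) decodeFn) decodeFn {
	mu.Lock()
	defer mu.Unlock()
	if fn, ok := cache[t]; ok {
		return fn
	}
	fn := build(t) // resolve kind, fields, tags, etc. exactly once
	cache[t] = fn
	return fn
}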
Also, can you please try the things I asked for in my previous message:
Thanks. This will help me get to a resolution on this issue faster. I hate having open issues - they mess with my zen ;)
@timmy21 please try with the latest changes at 265e3b5, for both binc and msgpack, and relate your findings, especially contrasting with protocol buffers, and including encoding/decoding time and encoded size.
Thanks.
I'll close this issue now for lack of follow-up. Please respond with the information requested, and we can consider re-opening it if there is anything to fix on our end.
I have a file with a batch of records; each record is a map with string keys and string/int/float values. I then decode this file with the code below:
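(The original Go snippet was not preserved here; the following is a hypothetical reconstruction of such a decode loop using the github.com/ugorji/go/codec API, with the filename and error handling assumed.)

package main

import (
	"log"
	"os"

	"github.com/ugorji/go/codec"
)

type TRecord map[string]interface{}

func main() {
	f, err := os.Open("records.msgpack") // illustrative filename
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var mh codec.MsgpackHandle
	dec := codec.NewDecoder(f, &mh)
	for {
		var rec TRecord
		if err := dec.Decode(&rec); err != nil {
			break // io.EOF once the stream is exhausted
		}
		// ... process rec ...
	}
}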
This file contains 874589 records. Is there anything wrong? I am confused.