Closed. debabratadebroy closed this issue 4 years ago.
I would construct the builder with the anticipated size, so the internal buffer avoids resizing a number of times. log2(8,000,000) is ~23, so you are copying large portions of data ~23 times unnecessarily.
Also, 8,000,000 * 20 * (8 + 8) bytes ~= 2.4 GB of data; that is a big contiguous array! I would reduce your test size and use a more precise timer to do the comparisons.
You can ignore the 8M polygons case. For me, 1M and 3M are what matter.
Can you try updating the builder default allocation and post the comparison? Even for 1 and 3 mil, that is 20+ doublings.
I have tried it. Do you suggest any default value?
Actually my question is that for 3M polygons with 5 vertices it's taking 1.5 s. I am using 16 cores / 32 GB RAM. Protobuf takes about half that for the same volume.
Your rawSize variable is close to the default value I would suggest. Just make sure it is in number of bytes.
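For illustration, a rough way to compute that byte count up front and pre-size the builder. The per-vertex overhead constant here is an assumption for sketching, not an exact figure; calibrate it against len(b.FinishedBytes()) from a small run:

```go
package bench

import flatbuffers "github.com/google/flatbuffers/go"

// newPreSizedBuilder pre-sizes the builder so the internal buffer rarely (or
// never) has to double and copy itself while packing.
func newPreSizedBuilder(numPolygons, verticesPerPolygon int) *flatbuffers.Builder {
	// Two float64 fields per vertex, plus an assumed ~32 bytes for the table,
	// vtable and vector-offset overhead around each Vertices element.
	const approxBytesPerVertex = 2*8 + 32
	rawSizeInBytes := numPolygons * verticesPerPolygon * approxBytesPerVertex
	return flatbuffers.NewBuilder(rawSizeInBytes) // NewBuilder takes a size in bytes
}
```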
I tried with b := flatbuffers.NewBuilder(rawSize * 1024), but I don't see much improvement. It still comes in around 1.3 s for 3M polygons with 5 vertices.
Just to add one more point: if I do the same thing in Java, it finishes within 800 ms for 3M x 5 vertices.
See also my response to the same question here: https://stackoverflow.com/questions/60506304/flatbuffer-serialization-performance-is-slow-compared-to-protobuf/60512615
Interesting serialization comparison, I'll try it out tomorrow.
Hi @tsingson, are you able to reproduce the same behavior?
Yes, I reproduced this comparison. Results below: flatc takes roughly 4x the time of protoc.
Test data generation:
```go
package generate

import (
	"math/rand"
)

type GenerateData struct {
	X float64
	Y float64
}

func Generate(size int) []GenerateData {
	g := make([]GenerateData, 0)
	for i := 0; i < size; i++ {
		t := GenerateData{
			X: rand.Float64(),
			Y: rand.Float64(),
		}
		g = append(g, t)
	}
	return g
}
```
main.go
```go
package main

import (
	"fmt"
	"time"

	"github.com/google/flatbuffers/playground/flat"
	"github.com/google/flatbuffers/playground/generate"
	"github.com/google/flatbuffers/playground/proto"
)

func main() {
	StartedAt := time.Now()
	size := 1024 * 1024 * 8
	data := generate.Generate(size)
	DeElapseTime := time.Since(StartedAt).String()
	fmt.Println("size: ", size, " time: ", DeElapseTime)
	fmt.Println(" len of data: ", len(data))
	fmt.Println("--------------------------------------")

	f := FlatGenerate(data)
	StartedAt = time.Now()
	f.Marshal()
	DeElapseTime = time.Since(StartedAt).String()
	fmt.Println("size: ", size, " flatc time: ", DeElapseTime)
	fmt.Println(" len of data: ", len(data))
	fmt.Println("--------------------------------------")

	g := ProtoGenerate(data)
	StartedAt = time.Now()
	_, _ = g.Marshal()
	DeElapseTime = time.Since(StartedAt).String()
	fmt.Println("size: ", size, " protoc time: ", DeElapseTime)
	fmt.Println(" len of data: ", len(data))
	fmt.Println("--------------------------------------")
}

func FlatGenerate(in []generate.GenerateData) *flat.LayerT {
	vlist := make([]*flat.VerticesT, 0)
	for _, v := range in {
		vl := &flat.VerticesT{X: v.X, Y: v.Y}
		vlist = append(vlist, vl)
	}
	pl := &flat.PolygonT{}
	pl.Polygons = vlist
	ly := &flat.LayerT{
		Polygon: []*flat.PolygonT{pl},
	}
	return ly
}

func ProtoGenerate(in []generate.GenerateData) *proto.Layer {
	vlist := make([]*proto.Vertices, 0)
	for _, v := range in {
		vl := &proto.Vertices{X: v.X, Y: v.Y}
		vlist = append(vlist, vl)
	}
	pl := &proto.Polygon{}
	pl.Polygons = vlist
	ly := &proto.Layer{
		Polygon: []*proto.Polygon{pl},
	}
	return ly
}
```
The f.Marshal() call is a shortcut defined in the flatc-generated Go file layer.go, like this:
```go
func (rcv *LayerT) Marshal() []byte {
	b := flatbuffers.NewBuilder(0)
	b.Finish(rcv.Pack(b))
	return b.FinishedBytes()
}
```
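One hedged alternative to that shortcut: keep a caller-owned builder (pre-sized and/or reused via Reset) instead of starting from flatbuffers.NewBuilder(0) on every call. marshalWith is a hypothetical helper, not part of the generated code, and the import path follows the playground layout above:

```go
package bench

import (
	flatbuffers "github.com/google/flatbuffers/go"

	"github.com/google/flatbuffers/playground/flat"
)

// marshalWith reuses a caller-owned builder, so a buffer that has already
// grown to full size stays "warm" across serializations instead of being
// re-grown from zero on every Marshal call.
func marshalWith(rcv *flat.LayerT, b *flatbuffers.Builder) []byte {
	b.Reset() // keeps the underlying byte slice, only rewinds the write position
	b.Finish(rcv.Pack(b))
	return b.FinishedBytes()
}
```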
proto IDL
```proto
syntax = "proto3";
package proto;

message Vertices {
  double x = 1;
  double y = 2;
}

message Polygon {
  repeated Vertices polygons = 1;
}

message Layer {
  repeated Polygon polygon = 1;
}
```
flatc IDL
```
// Generated from simple.proto
namespace flat;

table Vertices {
  x:double;
  y:double;
}

table Polygon {
  polygons:[Vertices];
}

table Layer {
  polygon:[Polygon];
}
```
result:
size: 8388608 time: 592.923046ms
len of data: 8388608
--------------------------------------
size: 8388608 flatc time: 1.044466263s
len of data: 8388608
--------------------------------------
size: 8388608 protoc time: 249.879179ms
len of data: 8388608
--------------------------------------
I am currently writing a Go verifier for flatc, so I am also considering using goroutines or other techniques to improve flatc's encoding performance.
The original flatc encoder in Go uses a writetable() function and processes the encoding sequentially in a loop. I split the vector / struct / scalar encoding into separate functions (that works well for me, and may allow parallel processing in a later round), and only rewrite the header / root table once every part of the root table has been generated.
I'm learning from the C++ / C implementations and actively trying things out in the Go way.
CPU profile (ignore main.FlatGenerate; that function only prepares the test data):
Flat | Flat% | Sum% | Cum | Cum% | Name | Inlined?
-- | -- | -- | -- | -- | -- | --
2.50s | 26.37% | 26.37% | 2.50s | 26.37% | runtime.memmove | =============> MARK 1
0.86s | 9.07% | 35.44% | 4.50s | 47.47% | github.com/google/flatbuffers/benchmark/flat.(*VerticesT).Pack |
0.50s | 5.27% | 40.72% | 0.50s | 5.27% | runtime.markBits.isMarked | (inline)
0.39s | 4.11% | 44.83% | 1.63s | 17.19% | runtime.scanobject |
0.38s | 4.01% | 48.84% | 0.78s | 8.23% | math/rand.(*lockedSource).Int63 |
0.37s | 3.90% | 52.74% | 0.37s | 3.90% | sync.(*Mutex).Unlock | (inline)
0.37s | 3.90% | 56.65% | 0.37s | 3.90% | runtime.memclrNoHeapPointers |
0.33s | 3.48% | 60.13% | 2.05s | 21.62% | main.FlatGenerate |
0.24s | 2.53% | 62.66% | 0.34s | 3.59% | runtime.findObject |
0.22s | 2.32% | 64.98% | 0.22s | 2.32% | runtime.heapBits.bits |
0.21s | 2.22% | 67.19% | 0.44s | 4.64% | runtime.wbBufFlush1 |
0.21s | 2.22% | 69.41% | 0.23s | 2.43% | runtime.pageIndexOf | (inline)
0.21s | 2.22% | 71.62% | 0.21s | 2.22% | runtime.(*mspan).init | (inline)
0.21s | 2.22% | 73.84% | 0.21s | 2.22% | github.com/google/flatbuffers/go.WriteUint32 | (inline)
0.21s | 2.22% | 76.05% | 0.44s | 4.64% | github.com/google/flatbuffers/go.(*Builder).WriteVtable |
0.20s | 2.11% | 78.16% | 0.20s | 2.11% | runtime.nanotime1 |
0.19s | 2.00% | 80.17% | 0.19s | 2.00% | runtime.(*mspan).refillAllocCache |
0.19s | 2.00% | 82.17% | 2.89s | 30.49% | github.com/google/flatbuffers/go.(*Builder).Prep |
0.12s | 1.27% | 83.44% | 0.13s | 1.37% | github.com/google/flatbuffers/go.vtableEqual |
0.11s | 1.16% | 84.60% | 0.77s | 8.12% | runtime.greyobject |
0.11s | 1.16% | 85.76% | 0.12s | 1.27% | github.com/google/flatbuffers/go.(*Builder).StartObject |
0.09s | 0.95% | 86.71% | 0.09s | 0.95% | runtime.heapBitsSetType |
0.09s | 0.95% | 87.66% | 0.11s | 1.16% | github.com/google/flatbuffers/go.(*Builder).Pad |
0.08s | 0.84% | 88.50% | 0.09s | 0.95% | runtime.spanOf | (inline)
0.08s | 0.84% | 89.35% | 2.99s | 31.54% | github.com/google/flatbuffers/go.(*Builder).PrependFloat64Slot |
0.07s | 0.74% | 90.08% | 0.91s | 9.60% | math/rand.Float64 |
0.07s | 0.74% | 90.82% | 2.90s | 30.59% | github.com/google/flatbuffers/go.(*Builder).PrependFloat64 |
0.06s | 0.63% | 91.46% | 0.84s | 8.86% | math/rand.(*Rand).Float64 | (inline)
0.06s | 0.63% | 92.09% | 1.58s | 16.67% | github.com/google/flatbuffers/benchmark/generate.Generate |
0.05s | 0.53% | 92.62% | 2.88s | 30.38% | github.com/google/flatbuffers/benchmark/flat.VerticesAddX | (inline)
0.04s | 0.42% | 93.04% | 0.06s | 0.63% | runtime.stkbucket |
0.04s | 0.42% | 93.46% | 4.84s | 51.05% | github.com/google/flatbuffers/benchmark/flat.(*PolygonT).Pack |
0.03s | 0.32% | 93.78% | 0.28s | 2.95% | github.com/google/flatbuffers/go.(*Builder).PrependUOffsetT |
0.02s | 0.21% | 93.99% | 0.06s | 0.63% | runtime.sweepone |
0.02s | 0.21% | 94.20% | 0.46s | 4.85% | github.com/google/flatbuffers/benchmark/flat.VerticesEnd |
0.02s | 0.21% | 94.41% | 0.18s | 1.90% | github.com/google/flatbuffers/benchmark/flat.VerticesAddY | (inline)
0.01s | 0.11% | 94.51% | 0.64s | 6.75% | runtime.newobject |
0.01s | 0.11% | 94.62% | 1.72s | 18.14% | runtime.mallocgc |
0.01s | 0.11% | 94.73% | 0.96s | 10.13% | runtime.gcAssistAlloc1 |
0.01s | 0.11% | 94.83% | 0.48s | 5.06% | runtime.bulkBarrierPreWriteSrcOnly |
0.01s | 0.11% | 94.94% | 0.22s | 2.32% | github.com/google/flatbuffers/go.(*Builder).PlaceUOffsetT | (inline)
0 | 0.00% | 94.94% | 0.44s | 4.64% | runtime.wbBufFlush.func1 |
0 | 0.00% | 94.94% | 0.44s | 4.64% | runtime.wbBufFlush |
0 | 0.00% | 94.94% | 2.55s | 26.90% | runtime.systemstack |
0 | 0.00% | 94.94% | 0.21s | 2.22% | runtime.sysmon |
0 | 0.00% | 94.94% | 0.06s | 0.63% | runtime.profilealloc |
0 | 0.00% | 94.94% | 0.20s | 2.11% | runtime.nanotime | (inline)
0 | 0.00% | 94.94% | 0.21s | 2.22% | runtime.mstart1 |
0 | 0.00% | 94.94% | 0.25s | 2.64% | runtime.mstart |
0 | 0.00% | 94.94% | 0.19s | 2.00% | runtime.mallocgc.func1 |
0 | 0.00% | 94.94% | 0.39s | 4.11% | runtime.makeslice |
0 | 0.00% | 94.94% | 8.47s | 89.35% | runtime.main |
0 | 0.00% | 94.94% | 0.06s | 0.63% | runtime.mProf_Malloc |
0 | 0.00% | 94.94% | 0.19s | 2.00% | runtime.largeAlloc |
0 | 0.00% | 94.94% | 0.05s | 0.53% | runtime.heapBits.initSpan |
0 | 0.00% | 94.94% | 2.80s | 29.54% | runtime.growslice |
0 | 0.00% | 94.94% | 0.95s | 10.02% | runtime.gcDrainN |
0 | 0.00% | 94.94% | 0.70s | 7.38% | runtime.gcDrain |
0 | 0.00% | 94.94% | 0.70s | 7.38% | runtime.gcBgMarkWorker.func2 |
0 | 0.00% | 94.94% | 0.71s | 7.49% | runtime.gcBgMarkWorker |
0 | 0.00% | 94.94% | 0.96s | 10.13% | runtime.gcAssistAlloc.func1 |
0 | 0.00% | 94.94% | 0.96s | 10.13% | runtime.gcAssistAlloc |
0 | 0.00% | 94.94% | 0.12s | 1.27% | runtime.(*mspan).nextFreeIndex |
0 | 0.00% | 94.94% | 0.27s | 2.85% | runtime.(*mheap).allocSpan |
0 | 0.00% | 94.94% | 0.27s | 2.85% | runtime.(*mheap).alloc.func1 |
0 | 0.00% | 94.94% | 0.36s | 3.80% | runtime.(*mheap).alloc |
0 | 0.00% | 94.94% | 0.24s | 2.53% | runtime.(*mcentral).grow |
0 | 0.00% | 94.94% | 0.32s | 3.38% | runtime.(*mcentral).cacheSpan |
0 | 0.00% | 94.94% | 0.32s | 3.38% | runtime.(*mcache).refill |
0 | 0.00% | 94.94% | 0.44s | 4.64% | runtime.(*mcache).nextFree |
0 | 0.00% | 94.94% | 0.78s | 8.23% | math/rand.(*Rand).Int63 | (inline)
0 | 0.00% | 94.94% | 8.47s | 89.35% | main.main |
0 | 0.00% | 94.94% | 0.21s | 2.22% | github.com/google/flatbuffers/go.WriteUOffsetT | (inline)
0 | 0.00% | 94.94% | 2.59s | 27.32% | github.com/google/flatbuffers/go.(*Builder).growByteBuffer | (inline)
0 | 0.00% | 94.94% | 0.08s | 0.84% | github.com/google/flatbuffers/go.(*Builder).PrependSOffsetT |
0 | 0.00% | 94.94% | 0.44s | 4.64% | github.com/google/flatbuffers/go.(*Builder).EndObject | (inline)
0 | 0.00% | 94.94% | 0.12s | 1.27% | github.com/google/flatbuffers/benchmark/flat.VerticesStart | (inline)
0 | 0.00% | 94.94% | 4.84s | 51.05% | github.com/google/flatbuffers/benchmark/flat.(*LayerT).Pack |
Memory profile (again, ignore main.FlatGenerate; it only prepares the test data):
Flat | Flat% | Sum% | Cum | Cum% | Name | Inlined?
-- | -- | -- | -- | -- | -- | --
845.48MB | 62.05% | 62.05% | 845.48MB | 62.05% | main.FlatGenerate |
388.62MB | 28.52% | 90.57% | 388.62MB | 28.52% | github.com/google/flatbuffers/go.(*Builder).growByteBuffer | (inline) =====================> ******** MARK 2
128MB | 9.39% | 99.96% | 516.62MB | 37.91% | github.com/google/flatbuffers/benchmark/flat.(*PolygonT).Pack |
0 | 0.00% | 99.96% | 1362.09MB | 99.96% | runtime.main |
0 | 0.00% | 99.96% | 1362.09MB | 99.96% | main.main |
0 | 0.00% | 99.96% | 388.62MB | 28.52% | github.com/google/flatbuffers/go.(*Builder).PrependFloat64Slot |
0 | 0.00% | 99.96% | 388.62MB | 28.52% | github.com/google/flatbuffers/go.(*Builder).PrependFloat64 |
0 | 0.00% | 99.96% | 388.62MB | 28.52% | github.com/google/flatbuffers/go.(*Builder).Prep |
0 | 0.00% | 99.96% | 388.62MB | 28.52% | github.com/google/flatbuffers/benchmark/flat.VerticesAddX | (inline)
0 | 0.00% | 99.96% | 388.62MB | 28.52% | github.com/google/flatbuffers/benchmark/flat.(*VerticesT).Pack |
0 | 0.00% | 99.96% | 516.62MB | 37.91% | github.com/google/flatbuffers/benchmark/flat.(*LayerT).Pack |
flatc Go encoder (*Builder):
Pre-allocated memory: 388.62 MB
Actual memory used: 128 MB
```go
func (b *Builder) Prep(size, additionalBytes int) {
	// Track the biggest thing we've ever aligned to.
	if size > b.minalign {
		b.minalign = size
	}
	// Find the amount of alignment needed such that `size` is properly
	// aligned after `additionalBytes`:
	alignSize := (^(len(b.Bytes) - int(b.Head()) + additionalBytes)) + 1
	alignSize &= (size - 1)

	// Reallocate the buffer if needed:
	for int(b.head) <= alignSize+size+additionalBytes {
		oldBufSize := len(b.Bytes)
		b.growByteBuffer() // ==============================> ******* MARK 3
		b.head += UOffsetT(len(b.Bytes) - oldBufSize)
	}
	b.Pad(alignSize)
}

func (b *Builder) growByteBuffer() {
	if (int64(len(b.Bytes)) & int64(0xC0000000)) != 0 {
		panic("cannot grow buffer beyond 2 gigabytes")
	}
	newLen := len(b.Bytes) * 2 // ==========================> ********** MARK 4
	if newLen == 0 {
		newLen = 1
	}
	if cap(b.Bytes) >= newLen {
		b.Bytes = b.Bytes[:newLen]
	} else {
		extension := make([]byte, newLen-len(b.Bytes))
		b.Bytes = append(b.Bytes, extension...)
	}
	middle := newLen / 2
	copy(b.Bytes[middle:], b.Bytes[:middle]) // ===============> ******* MARK 5
}
```
Checking the CPU pprof: MARK 1 is the Go runtime.memmove internal function called by copy() at MARK 5; that is far too many copy() calls, which slows things down.
Checking the MEM pprof: MARK 2 shows that flatc grows its pre-allocated memory by doubling at MARK 4, copying the old binary data into the newly allocated memory each time.
So this looks like the problem that slows down the flatc Go encoder.
I have some ideas but no perfect solution yet, so any suggestion / solution is welcome.
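To put rough numbers on MARK 4 / MARK 5, here is an illustrative helper (an assumed name, not FlatBuffers API) that counts what doubling growth costs for a buffer that ends up in the hundreds-of-megabytes range seen in the memory profile:

```go
package bench

// bytesCopiedByDoubling is illustrative only: it counts how many
// growByteBuffer-style doublings are needed to reach finalSize bytes, and how
// many bytes the MARK 5 copy() moves along the way.
func bytesCopiedByDoubling(finalSize int) (grows, copied int) {
	for size := 1; size < finalSize; size *= 2 {
		grows++        // one growByteBuffer call
		copied += size // each grow copies the entire old buffer to the new back half
	}
	return grows, copied
}
```

Starting from a near-zero buffer, reaching a few hundred MB takes roughly 28-29 grow calls, and the copy() at MARK 5 moves on the order of the final buffer size in total, which lines up with runtime.memmove dominating the CPU profile at MARK 1. Pre-sizing (or Reset-reusing) the builder avoids almost all of that traffic.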
@rw, @somombo
Quick thoughts:
1) Have you tried re-using the FlatBufferBuilder using its Reset function? That will re-use the scratch space, which makes it "warm". Right now, you are using a builder that has no scratch space, so it must grow it each time.
2) Perhaps more importantly, I'll note that the code in this discussion provided by @tsingson uses the newer "object" FlatBuffers API (which is more ergonomic, but requires allocations). In my experience with golang, the many small allocations you are making, one for each of your "Vertices" type, probably puts a lot of memory pressure on golang's GC. Here is the code I'm referring to:
```go
vlist := make([]*flat.VerticesT, 0)
for _, v := range in {
	vl := &flat.VerticesT{X: v.X, Y: v.Y}
	vlist = append(vlist, vl)
}
```
My understanding is that this code allocates 8 * 1024 * 1024 VerticesT objects on the heap. It may be better to allocate them using the flat (traditional) FlatBuffers types used in the initial post: "Vertices", not "VerticesT", as in the sketch after this list.
3) Finally, given that @debabratadebroy is serializing three million Polygon objects in the same FlatBuffer, and those objects are Tables (in the FlatBuffers nomenclature), it's possible that our vtable deduplicator algorithm is causing you problems. Therefore, the next experiment for one of you to run is to see how long serialization takes with increasing numbers of Polygons being serialized: 1, 2, 4, 8, 16, ..., 8 * 1024 * 1024. If the serialization takes much more than linear time, then this is the culprit and we can suggest ways to work around it. (In Rust, we've had interest in providing a pluggable deduplicator implementation, and it might make sense to add that to Go, too).
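For reference, a rough sketch combining points 1) and 2): reuse one builder via Reset and build with the traditional (non-object) API so no intermediate VerticesT / PolygonT objects hit the heap. It assumes the usual flatc-generated helpers (flat.VerticesStart, flat.VerticesAddX, flat.PolygonStartPolygonsVector, and so on) and the import paths from the main.go above; the exact generated names may differ:

```go
package bench

import (
	flatbuffers "github.com/google/flatbuffers/go"

	"github.com/google/flatbuffers/playground/flat"
	"github.com/google/flatbuffers/playground/generate"
)

// flatEncode serializes one Layer with a single Polygon holding all vertices,
// using the traditional builder API and a reused, pre-grown builder.
func flatEncode(b *flatbuffers.Builder, in []generate.GenerateData) []byte {
	b.Reset()

	// Build every Vertices table first and remember its offset.
	offsets := make([]flatbuffers.UOffsetT, len(in))
	for i := len(in) - 1; i >= 0; i-- {
		flat.VerticesStart(b)
		flat.VerticesAddX(b, in[i].X)
		flat.VerticesAddY(b, in[i].Y)
		offsets[i] = flat.VerticesEnd(b)
	}

	// Vector of Vertices offsets (prepended in reverse to preserve order).
	flat.PolygonStartPolygonsVector(b, len(offsets))
	for i := len(offsets) - 1; i >= 0; i-- {
		b.PrependUOffsetT(offsets[i])
	}
	polygonsVec := b.EndVector(len(offsets))

	flat.PolygonStart(b)
	flat.PolygonAddPolygons(b, polygonsVec)
	poly := flat.PolygonEnd(b)

	// Layer with a one-element polygon vector.
	flat.LayerStartPolygonVector(b, 1)
	b.PrependUOffsetT(poly)
	polyVec := b.EndVector(1)

	flat.LayerStart(b)
	flat.LayerAddPolygon(b, polyVec)
	b.Finish(flat.LayerEnd(b))
	return b.FinishedBytes()
}
```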
I can do that for sure and share the details with you.
Just attaching a few analyses I already did for 1,000 to 3M polygons with different sets of vertices: MeasurementForgit.xlsx
@rw, thanks for your comments.
Quick thoughts:
- Have you tried re-using the FlatBufferBuilder using its Reset function? That will re-use the scratch space, which makes it "warm". Right now, you are using a builder that has no scratch space, so it must grow it each time.

------- Yes, it should be done like this.
- Perhaps more importantly, I'll note that the code in this discussion provided by @tsingson uses the newer "object" FlatBuffers API (which is more ergonomic, but requires allocations). In my experience with golang, the many small allocations you are making, one for each of your "Vertices" type, probably puts a lot of memory pressure on golang's GC. Here is the code I'm referring to:
```go
vlist := make([]*flat.VerticesT, 0)
for _, v := range in {
	vl := &flat.VerticesT{X: v.X, Y: v.Y}
	vlist = append(vlist, vl)
}
```
My understanding is that this code allocates 8 * 1024 * 1024 VerticesT objects on the heap. It may be better to allocate them using the flat (traditional) FlatBuffers types used in the initial post: "Vertices", not "VerticesT".
------- Good point. I should improve the test data generation function (e.g. read the test data from a file), and use the traditional "Vertices" rather than "VerticesT".
- Finally, given that @debabratadebroy is serializing three million Polygon objects in the same FlatBuffer, and those objects are Tables (in the FlatBuffers nomenclature), it's possible that our vtable deduplicator algorithm is causing you problems. Therefore, the next experiment for one of you to run is to see how long serialization takes with increasing numbers of Polygons being serialized: 1, 2, 4, 8, 16, ..., 8 * 1024 * 1024. If the serialization takes much more than linear time, then this is the culprit and we can suggest ways to work around it. (In Rust, we've had interest in providing a pluggable deduplicator implementation, and it might make sense to add that to Go, too).
------- I will run a more rigorous test, something like the sketch below.
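A minimal shape of that doubling experiment, for reference; it is only a sketch that reuses FlatGenerate and the generate package from the main.go earlier in this thread:

```go
// runScalingExperiment could be dropped into the main.go shown earlier; it
// reuses that file's FlatGenerate helper and the generate package. If the
// marshal time grows much faster than the input size, the vtable
// deduplication pass is the likely culprit.
func runScalingExperiment() {
	for size := 1; size <= 8*1024*1024; size *= 2 {
		data := generate.Generate(size)
		f := FlatGenerate(data)
		start := time.Now()
		f.Marshal()
		fmt.Printf("n=%-8d flat marshal: %v\n", size, time.Since(start))
	}
}
```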
More detail:
I used flatc --proto to convert the .proto to .fbs:
```proto
syntax = "proto3";
package proto;

message Vertices {
  double x = 1;
  double y = 2;
}

message Polygon {
  repeated Vertices polygons = 1;
}

message Layer {
  repeated Polygon polygon = 1;
}
```
The Vertices message from the proto IDL should be turned into a flatc struct, not a table.
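For example (a sketch only; the other tables stay as in the .fbs above), declaring Vertices as a struct keeps the two doubles inline in the vector, with no per-element table or vtable:

```
struct Vertices {
  x:double;
  y:double;
}

table Polygon {
  polygons:[Vertices];
}
```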
Just attaching a few analyses I already did for 1,000 to 3M polygons with different sets of vertices: MeasurementForgit.xlsx
Thanks for sharing.
I'm going to run a more thorough test and use pprof to be sure.
Can you share the numbers in plaintext, instead of a spreadsheet?
@debabratadebroy All of those times seem pretty fast, could you point out where the slowness is? Also, it would be more convenient if you included the text in a comment here instead of attaching it as a file.
If you look, in every case the flatbuffer serialization time is quite high compared to protobuf.
For 1,000,000 polygons with 5 vertices, proto takes 143.342215 ms for serialization and 250.809553 ms for deserialization, while flatbuffer takes 428.257128 ms. For 3,000,000 polygons with 10 vertices, proto takes 526.430084 ms and 737.654052 ms respectively, while flatbuffer takes 2635.236248 ms.
And more surprisingly, in Java I see better results with flatbuffer, but with Go the numbers are very high.
@debabratadebroy What I see in your text file is this:
Polygon size,Num of vertices,Seri Time prot,Deseri Time proto,Seri Time flat,Deseri Time flat
10000,5,1.810929,2.73897,4.840526, 630ns
10000,10,2.699907,2.865603,8.128324, 621ns
10000,20,3.178384,3.791683,13.805347, 499ns
100000,5,17.988453,29.931799,54.262156, 1.479µs
100000,10,20.953287,24.290156,89.144263, 731ns
100000,20,26.784921,35.707918,150.023134, 701ns
500000,5,85.069564,102.150938,205.527666, 759ns
500000,10,95.372696,125.155915,374.518318, 668ns
500000,20,142.959145,171.535879,699.386822, 788ns
1000000,5,143.342215,250.809553,428.257128, 731ns
1000000,10,175.240556,254.896566,713.370745, 720ns
1000000,20,280.443402,351.806425,1150, 762ns
3000000,5,443.883073,658.381558,1551.458035, 837ns
3000000,10,526.430084,737.654052,2635.236248, 813ns
3000000,20,724.353883,1026.814232,3518.614021, 957ns
5000000,5,714.807835,1274.749182,2014.221521, 1.006µs
5000000,10,979.767504,1512.634871,3623.323316, 970ns
5000000,20,1452.606876,1687.081408,,
Perhaps it's an encoding issue with your spreadsheet, but I see mostly nanosecond (ns) measurements. Is this correct?
Edit: I was confused by the header. Are the timing measurements in milliseconds, except for the last field in each row?
Yes, you are right. Sorry, I removed the milliseconds in order to plot a graph. The last field is the flatbuffer deserialization time, which is in nanoseconds; I don't have an issue with that.
This issue is stale because it has been open 6 months with no activity. Please comment or this will be closed in 14 days.
Is this still the case? Don't want to shoot myself in the foot by trying flatbuffers in go if protobuf is faster
Env: Golang
Version : go version go1.13.3 linux/amd64
With the following IDL files, my intention is to measure the serialization speed of FlatBuffers. I am using Go for my analysis.
Here is the code I have written for the measurement.