Closed: oridb closed this issue 4 months ago.
There is already an unboxed accessor addBuffer(float value), but I guess the problem is that we are storing repeated fields in a List<Float> internally, and that requires boxing regardless of what the API looks like.
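To make the boxing concrete, here is a hand-written sketch (not the actual generated code; the class and field names are only illustrative): even though the accessor signature takes a primitive float, storing into a List<Float> allocates one wrapper object per element.

```java
// Illustrative sketch only; real generated builders differ in detail.
import java.util.ArrayList;
import java.util.List;

class FrameBuilderSketch {
    // Internal storage of the repeated field as boxed values.
    private final List<Float> buffer_ = new ArrayList<>();

    // The signature takes a primitive float...
    public FrameBuilderSketch addBuffer(float value) {
        buffer_.add(value); // ...but autoboxing creates a Float object here anyway
        return this;
    }
}
```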
Yes, I'm already using that API.
As you mentioned, it still boxes values, and it's a bit clunky compared to adding everything in one go. It's not a deal breaker, but improving performance on large primitive buffers would be nice, and a simple API for it would be a good bonus.
I have the same issue.
We are returning a message containing a repeated int32 field. Protobuf isn't using primitives, and the boxing of each element generates a lot of GC pressure. It starts to have a noticeable impact once we reach a large number of requests per second.
Do we have a hotfix or an alternative?
+1
+1
Hi all,
I have the same problem and ended up turning down adoption because of this issue. I did all the development to migrate from an RMI-style communication to gRPC, and the memory usage of my service jumped from ~20 GB to ~60 GB. double[] arrays are much more compact than List<Double>.
+1
I see you did a lot of work integrating collections of unboxed primitives into generated protobuf messages. That is great! But Builders are still using boxed types, so I am +1 for this.
We hit a performance issue when receiving large responses from big-data services, such as a ranking service that outputs a large number of scores (stored as a List of boxed values). Boxed primitives in collections cause high memory and CPU overhead.
We really should be able to do this with primitive arrays internally, whatever the API looks like. Making a note to myself to investigate further. Does this only happen with floating-point values, or also with integers?
This becomes a problem for Java GC, especially for a ranker NN that requires float and int vectors. At the very minimum, it's a huge waste to take in a primitive array of floats, box each one into a Float list, and then serialize that. For ranking this translates into 50k or more Float objects for the GC to collect. At high QPS this becomes visible enough for us to consider a custom protocol instead of a proto message (see org.tensorflow.example.Features, org.tensorflow.example.FloatList, Int64List, etc.). We really need FloatList and Int64List wrappers that take (primitive array, start index, length) and avoid boxing into objects that are thrown away immediately after the proto is built.
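As an illustration of the current path versus the requested one, here is a hedged sketch using the org.tensorflow.example.FloatList message mentioned above; the bulk overload at the end is the requested API and does not exist today.

```java
import org.tensorflow.example.FloatList;

public class FeatureEncoding {
    // Today: one boxed Float per score, so ~50k short-lived objects per ranking call.
    static FloatList encodeToday(float[] scores, int start, int length) {
        FloatList.Builder builder = FloatList.newBuilder();
        for (int i = start; i < start + length; i++) {
            builder.addValue(scores[i]); // autoboxes into the builder's internal list
        }
        return builder.build();
    }

    // Requested (hypothetical overload, shown commented out): read straight from the
    // primitive array with no per-element boxing.
    // static FloatList encodeDesired(float[] scores, int start, int length) {
    //     return FloatList.newBuilder()
    //             .addAllValue(scores, start, length)
    //             .build();
    // }
}
```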
This would require a Tier 2, possibly Tier 1, review.
The big question here, as with all optimizations, is: does it matter? Can any significant impact be shown that justifies the effort of improving this? Based on the above comments, I think the answer is yes. However, this is going to take non-trivial effort and convincing multiple folks, so if anyone can provide a more complete write-up of the measurements they've done and how much this is costing their app, that would be very helpful.
Hey, @elharo! I would like to help. What kind of project do you think would be helpful? Would JMH benchmarks and VisualVM profiles help?
For instance:
message ValuesBytes {
  bytes values = 1;
}

message ValuesRepeatedFloats {
  repeated float values = 1;
}
getValuesAsBytes: works with a ByteString, converts it to a primitive float array, and then accesses each value.
getValuesAsRepeatedFloats: works with repeated float and uses getValues to access each value.
Benchmark Mode Cnt Score Error Units
FloatBenchmark.getValuesAsBytes avgt 5 107.403 ± 3.705 ms/op
FloatBenchmark.getValuesAsBytes:·gc.alloc.rate avgt 5 5.108 ± 0.157 MB/sec
FloatBenchmark.getValuesAsBytes:·gc.alloc.rate.norm avgt 5 843776.480 ± 9736.395 B/op
FloatBenchmark.getValuesAsBytes:·gc.count avgt 5 ≈ 0 counts
FloatBenchmark.getValuesAsRepeatedFloats avgt 5 113.880 ± 23.573 ms/op
FloatBenchmark.getValuesAsRepeatedFloats:·gc.alloc.rate avgt 5 10.048 ± 2.066 MB/sec
FloatBenchmark.getValuesAsRepeatedFloats:·gc.alloc.rate.norm avgt 5 1759073.476 ± 9518.489 B/op
FloatBenchmark.getValuesAsRepeatedFloats:·gc.churn.G1_Eden_Space avgt 5 11.134 ± 95.865 MB/sec
FloatBenchmark.getValuesAsRepeatedFloats:·gc.churn.G1_Eden_Space.norm avgt 5 2050548.622 ± 17655842.045 B/op
FloatBenchmark.getValuesAsRepeatedFloats:·gc.churn.G1_Survivor_Space avgt 5 1.012 ± 8.715 MB/sec
FloatBenchmark.getValuesAsRepeatedFloats:·gc.churn.G1_Survivor_Space.norm avgt 5 186413.511 ± 1605076.550 B/op
FloatBenchmark.getValuesAsRepeatedFloats:·gc.count avgt 5 1.000 counts
FloatBenchmark.getValuesAsRepeatedFloats:·gc.time avgt 5 11.000 ms
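For reference, a minimal JMH sketch of how such a comparison could be set up (this is not the original benchmark code; it assumes Java classes generated from the two messages above and simply sums the values in each benchmark):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import com.google.protobuf.ByteString;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class FloatBenchmark {
    private static final int N = 1_000_000;

    private ValuesBytes bytesMsg;
    private ValuesRepeatedFloats floatsMsg;

    @Setup
    public void setup() {
        float[] data = new float[N];
        for (int i = 0; i < N; i++) {
            data[i] = i * 0.5f;
        }

        // Pack the floats into a single bytes field.
        ByteBuffer buf = ByteBuffer.allocate(N * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        buf.asFloatBuffer().put(data);
        bytesMsg = ValuesBytes.newBuilder()
                .setValues(ByteString.copyFrom(buf.array()))
                .build();

        // Use the ordinary repeated float field (boxes each value on the way in).
        ValuesRepeatedFloats.Builder b = ValuesRepeatedFloats.newBuilder();
        for (float f : data) {
            b.addValues(f);
        }
        floatsMsg = b.build();
    }

    @Benchmark
    public double getValuesAsBytes() {
        // Decode the bytes field into a primitive float[] and then access each value.
        ByteBuffer buf = bytesMsg.getValues().asReadOnlyByteBuffer().order(ByteOrder.LITTLE_ENDIAN);
        float[] decoded = new float[buf.remaining() / Float.BYTES];
        buf.asFloatBuffer().get(decoded);
        double sum = 0;
        for (float f : decoded) {
            sum += f;
        }
        return sum;
    }

    @Benchmark
    public double getValuesAsRepeatedFloats() {
        // Access each element of the repeated float field.
        double sum = 0;
        for (int i = 0; i < floatsMsg.getValuesCount(); i++) {
            sum += floatsMsg.getValues(i);
        }
        return sum;
    }
}
```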
Where I'd like to start is with a compelling real-world application that shows this as a bottleneck, not simply a microbenchmark. Large arrays of primitives aren't that common outside of scientific and engineering applications, so it's possible this just doesn't come up that often. However, if we could find an existing large app that is burning significant CPU on this, that might be convincing. The ideal scenario would be a large GCP customer that could quantify how much of their monthly spend is going to this, or an Android app that could show NNN milliseconds of app startup time is spent here.
The real-world application is Maps ranking: high traffic, not a small application. I can't disclose actual numbers; however, the boxing/unboxing of features from native floating-point arrays into Lists of objects is a clear overhead, and it creates work for the GC; the more often that runs, the slower the overall system. For one TFRanking call, the input is on the order of 200 features by 30 to 60 documents; give or take, the input is 10,000 floats on each call.
It has been years and I am no longer working on this project, but if we chose protobuf over REST at the time, it was because of speed and performance. The garbage collector definitely had a big impact on performance in our case and was the bottleneck to greater throughput.
We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.
This issue is labeled inactive because the last activity was over 90 days ago.
We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please reopen it.
This issue was closed and archived because there has been no new activity in the 14 days since the inactive label was added.
I would like to have an API for repeated fields of primitive types, so that they can be set efficiently. The reason I need this is that I am transferring a large array of floats (representing the pixels of a thermal image), and I would be happier if there was no need to box each float before adding it to the protobuf.
Ideally, protobuf would just provide a function addAllBuffer(float[] data) in addition to the current iterator-based APIs. Currently, all of the APIs for adding an array involve boxing the data at some point or another. The system setup is a little bit odd, since the gRPC server is actually running on an Android device, but I think this request applies in general.
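As a sketch of what this could look like (the ThermalImage message name is purely illustrative and not from the original report, and the bulk overload shown commented out does not exist today):

```java
import java.util.ArrayList;
import java.util.List;

public class BufferExample {
    public static void main(String[] args) {
        float[] pixels = new float[640 * 480]; // raw thermal image data

        // Today: every float is boxed before it reaches the builder.
        List<Float> boxed = new ArrayList<>(pixels.length);
        for (float p : pixels) {
            boxed.add(p); // autoboxing: one Float object per pixel
        }
        ThermalImage viaBoxing = ThermalImage.newBuilder()
                .addAllBuffer(boxed)
                .build();

        // Requested: a bulk, boxing-free overload on the generated builder.
        // ThermalImage direct = ThermalImage.newBuilder()
        //         .addAllBuffer(pixels)   // hypothetical addAllBuffer(float[] data)
        //         .build();
    }
}
```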
Should this be an issue in the gRPC issue tracker?
This is a new feature request. Also, I was asked to file an issue by @ejona86.
What version of gRPC and what language are you using?
I am using Java. The desired API is not present in all versions, as far as I am aware.
What operating system (Linux, Windows, …) and version?
Android (Linux), targeting Android 6.
Yes, I would also like to switch to a newer version.
What runtime / compiler are you using (e.g. python version or version of gcc)
Android Studio, with current versions of the toolchain.
What did you expect to see?
An efficient, boxing-free, method of adding a large array of primitives to a protobuf.
What did you see instead?
http://cdn3.meme.am/cache/instances/folder113/500x/63531113/x-x-everywhere-boxing-boxing-everywhere.jpg