protocolbuffers / protobuf

Protocol Buffers - Google's data interchange format
http://protobuf.dev

There is no Java API for adding arrays of primitives to repeated fields without boxing #3316

Closed: oridb closed this issue 4 months ago

oridb commented 7 years ago

I would like to have an API for primitive types of repeated fields, in order to efficiently set them. The reason I need this is that I am transferring a large array of floats (representing the pixels of a thermal image), and would be happier if there was no need to box each float before adding it to the protobuf.

Ideally, the protobuf would just provide a function addAllBuffer(float[] data) in addition to the current iterator based APIs. Currently, all of the APIs for adding an array involve boxing the data at some point or another.
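
For illustration, here is a minimal sketch of the gap. It assumes a hypothetical message ThermalImage with a field "repeated float buffer = 1;" and the standard generated-builder naming (addBuffer / addAllBuffer); the primitive-array overload at the end is the requested API, not an existing one.

import java.util.ArrayList;
import java.util.List;

// Hypothetical message, for illustration only:
//   message ThermalImage { repeated float buffer = 1; }
class BoxingSketch {

  // Per-element add: addBuffer(float) takes a primitive, but the value still ends up
  // boxed when the builder stores it in its internal List<Float>.
  static ThermalImage perElement(float[] frame) {
    ThermalImage.Builder builder = ThermalImage.newBuilder();
    for (float v : frame) {
      builder.addBuffer(v);
    }
    return builder.build();
  }

  // Bulk add: addAllBuffer(Iterable<? extends Float>) requires the caller to box
  // every value up front just to build the Iterable.
  static ThermalImage bulk(float[] frame) {
    List<Float> boxed = new ArrayList<>(frame.length);
    for (float v : frame) {
      boxed.add(v); // Float.valueOf(v) for every element
    }
    return ThermalImage.newBuilder().addAllBuffer(boxed).build();
  }

  // Requested (does not exist today): an overload that takes the primitive array
  // directly, e.g. ThermalImage.newBuilder().addAllBuffer(frame).build();
}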

The system setup is a little unusual, since the gRPC server is actually running on an Android device, but I think this request applies in general.

Should this be an issue in the gRPC issue tracker?

This is a new feature request. Also, I was asked to file an issue by @ejona86.

What version of gRPC and what language are you using?

I am using Java. As far as I am aware, the desired API is not present in any version.

What operating system (Linux, Windows, …) and version?

Android (Linux), targeting Android 6.

Yes, I would also like to switch to a newer version.

What runtime / compiler are you using (e.g. Python version or version of gcc)?

Android Studio, with current versions of the toolchain.

What did you expect to see?

An efficient, boxing-free method of adding a large array of primitives to a protobuf.

What did you see instead?

http://cdn3.meme.am/cache/instances/folder113/500x/63531113/x-x-everywhere-boxing-boxing-everywhere.jpg

xfxyjwf commented 7 years ago

There is already an unboxed accessor, addBuffer(float value), but I guess the problem is that we are storing repeated fields in a List<Float> internally, and that requires boxing regardless of how the API looks.

oridb commented 7 years ago

Yes, I'm already using that API.

As you mentioned, it still boxes values, and it's a bit clunky compared to adding everything in one go. It's not a deal breaker, but improving performance on large primitive buffers would be nice, and a simple API for it would be a good bonus.

noguespi commented 7 years ago

I have the same issue.

We are returning a message with a "repeated int32" field. Protobuf doesn't use primitives internally, and boxing each element generates a lot of GC pressure. It starts to have a noticeable impact once we reach a large number of requests per second.

Is there a hotfix or alternative?
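
For illustration, one workaround discussed later in this thread (the bytes-based benchmark below) is to pack the integers into a single bytes field and decode them on the other side. This is a hedged sketch, assuming a hypothetical message ScoresPacked with "bytes values = 1;" and little-endian encoding:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import com.google.protobuf.ByteString;

// Hypothetical message, for illustration only:
//   message ScoresPacked { bytes values = 1; }
class PackedInt32Workaround {

  // Sender: encode an int[] into the bytes field without boxing any element.
  static ScoresPacked pack(int[] scores) {
    ByteBuffer buf = ByteBuffer.allocate(scores.length * Integer.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.asIntBuffer().put(scores);
    return ScoresPacked.newBuilder().setValues(ByteString.copyFrom(buf)).build();
  }

  // Receiver: decode the bytes field back into a primitive int[].
  static int[] unpack(ScoresPacked msg) {
    ByteBuffer buf = msg.getValues().asReadOnlyByteBuffer().order(ByteOrder.LITTLE_ENDIAN);
    int[] scores = new int[buf.remaining() / Integer.BYTES];
    buf.asIntBuffer().get(scores);
    return scores;
  }
}

The trade-off is that the field is no longer self-describing: element width and endianness become an out-of-band contract between sender and receiver.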

bolerio commented 6 years ago

+1

paolieri commented 6 years ago

+1

rbroggi commented 6 years ago

Hi all, I have the same problem and ended up turning down the adoption because of this issue. I had made all the changes to migrate from RMI-style communication to gRPC, and the memory usage of my service jumped from ~20 GB to ~60 GB. A double[] is much more compact than a List<Double>. I'm a big +1 here... I am a big fan of protobuf; hopefully we can achieve this.

smovva commented 6 years ago

+1

trams commented 5 years ago

I see you have done a lot of work on integrating collections of unboxed primitives into generated protobuf messages. That is great! But Builders are still using boxed types, so I am +1 for this.

hitsmaxft commented 4 years ago

We hit performance issues when receiving large responses from big-data services, such as a ranking service that outputs a large number of scores (stored as a List) over the protobuf protocol.

Boxed primitives in collections cause high memory and CPU overhead.

elharo commented 3 years ago

We really should be able to do this with primitive arrays internally whatever the API looks like. Making a note to myself to investigate further. Does this only happen with floating point values or also with integers?

vitalyli commented 3 years ago

This becomes a problem for the Java GC, especially for a ranking NN that requires float and int vectors. At the very minimum it's a huge waste to take a primitive array of floats, box each one into a List of Float, and then serialize that. For ranking, this translates into 50k or more Float objects for the GC to collect, and at high QPS this becomes visible enough for us to consider a custom protocol instead of a proto message. We use org.tensorflow.example.Features, org.tensorflow.example.FloatList, Int64List, etc. We really need FloatList and Int64List wrappers that take (primitive array, start index, length) and avoid boxing into objects that are thrown away immediately after the proto is built.
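
For illustration, a sketch of the pattern described above, assuming the standard generated Java classes for TensorFlow's feature.proto (where FloatList is "repeated float value = 1;"); the array-slice overload in the final comment is the requested API, not an existing one.

import org.tensorflow.example.Feature;
import org.tensorflow.example.Features;
import org.tensorflow.example.FloatList;

class RankingFeatureSketch {

  // Today: every float in the scoring vector is boxed on its way into the FloatList.
  static Features buildFeatures(String name, float[] scores) {
    FloatList.Builder values = FloatList.newBuilder();
    for (float s : scores) {
      values.addValue(s); // primitive parameter, boxed internally
    }
    return Features.newBuilder()
        .putFeature(name, Feature.newBuilder().setFloatList(values).build())
        .build();
  }

  // Requested (does not exist): a bulk overload over a primitive array slice, e.g.
  //   FloatList.newBuilder().addAllValue(scores, 0, scores.length);
  // which would avoid roughly 10,000 Float allocations per ranking call in the
  // scenario described above.
}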

elharo commented 2 years ago

This would require a Tier 2, possibly Tier 1, review.

The big question here, as with all optimizations, is: does it matter? Can any significant impact be shown that justifies the effort of improving this? Based on the comments above, I think the answer is yes. However, this is going to take non-trivial effort and convincing multiple folks, so if anyone can provide a more complete write-up of the measurements they've done and how much this is costing their app, that would be very helpful.

wnleao commented 2 years ago

Hey, @elharo! I would like to help. What kind of benchmark project do you think would be helpful? Would JMH benchmarks and VisualVM profiles help?

For instance:

message ValuesBytes {
    bytes values = 1;
}

message ValuesRepeatedFloats {
    repeated float values = 1;
}

getValuesAsBytes: works with ByteString, converts it to a primitive float array and then accesses each value.

getValuesAsRepeatedFloats: works with the repeated float field and uses getValues(i) to access each value.
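
For illustration, a sketch of how such a JMH benchmark might be structured against the two messages above. The element count, setup, and little-endian packing of the bytes field are assumptions; the accessors follow standard protobuf-generated naming.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class FloatBenchmark {
  private static final int N = 100_000; // illustrative element count

  private ValuesBytes asBytes;
  private ValuesRepeatedFloats asFloats;

  @Setup
  public void setup() {
    ByteBuffer packed = ByteBuffer.allocate(N * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    ValuesRepeatedFloats.Builder repeated = ValuesRepeatedFloats.newBuilder();
    for (int i = 0; i < N; i++) {
      float v = i * 0.5f;
      packed.putFloat(v);
      repeated.addValues(v);
    }
    packed.flip();
    asBytes = ValuesBytes.newBuilder()
        .setValues(com.google.protobuf.ByteString.copyFrom(packed))
        .build();
    asFloats = repeated.build();
  }

  @Benchmark
  public double getValuesAsBytes() {
    // Copy the bytes field into a primitive float[] once, then read it directly.
    float[] values = new float[asBytes.getValues().size() / Float.BYTES];
    asBytes.getValues().asReadOnlyByteBuffer()
        .order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer()
        .get(values);
    double sum = 0;
    for (float v : values) {
      sum += v;
    }
    return sum;
  }

  @Benchmark
  public double getValuesAsRepeatedFloats() {
    // Indexed access on the repeated field, element by element.
    double sum = 0;
    for (int i = 0, n = asFloats.getValuesCount(); i < n; i++) {
      sum += asFloats.getValues(i);
    }
    return sum;
  }
}

The reported measurements for these two paths were: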

Benchmark                                                                  Mode  Cnt        Score          Error   Units
FloatBenchmark.getValuesAsBytes                                            avgt    5      107.403 ±        3.705   ms/op
FloatBenchmark.getValuesAsBytes:·gc.alloc.rate                             avgt    5        5.108 ±        0.157  MB/sec
FloatBenchmark.getValuesAsBytes:·gc.alloc.rate.norm                        avgt    5   843776.480 ±     9736.395    B/op
FloatBenchmark.getValuesAsBytes:·gc.count                                  avgt    5          ≈ 0                 counts
FloatBenchmark.getValuesAsRepeatedFloats                                   avgt    5      113.880 ±       23.573   ms/op
FloatBenchmark.getValuesAsRepeatedFloats:·gc.alloc.rate                    avgt    5       10.048 ±        2.066  MB/sec
FloatBenchmark.getValuesAsRepeatedFloats:·gc.alloc.rate.norm               avgt    5  1759073.476 ±     9518.489    B/op
FloatBenchmark.getValuesAsRepeatedFloats:·gc.churn.G1_Eden_Space           avgt    5       11.134 ±       95.865  MB/sec
FloatBenchmark.getValuesAsRepeatedFloats:·gc.churn.G1_Eden_Space.norm      avgt    5  2050548.622 ± 17655842.045    B/op
FloatBenchmark.getValuesAsRepeatedFloats:·gc.churn.G1_Survivor_Space       avgt    5        1.012 ±        8.715  MB/sec
FloatBenchmark.getValuesAsRepeatedFloats:·gc.churn.G1_Survivor_Space.norm  avgt    5   186413.511 ±  1605076.550    B/op
FloatBenchmark.getValuesAsRepeatedFloats:·gc.count                         avgt    5        1.000                 counts
FloatBenchmark.getValuesAsRepeatedFloats:·gc.time                          avgt    5       11.000                     ms

elharo commented 2 years ago

Where I'd like to start is with a compelling real-world application that shows this as a bottleneck, not simply a microbenchmark. Large arrays of primitives aren't that common outside of scientific and engineering applications, so it's possible this just doesn't come up that often. However, if we could find an existing large app that is burning significant CPU on this, that might be convincing. The ideal scenario would be a large GCP customer that could quantify how much of their monthly spend is going to this, or an Android app that could show NNN milliseconds of app startup time is spent here.

vitalyli commented 2 years ago

The real-world application is Maps ranking: high traffic, not a small application. I can't disclose actual numbers, but boxing/unboxing features from native floating-point arrays into Lists of objects is a clear overhead, and it creates work for the GC; the more often the GC runs, the slower the overall system. For one TFRanking call, the input is on the order of 200 features by 30 to 60 documents, so give or take 10,000 floats per call.

noguespi commented 2 years ago

> Where I'd like to start is with a compelling real-world application that shows this as a bottleneck, not simply a microbenchmark. Large arrays of primitives aren't that common outside of scientific and engineering applications, so it's possible this just doesn't come up that often. However, if we could find an existing large app that is burning significant CPU on this, that might be convincing. The ideal scenario would be a large GCP customer that could quantify how much of their monthly spend is going to this, or an Android app that could show NNN milliseconds of app startup time is spent here.

See my previous comment.

It has been years and I'm no longer working on this project, but we chose protobuf over REST at the time because of speed and performance. The garbage collector definitely had a big impact on performance in our case and was the bottleneck to greater throughput.

github-actions[bot] commented 5 months ago

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago.

github-actions[bot] commented 4 months ago

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please reopen it.

This issue was closed and archived because there has been no new activity in the 14 days since the inactive label was added.