google-code-export / protobuf

Automatically exported from code.google.com/p/protobuf
Other
1 stars 1 forks source link

Java: CodedInputStream.readString() creates new String object #254

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,

I've had in my java program the use case that a protobuf file
contained about a million repeated messages which contain
besides others a string attribute (as a "foreign key").

As I ran constantly out of memory I noticed by profiling my
application that there were about half a million distinct
String objects with the same content.

Here is the problematic method:

  public String readString() throws IOException {
    final int size = readRawVarint32();
    if (size <= (bufferSize - bufferPos) && size > 0) {
      final String result = new String(buffer, bufferPos, size, "UTF-8");
      bufferPos += size;
      return result;
    } else {
      return new String(readRawBytes(size), "UTF-8");
    }
  }

After adding a ".intern()" to the string my memory problems went away.

Please check if it's advisable to fix this issue by doing so in your next 
release.

Greetings,
Stefan Oltmann

Original issue reported on code.google.com by stefan.o...@gmail.com on 8 Feb 2011 at 11:06

GoogleCodeExporter commented 9 years ago
This is use-case dependent.
Using intern isn't always a good thing.
See 
http://stackoverflow.com/questions/2431540/garbage-collection-behaviour-for-stri
ng-intern

Original comment by david.yu...@gmail.com on 8 Feb 2011 at 11:17

GoogleCodeExporter commented 9 years ago
Ah, I see. Okay, maybe there is an option to implement an "intern()" in the 
generated "Builder mergeFrom()"-method that I could set in the the .proto-File 
for a specific attribute like I can do that with "deprecated"?

Original comment by stefan.o...@gmail.com on 8 Feb 2011 at 4:24

GoogleCodeExporter commented 9 years ago
intern() only works well if you have many different strings with the same 
values.  In that case, you should probably consider reworking your protocol.  
Perhaps you should be using an enum instead?  Or maybe you should have a 
separate string list that only stores one copy of each value, and then in other 
places you should store indexes into that list?

I don't think we want to add an option for this because the use case is too 
obscure.  Options have a cost in terms of system complexity, maintainability, 
and learnability.

Original comment by kenton@google.com on 8 Feb 2011 at 9:33