Closed by GoogleCodeExporter 9 years ago
Since protobuf is a binary encoding, the representation of strings is an
implementation detail. The Google specification explicitly makes this UTF-8.
Can you please be specific about a scenario where UTF-8 is a problem? In
particular, I am not aware of a scenario in .NET that doesn't have UTF-8
available.
If you (for some reason) need to encode a protobuf as a string, use base-64; it
is incorrect to use an encoding to do this.
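A minimal sketch of the base-64 point, using only the .NET base class library. The three payload bytes are a made-up stand-in for serialized output (they happen to be field 1 holding the varint 150), and the class and method names are hypothetical. It shows that base-64 round-trips arbitrary bytes, while pushing the same bytes through a text encoding does not:

```csharp
using System;
using System.Linq;
using System.Text;

static class Base64Sketch
{
    // Hypothetical payload standing in for serialized protobuf bytes.
    public static readonly byte[] Payload = { 0x08, 0x96, 0x01 };

    // Base-64 maps any byte sequence to text and back without loss.
    public static bool RoundTripsViaBase64()
    {
        string text = Convert.ToBase64String(Payload);
        return Convert.FromBase64String(text).SequenceEqual(Payload);
    }

    // 0x96 is not valid UTF-8 on its own, so decoding substitutes U+FFFD
    // and re-encoding yields different bytes.
    public static bool RoundTripsViaUtf8()
    {
        string text = Encoding.UTF8.GetString(Payload);
        return Encoding.UTF8.GetBytes(text).SequenceEqual(Payload);
    }
}
```

With this payload, `RoundTripsViaBase64()` returns true and `RoundTripsViaUtf8()` returns false.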
Essentially, I do not understand the issue where you would want a different
encoding here. It applies to XmlSerializer because XML is a text-based
serializer. Protobuf is not. Encoding is unrelated.
Original comment by marc.gravell
on 16 Jun 2011 at 3:36
Case: I need to exchange data with non-.NET devices (for example, Java-based
POS terminals, C++-based card readers, etc.). Some devices do not support UTF-8,
only plain ASCII. I have a .NET service connected to several countries, such as
Russia, Malaysia, China, etc. In each country, devices use the national encoding
(e.g. to print receipts) and send me text fields in that national encoding as
well. I know the encoding of each message, so I would like to convert them to
.NET strings correctly using the Encoding.GetString() function. The most obvious
way to implement this is to pass a System.Text.Encoding parameter to the
serialize/deserialize methods (like XmlSerializer does).
A possible workaround is to declare all my strings as byte[] and decode/encode
manually. But this is a really ugly approach, and I have hundreds of classes!
Patching your library manually is not really practical either, as I would have
to do it each time the library is updated. So please make Encoding a parameter
instead of hardcoding it.
Thanks.
Original comment by Anton.Kr...@gmail.com
on 17 Jun 2011 at 11:28
I'm confused... if they only support ASCII, then: just use strings with ASCII
characters in them. That will be 100% identical when encoded via either ASCII
or UTF-8.
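A quick way to check the claim that pure-ASCII text is byte-for-byte identical under both encodings, sketched with the standard System.Text.Encoding classes (the helper name is hypothetical):

```csharp
using System;
using System.Linq;
using System.Text;

static class AsciiVsUtf8
{
    // True when the string produces the same bytes under ASCII and UTF-8.
    public static bool SameBytes(string s)
    {
        return Encoding.ASCII.GetBytes(s).SequenceEqual(Encoding.UTF8.GetBytes(s));
    }
}
```

`AsciiVsUtf8.SameBytes("RECEIPT 42")` returns true. For a Cyrillic string such as "\u0427\u0435\u043A" it returns false, because the ASCII encoder replaces each character with `?` while UTF-8 emits two bytes per character.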
You then say: "devices use national encoding" - but then; that isn't ASCII.
That sounds like code-page encoding (which is different).
The problem is; the protobuf wire format explicitly states UTF-8. Every
protobuf implementation will therefore *expect* UTF-8. If I give you something
using an arbitrary encoding, I'm unclear how you will decode that. What library
do you intend using to *decode* this? Does *that* support different encodings?
If it doesn't, then my encoding it ***won't help you***...
Please clarify.
Original comment by marc.gravell
on 17 Jun 2011 at 11:34
[deleted comment]
Right, by ASCII I meant "extended ASCII", i.e. single-byte national encodings,
not ASCII itself. That was confusing, but I am talking about national encodings.
Going further, they are not necessarily single-byte, as in China they might
use some other Unicode encoding (different from UTF-8).
In the protocol buffers wire format, a string is a varint-encoded length
followed by that many bytes, so the protocol itself does not care about string
encoding. Strings are just bytes. As for compatibility problems with
device-side libraries:
1. they might be custom (we develop them ourselves), or
2. device-side code might declare strings as bytes.
Anyway, on the server side I should have an option to specify the encoding used
to serialize/deserialize string fields in a compatible way.
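To make the framing concrete, here is a hand-rolled sketch of how one length-delimited field (wire type 2, used for both string and bytes fields) is laid out. This is an illustration only, not protobuf-net's implementation, and the class and method names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class WireSketch
{
    // Base-128 varint: 7 bits per byte, high bit set on every byte except the last.
    static void WriteVarint(List<byte> output, uint value)
    {
        while (value >= 0x80)
        {
            output.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        output.Add((byte)value);
    }

    // One length-delimited field: tag, varint length, then the raw payload bytes.
    public static byte[] EncodeLengthDelimited(int fieldNumber, byte[] payload)
    {
        var output = new List<byte>();
        WriteVarint(output, (uint)((fieldNumber << 3) | 2)); // tag = field number + wire type 2
        WriteVarint(output, (uint)payload.Length);
        output.AddRange(payload);
        return output.ToArray();
    }
}
```

`WireSketch.EncodeLengthDelimited(5, Encoding.UTF8.GetBytes("Hi"))` yields the bytes 2A 02 48 69. Nothing in the framing records which encoding produced the payload; the UTF-8 requirement applies to the contents of string fields, so a standard decoder on the other side will still interpret them as UTF-8.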
Original comment by Anton.Kr...@gmail.com
on 17 Jun 2011 at 12:33
I'm going to refer to
http://code.google.com/apis/protocolbuffers/docs/proto.html
"string | A string must always contain UTF-8 encoded or 7-bit ASCII text."
so I disagree that this is something that the server *should* try to support.
Doing so would actively violate the spec (I have no problem adding extra
features *in a spec-compatible* way, but this ... isn't).
Anything I do here would really require opt-in at a per-member level, otherwise
it really is asking for problems (and sadly, it is me that would then have to
deal with those problems when other people use it without realising the
inherent problems in what you suggest). Would you find it reasonable to do
so? i.e. indicate on a per-member basis that it can use the non-UTF8 encoder?
Then presumably we would specify the encoding in the Serialize (etc) call.
Original comment by marc.gravell
on 17 Jun 2011 at 12:56
I'm also changing this from "defect" to "enhancement"; adhering to the formal
spec is not a defect.
Original comment by marc.gravell
on 17 Jun 2011 at 12:57
Per-member indication is not a solution, because I have to specify the encoding
per request, and for all string fields. And I see no problem for other people
in adding an overload of the Serialize/Deserialize methods with an additional
Encoding parameter. The default would stay UTF-8, and only those who need this
'out of spec' option would change the parameter.
OK, it does me no harm to patch the code manually, as so far I have only met my
first non-UTF-8 device. In general you are right, this really is outside the
Google spec.
Original comment by Anton.Kr...@gmail.com
on 17 Jun 2011 at 1:57
Original comment by marc.gravell
on 25 Jun 2011 at 9:52
Yes, I have run into the same problems with Russian systems as well. I mean
problems with encodings, where a string is encoded not as ASCII or UTF-8 but as
CP1251.
Original comment by gabriel...@gabrielius.net
on 1 Jul 2014 at 5:27
There could be some helper methods/parameters to work around this issue, even
if that goes beyond the specification.
Original comment by gabriel...@gabrielius.net
on 1 Jul 2014 at 5:30
It is a feature of the protobuf wire specification that strings are encoded in
UTF-8. If you use a different encoding then it is not compliant protobuf data.
The important thing is surely that you get back the data you started with.
UTF-8 guarantees that.
What would be the purpose of this change?
Original comment by marc.gravell
on 1 Jul 2014 at 5:43
I completely agree with your statements.
The main purpose is that I am currently working with a remote Russian system
(a third-party system) that I am not in charge of, so I cannot change how the
messages are sent to me. However, I need to get the info and display it
correctly. That system encodes data in CP1251 and not in UTF-8. And the only
way I could get the strings represented nicely is to specify custom encoding,
e.g. codepage 1251. All the strings are encoded in CP1251.
I can attach the proto-object serialized into file and the description of the
proto-object and you could see yourself. Thanks.
Original comment by gabriel...@gabrielius.net
on 1 Jul 2014 at 6:10
The best thing I can recommend here is: instead of
    [ProtoMember(5)]
    public string Foo {get;set;}
you use:
    [ProtoMember(5)]
    public byte[] FooBinary {
        get { return Foo == null ? null : someEncoding.GetBytes(Foo); }
        set { Foo = value == null ? null : someEncoding.GetString(value); }
    }
    public string Foo {get;set;}
would that work? Or are there lots of strings involved?
My point is: if somebody was sending malformed xml, it is expected that xml
serializers reject it. If somebody sends malformed json, it is expected that
json serializers reject it. The only correct encoding for strings in the
protobuf specification is: utf-8.
Original comment by marc.gravell
on 1 Jul 2014 at 6:51
Instead of
    [ProtoMember(5)]
    public string Foo {get;set;}
I used
    [ProtoMember(5)]
    public byte[] Foo {get;set;}
and dealt with encoding outside and it worked, thanks. Can be a workaround
definitely.
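For illustration, handling the encoding outside the serializer could look roughly like the sketch below. It is not the code actually used here; the helper names are hypothetical, and it assumes .NET Core 3.0 or later / .NET 5+, where CodePagesEncodingProvider ships in the box and must be registered before code page 1251 can be resolved (on .NET Framework the RegisterProvider line can be dropped):

```csharp
using System;
using System.Text;

static class Cp1251Sketch
{
    static readonly Encoding Cp1251 = Create();

    static Encoding Create()
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+; this makes legacy code pages available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1251);
    }

    // Turns the raw bytes of a byte[] member into a .NET string.
    public static string Decode(byte[] raw)
    {
        return raw == null ? null : Cp1251.GetString(raw);
    }

    // Turns a .NET string back into CP1251 bytes before assigning the byte[] member.
    public static byte[] Encode(string text)
    {
        return text == null ? null : Cp1251.GetBytes(text);
    }
}
```

For example, `Cp1251Sketch.Decode(new byte[] { 0xD7, 0xE5, 0xEA })` returns the three-letter Cyrillic string "\u0427\u0435\u043A", and `Encode` maps it back to the same three bytes.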
There are lots of strings, though some use ASCII and others CP1251. I haven't
used all the proto-file objects yet, since the generated file is ~5500 lines
long, so it is hard to tell whether going field by field will suffice.
Anyway, let's say I have lots of strings in CP1251 and want to make a general
change in the protobuf-net code (which I fork), so that I don't go field by
field. Where should I start? StringSerializer.cs?
Original comment by gabriel...@gabrielius.net
on 2 Jul 2014 at 9:32
I will just repeat the last lines of my previous post:
Anyway, let's say I have lots of strings in CP1251 and want to make a general
change in the protobuf-net code (which I fork), so that I don't go field by
field. Where should I start? StringSerializer.cs?
Original comment by gabriel...@gabrielius.net
on 7 Jul 2014 at 5:44
The encoding is actually used as part of ProtoReader (ReadString) and
ProtoWriter (WriteString).
Original comment by marc.gravell
on 8 Jul 2014 at 9:04
Original issue reported on code.google.com by
Anton.Kr...@gmail.com
on 16 Jun 2011 at 3:07