Feature request: need support of different string encodings

GoogleCodeExporter commented 9 years ago

Anton.Krupnov@gmail.com
==============

Unlike XmlSerializer, Serializer has no Encoding argument. Some applications, 
like mine, need one: not every device supports UTF-8 out of the box. So please 
add Encoding support.

I've checked the sources: in ProtoWriter the Encoding is already hardcoded to 
UTF8. Please make it a non-static variable that can be set from a Serializer 
parameter. Thanks a lot!

Original issue reported on code.google.com by Anton.Kr...@gmail.com on 16 Jun 2011 at 3:07

GoogleCodeExporter commented 9 years ago

Since protobuf is a binary encoding, the representation of strings is an 
implementation detail. This is explicitly utf-8 in the google specification.

Please can you be specific about some scenario where UTF-8 is a problem? In 
particular, I am not aware of a scenario in .NET that doesn't have UTF-8 
available.

If you (for some reason) need to encode a protobuf as a string, use base-64; it 
is incorrect to use an encoding to do this.

Essentially, I do not understand the issue where you would want a different 
encoding here. It applies to XmlSerializer because XML is a text-based 
serializer. Protobuf is not. Encoding is unrelated.

Original comment by marc.gravell on 16 Jun 2011 at 3:36

GoogleCodeExporter commented 9 years ago

Case: I need to exchange data with non-.NET devices (for example Java-based 
POS terminals, C++-based card readers, etc). Some devices do not support UTF-8, 
but only plain ASCII. I have a .NET service connected to several countries, like 
Russia, Malaysia, China, etc. In each country, devices use a national encoding 
(e.g. to print receipts) and send me text fields in that national encoding as 
well. I know the encoding of each message, so I would like to convert them to 
.NET strings correctly using the Encoding.GetString() function. The most obvious 
way to implement this is to pass a System.Text.Encoding parameter to the 
serialize/deserialize methods (like XmlSerializer does).

The possible workaround is to declare all my strings as byte[] and decode/encode 
manually. But this is a really ugly approach and I have hundreds of classes! 
Patching your library manually is not really handy either, as I'll have to do it 
each time the library is updated. So please make Encoding a parameter instead of 
hardcoding it. Thanks.

Original comment by Anton.Kr...@gmail.com on 17 Jun 2011 at 11:28

GoogleCodeExporter commented 9 years ago

I'm confused... if they only support ASCII, then: just use strings with ASCII 
characters in them. That will be 100% identical when encoded via either ASCII 
or UTF-8.
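
To illustrate the point about ASCII-only text, here is a minimal sketch (the class and method names are made up for this example, nothing here comes from protobuf-net) that compares the bytes the two encodings produce:

```csharp
using System;
using System.Linq;
using System.Text;

static class AsciiVersusUtf8
{
    // True when the text produces exactly the same bytes under ASCII and UTF-8.
    public static bool SameBytes(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return ascii.SequenceEqual(utf8);
    }

    static void Main()
    {
        // Pure 7-bit ASCII text: both encodings emit one identical byte per character.
        Console.WriteLine(SameBytes("TOTAL 12.50"));        // True

        // Three Cyrillic letters: ASCII cannot represent them, UTF-8 uses two bytes each.
        Console.WriteLine(SameBytes("\u0427\u0435\u043A")); // False
    }
}
```

So a device that really is limited to 7-bit ASCII can already read what a UTF-8 writer emits, as long as the strings contain only ASCII characters.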

You then say: "devices use national encoding" - but then; that isn't ASCII. 
That sounds like code-page encoding (which is different).

The problem is; the protobuf wire format explicitly states UTF-8. Every 
protobuf implementation will therefore *expect* UTF-8. If I give you something 
using an arbitrary encoding, I'm unclear how you will decode that. What library 
do you intend using to *decode* this? Does *that* support different encodings? 
If it doesn't, then my encoding it ***won't help you***...

Please clarify.

Original comment by marc.gravell on 17 Jun 2011 at 11:34

GoogleCodeExporter commented 9 years ago

Right, by ASCII I mean "extended ASCII": one-byte national encodings, not 
ASCII itself. It's confusing, but I'm talking about national encodings. Going 
further, they are not necessarily one-byte national encodings, as in China they 
might use some sort of Unicode encoding (different than UTF-8).

In the protocol buffers wire format, a string is a varint-encoded length 
followed by that many bytes, so the protocol itself does not care about the 
string encoding. They are just bytes. As for compatibility problems with 
device-side libraries:
1. they might be custom (we develop them ourselves), or 
2. device-side code might declare strings as bytes. 
Anyway, on the server side I should have an option to specify the encoding, so 
that string fields are serialized/deserialized in a compatible way.
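
To make the framing concrete, here is a hand-rolled sketch of a length-delimited field. It is not protobuf-net's code: `WireSketch` and `EncodeStringField` are hypothetical names, and field number 5 is arbitrary. It only shows that the tag and length prefix are built the same way whatever bytes follow:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class WireSketch
{
    // Encodes one length-delimited field: tag, varint length, then the raw payload bytes.
    public static byte[] EncodeStringField(int fieldNumber, string value, Encoding encoding)
    {
        byte[] payload = encoding.GetBytes(value);
        var output = new List<byte>();
        WriteVarint(output, (uint)((fieldNumber << 3) | 2)); // wire type 2 = length-delimited
        WriteVarint(output, (uint)payload.Length);
        output.AddRange(payload);
        return output.ToArray();
    }

    // Base-128 varint: 7 bits per byte, high bit set on every byte except the last.
    static void WriteVarint(List<byte> output, uint value)
    {
        while (value >= 0x80)
        {
            output.Add((byte)(value | 0x80));
            value >>= 7;
        }
        output.Add((byte)value);
    }

    static void Main()
    {
        string text = "\u0427\u0435\u043A"; // three Cyrillic letters

        // prints 2A-06-D0-A7-D0-B5-D0-BA
        Console.WriteLine(BitConverter.ToString(EncodeStringField(5, text, Encoding.UTF8)));

        // prints 2A-06-27-04-35-04-3A-04 (UTF-16LE payload, same framing)
        Console.WriteLine(BitConverter.ToString(EncodeStringField(5, text, Encoding.Unicode)));
    }
}
```

The framing alone does not tell a reader which encoding produced the payload, so both sides have to agree on it by some other means.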

Original comment by Anton.Kr...@gmail.com on 17 Jun 2011 at 12:33

GoogleCodeExporter commented 9 years ago

I'm going to refer to 
http://code.google.com/apis/protocolbuffers/docs/proto.html

"string | A string must always contain UTF-8 encoded or 7-bit ASCII text."

so I disagree that this is something that the server *should* try to support. 
Doing so would actively violate the spec (I have no problem adding extra 
features *in a spec-compatible* way, but this ... isn't).

Anything I do here would really require opt-in at a per-member level, otherwise 
it really is asking for problems (and sadly, it is me that would then have to 
deal with those problems when other people use it without realising the 
inherent problems in what you suggest). Would you find that reasonable, i.e. 
indicating on a per-member basis that it can use the non-UTF8 encoder? 
Then presumably we would specify the encoding in the Serialize (etc) call.

Original comment by marc.gravell on 17 Jun 2011 at 12:56

GoogleCodeExporter commented 9 years ago

I'm also changing this from "defect" to "enhancement"; adhering to the formal 
spec is not a defect.

Original comment by marc.gravell on 17 Jun 2011 at 12:57

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

Per-member indication is not a solution, because I have to specify the encoding 
per request, but for all string fields. And I see no problem for other people in 
adding an overload of the Serialize/Deserialize methods with an additional 
Encoding parameter. The default would stay UTF-8 anyway, and only those who need 
this 'out of spec' option would change the parameter.

OK, it is no harm for me to patch the code manually once I meet the first 
non-UTF-8 device. In general you are right, this really is outside the Google 
spec.

Original comment by Anton.Kr...@gmail.com on 17 Jun 2011 at 1:57

GoogleCodeExporter commented 9 years ago

Original comment by marc.gravell on 25 Jun 2011 at 9:52

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

Yes, I have run into the same problems with Russian systems as well. I mean 
problems with encodings, where the string is not encoded in ASCII or UTF-8, but 
rather in CP1251.

Original comment by gabriel...@gabrielius.net on 1 Jul 2014 at 5:27

GoogleCodeExporter commented 9 years ago

There could be some helper methods/parameters to work around this issue, even 
if that goes beyond the specification.

Original comment by gabriel...@gabrielius.net on 1 Jul 2014 at 5:30

GoogleCodeExporter commented 9 years ago

It is a feature of the protobuf wire specification that strings are encoded in 
UTF-8. If you use a different encoding then it is not compliant protobuf data. 
The important thing is surely that you get back the data you started with. 
UTF-8 guarantees that.
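
As a small sketch of what "get back the data you started with" means (the names are invented for this example, and ISO-8859-1 stands in for any single-byte code page only because it ships with every .NET runtime):

```csharp
using System;
using System.Text;

static class RoundTripSketch
{
    // True when encoding and then decoding returns the original text unchanged.
    public static bool RoundTrips(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text)) == text;
    }

    static void Main()
    {
        string text = "\u0427\u0435\u043A"; // three Cyrillic letters

        // UTF-8 can represent every character, so nothing is lost.
        Console.WriteLine(RoundTrips(text, Encoding.UTF8));                      // True

        // A single-byte code page without these letters substitutes them on encode.
        Console.WriteLine(RoundTrips(text, Encoding.GetEncoding("iso-8859-1"))); // False
    }
}
```

Any code page has the same limitation for characters outside its own repertoire, whereas UTF-8 has none.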

What would be the purpose of this change?

Original comment by marc.gravell on 1 Jul 2014 at 5:43

GoogleCodeExporter commented 9 years ago

I completely agree with your statements.

The main purpose is that I am currently working with a remote Russian system 
(a third-party system) that I am not in charge of, so I cannot change how the 
messages are sent to me. However, I need to get the info and display it 
correctly. That system encodes data in CP1251 and not in UTF-8, and the only 
way I could get the strings represented nicely is to specify a custom encoding, 
e.g. code page 1251. All the strings are encoded in CP1251.

I can attach the proto-object serialized to a file, along with the description 
of the proto-object, so you can see for yourself. Thanks.

Original comment by gabriel...@gabrielius.net on 1 Jul 2014 at 6:10

GoogleCodeExporter commented 9 years ago

The best thing I can recommend here is: instead of

    [ProtoMember(5)]
    public string Foo {get;set;}

you use:

    [ProtoMember(5)]
    public byte[] FooBinary {
        get { return Foo == null ? null : someEncoding.GetBytes(Foo); }
        set { Foo = value == null ? null : someEncoding.GetString(value); }
    }

    public string Foo {get;set;}

would that work? Or are there lots of strings involved?
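
A fuller sketch of that shim, under some assumptions: `Receipt` and `ReceiptDemo` are hypothetical names, code page 1251 is assumed for the field, and the protobuf-net package is assumed to be referenced. It relies on string and bytes fields sharing the same length-delimited wire type, so a byte[] member can read what the other side wrote as a string field:

```csharp
using System.IO;
using System.Text;
using ProtoBuf;

[ProtoContract]
public class Receipt // hypothetical message type
{
    // Assumption: the remote system writes field 5 as CP1251 bytes.
    // On .NET Core and later, code page 1251 is only available after
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has run.
    private static readonly Encoding Cp1251 = Encoding.GetEncoding(1251);

    // Serialized member: the raw bytes exactly as they appear on the wire.
    [ProtoMember(5)]
    public byte[] FooBinary
    {
        get { return Foo == null ? null : Cp1251.GetBytes(Foo); }
        set { Foo = value == null ? null : Cp1251.GetString(value); }
    }

    // Not marked with [ProtoMember], so protobuf-net does not serialize it.
    public string Foo { get; set; }
}

public static class ReceiptDemo
{
    // Serializes a Receipt to memory and reads it back, returning the decoded text.
    public static string RoundTrip(string text)
    {
        using (var stream = new MemoryStream())
        {
            Serializer.Serialize(stream, new Receipt { Foo = text });
            stream.Position = 0;
            return Serializer.Deserialize<Receipt>(stream).Foo;
        }
    }
}
```

The rest of the application only ever touches `Foo`; the encoding choice stays confined to the one property that protobuf-net sees.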

My point is: if somebody was sending malformed xml, it is expected that xml 
serializers reject it. If somebody sends malformed json, it is expected that 
json serializers reject it. The only correct encoding for strings in the 
protobuf specification is: utf-8.

Original comment by marc.gravell on 1 Jul 2014 at 6:51

GoogleCodeExporter commented 9 years ago

Instead of
    [ProtoMember(5)]
    public string Foo {get;set;}

I used
    [ProtoMember(5)]
    public byte[] Foo {get;set;}

and dealt with the encoding outside, and it worked, thanks. That can definitely 
serve as a workaround.

There are lots of strings, though; some use ASCII and others CP1251. I haven't 
used all the proto-file objects yet, since the generated file is ~5500 lines, 
so it's hard to tell whether going field by field will suffice.

Anyway, let's say I have lots of strings in CP1251 and want to make a general 
change in the protobuf-net code (which I fork), so that I don't go field by 
field. Where should I start? StringSerializer.cs?

Original comment by gabriel...@gabrielius.net on 2 Jul 2014 at 9:32

GoogleCodeExporter commented 9 years ago

I will just repeat the last two lines of my previous post:

Anyway, let's say I have lots of strings in CP1251 and want to make a general 
change in the protobuf-net code (which I fork), so that I don't go field by 
field. Where should I start? StringSerializer.cs?

Original comment by gabriel...@gabrielius.net on 7 Jul 2014 at 5:44

GoogleCodeExporter commented 9 years ago

The encoding is actually used as part of ProtoReader (ReadString) and 
ProtoWriter (WriteString).

Original comment by marc.gravell on 8 Jul 2014 at 9:04

handoutz / protobuf-net

Feature request: need support of different string encodings #187