ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

JSON encoding corrupts non-UTF-8 binary data #1582

Closed: davidar closed this issue 5 years ago

davidar commented 9 years ago

The Data field of ipfs object get --enc=json tries to decode binary data as a UTF-8 unicode string (rather than emitting it as a base64-encoded blob). This causes problems with binary data that isn't valid UTF-8:

$ wget https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
$ ipfs add UTF-8-test.txt 
added QmVDcJvu7CiqeUeHhnkvej9UZu6PMDXxED5r2nuuzn2G7M UTF-8-test.txt
$ ipfs object get --enc=json QmVDcJvu7CiqeUeHhnkvej9UZu6PMDXxED5r2nuuzn2G7M
{
  "Links": [],
  "Data": "\u0008\u0002\u0012��\u0001UTF-8 decoder capability and stress test\n
<snip>
$ ipfs object get --enc=json QmVDcJvu7CiqeUeHhnkvej9UZu6PMDXxED5r2nuuzn2G7M | ipfs object put --inputenc=json
added QmegjSvPuqqZPu5PHuevWmj8rLoGADTo1dnfCpcNWs77Po

Note that the hashes don't match, as the json encoding has corrupted the data.

The problem seems to be line core/commands/object.go:202. I've tried fixing it myself, but I don't speak Go. It looks like Go's json package is supposed to do the base64 conversion automatically, but it isn't working for me.
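
For reference, Go's encoding/json does base64-encode []byte fields automatically; the corruption appears when the bytes travel as a string field instead. A minimal sketch of that difference (standard library only, not the actual command code):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	raw := []byte{0x08, 0x02, 0x12, 0xff, 0xfe} // not valid UTF-8

	// A []byte field is base64-encoded automatically...
	asBytes, _ := json.Marshal(struct{ Data []byte }{raw})
	fmt.Println(string(asBytes)) // {"Data":"CAIS//4="}

	// ...but a string field is emitted as UTF-8, and the invalid bytes
	// are silently replaced with U+FFFD, corrupting the data.
	asString, _ := json.Marshal(struct{ Data string }{string(raw)})
	fmt.Println(string(asString)) // {"Data":"\u0008\u0002\u0012\ufffd\ufffd"}
}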

whyrusleeping commented 9 years ago

@davidar I'll take a look at this. Thanks for reporting!

chriscool commented 9 years ago

I will try to provide a test for this, but according to this page it looks like the UTF-8-test.txt file might have the following license:

https://creativecommons.org/licenses/by/4.0/

I will contact the author about this, but maybe it's acceptable to put it anyway into "test/sharness/t0051-object-data".

chriscool commented 9 years ago

The test in #1599 should help fix the issue even if we cannot merge it now.

martingms commented 9 years ago

Looked into this a bit, and it seems to me the problem is not with the JSON marshaller/unmarshaller. I found that the data of the object returned from the call to core.Resolve() (here) has extraneous data wrapping it.

It seems like the only reason ipfs get is not affected by this is that it is run through the archiver whether you ask for archiving or not, which presumably strips the extra bytes.

The easiest way to replicate the error is as follows (since ipfs object data returns the data and nothing else, through the same mechanism as all other get-commands):

$ echo "hello world" > helloworld.txt
$ ipfs add helloworld.txt
added QmT78zSuBmuS4z925WZfrqQ1qHaJ56DQaTfyMUF7F8ff5o helloworld.txt
$ ipfs object data QmT78zSuBmuS4z925WZfrqQ1qHaJ56DQaTfyMUF7F8ff5o > actual
$ diff <(hexdump helloworld.txt) <(hexdump actual)
1,2c1,3
< 0000000 6568 6c6c 206f 6f77 6c72 0a64          
< 000000c
---
> 0000000 0208 0c12 6568 6c6c 206f 6f77 6c72 0a64
> 0000010 0c18                                   
> 0000012

So somewhere along the line those extra bytes are added to the data (or there's some encoding I've missed), and they're just not correctly stripped for the ipfs object {data,get,...} commands.

Looking in the blockstore of the repo:

$ hexdump 122046d4/122046d44814b9c5af141c3aaab7c05dc5e844ead5f91f12858b021eba45768b4c0e.data
0000000 120a 0208 0c12 6568 6c6c 206f 6f77 6c72
0000010 0a64 0c18                              
0000014

we see that those bytes are already there. So if these bytes are actually a bug and not intended, then they're probably being added by the add-call, and not on fetching.

This is my first dive into the code base, though, so I might have missed or misunderstood something.

This is with Go 1.5, by the way; I haven't tested with anything else.

jbenet commented 9 years ago

@martingms i think what you're seeing is the unixfs protobuf: https://github.com/ipfs/go-ipfs/blob/master/unixfs/pb/unixfs.proto which currently is stored in the data field of the ipfs object.

so, more clearly, we basically have something like:

# ipfs add
data := request.body
file := unixfs.NewFile(data)
obj := Node()
obj.Data = proto.Marshal(file)

# ipfs object data
obj := get(request.params.arg[0])
return obj.Data # with the serialized protobuf

# ipfs cat
obj := get(request.params.arg[0])
var file unixfs.File
proto.Unmarshal(obj.Data, &file) # decode the unixfs wrapper
return file.Data() # the raw data

tl;dr: i think this is fine.
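
To tie this back to the hexdumps above, here's a sketch that hand-rolls the same unixfs Data framing (field numbers taken from unixfs.proto; single-byte lengths assumed, so it only holds for payloads under 128 bytes):

package main

import (
	"bytes"
	"fmt"
)

func main() {
	payload := []byte("hello world\n")

	// unixfs Data message, framed by hand:
	//   0x08 0x02  field 1 (Type), varint 2 = File
	//   0x12 0x0c  field 2 (Data), 12 length-prefixed bytes
	//   0x18 0x0c  field 3 (filesize), varint 12
	var buf bytes.Buffer
	buf.Write([]byte{0x08, 0x02})
	buf.WriteByte(0x12)
	buf.WriteByte(byte(len(payload)))
	buf.Write(payload)
	buf.WriteByte(0x18)
	buf.WriteByte(byte(len(payload)))

	// These are exactly the bytes `ipfs object data` returned above;
	// `ipfs cat` strips them by unmarshalling the protobuf first.
	fmt.Printf("% x\n", buf.Bytes())
}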

martingms commented 9 years ago

Aha, I see. So the core issue then is that the data is not being unmarshalled from protobuf before being put into the Data field of the output JSON.

davidar commented 9 years ago

The problem is that the JSON encoder is interpreting the data field as a UTF-8 string, but it should be base64-encoding it instead. I don't think protobuf has anything to do with it.


martingms commented 9 years ago

Hm okay. Looks like there are similar issues with the XML encoder if that's the case...

$ ipfs object get --enc=xml QmNY5sQeH9ttVCg24sizH71dNbcZTpGd7Yb3YwsKZ4jiFP
<Node><Data>������UTF-8 decoder capability and stress test
<snip>

but putting objects with --inputenc=xml is not yet supported, so I cannot verify.

chriscool commented 9 years ago

Just to clarify things.

@jbenet wrote:

tl;dr: i think this is fine.

Do you mean that it should be expected that the hash from ipfs add and from ipfs object put --inputenc=json are not the same?

Or do you just mean that it is fine that ipfs object data is displaying protobuf encoded data?

jbenet commented 9 years ago

Or do you just mean that it is fine that ipfs object data is displaying protobuf encoded data?

this o/

davidar commented 9 years ago

@martingms @chriscool This might fix it, but I haven't tested.

chriscool commented 9 years ago

@davidar I get the following test failure with your change:

ok 5 - 'ipfs object get' succeeds

expecting success: 
                test_cmp ../t0051-object-data/expected_getOut actual_getOut

> diff -u ../t0051-object-data/expected_getOut actual_getOut
--- ../t0051-object-data/expected_getOut        2015-06-12 14:04:03.680326279 +0000
+++ actual_getOut       2015-08-28 09:03:04.721395159 +0000
@@ -1,4 +1,4 @@
 {
   "Links": [],
-  "Data": "\u0008\u0002\u0012\nHello Mars\u0018\n"
+  "Data": "CAISCkhlbGxvIE1hcnMYCg=="
 }
\ No newline at end of file

davidar commented 9 years ago

@chriscool Then the test is broken; "CAISCkhlbGxvIE1hcnMYCg==" is the correct output.

Edit: the fix for this bug will change the format of object get, so I would expect tests relying on the old (incorrect) format to break
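
A quick check (standard library only) that the two expected outputs encode the same bytes, so only the representation changed:

package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	// The old expected output, with its JSON escapes interpreted...
	escaped := "\u0008\u0002\u0012\nHello Mars\u0018\n"
	// ...and the new base64 form decode to the same protobuf bytes.
	decoded, _ := base64.StdEncoding.DecodeString("CAISCkhlbGxvIE1hcnMYCg==")
	fmt.Println(escaped == string(decoded)) // true
}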

chriscool commented 9 years ago

If I 'fix' the above test, then there is the following error:

expecting success: 
                ipfs object put  ../t0051-object-data/testPut.json > actual_putOut

Error: illegal base64 data at input byte 4
not ok 9 - 'ipfs object put file.json' succeeds
#
#                       ipfs object put  ../t0051-object-data/testPut.json > actual_putOut
#
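
That failure is expected once Data is decoded as base64: an old-style fixture holding plain text gets rejected. A sketch with a made-up fixture (testPut.json's actual contents aren't shown in this thread):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Hypothetical old-style fixture: Data is plain text, not base64.
	fixture := []byte(`{"Data":"Hi, Mars"}`)

	var node struct{ Data []byte } // []byte makes the decoder expect base64
	err := json.Unmarshal(fixture, &node)
	fmt.Println(err) // illegal base64 data at input byte 2
}
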
davidar commented 9 years ago

@chriscool That test is also broken. The old behaviour was that Data was interpreted as a UTF-8-encoded unicode string. However, that doesn't match the protobuf schema, so it breaks on binary data that isn't valid UTF-8. So, any tests that assume "Data":"some unicode string" need to be changed to "Data":"SomeBase64String=="

@jbenet @whyrusleeping jump in if I'm saying anything incorrect

jbenet commented 9 years ago

So, any tests that assume "Data":"some unicode string" need to be changed to "Data":"SomeBase64String=="

Could swear we had it that way; not sure why it's not... oh, I think @whyrusleeping removed it so it would look pretty and be easy to edit (which is true, it is nicer to edit in plain text). I just don't think we can do that in JSON land.

davidar commented 9 years ago

@chriscool any further progress on this?

whyrusleeping commented 9 years ago

I think this one falls in my lap. I wanted the data fields to be easy to edit... but previously I thought that it did escapes like \uABCD or whatever.

slothbag commented 8 years ago

Hey guys, TrekDev and I have been discussing this issue... what's the consensus here? Is ipfs object put supposed to just accept base64 and decode it internally on receive? Or will it continue to accept plain text?

whyrusleeping commented 8 years ago

I think the solution here might be to have different encoding types for these workflows. The json encoding should have a base64-encoded Data field all the time, and then we might want another type (not sure on naming) that does JSON but doesn't base64-encode the data, for ease of use.
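
A rough sketch of what those two modes could look like (the names and the node shape here are mine, purely illustrative):

package main

import (
	"encoding/json"
	"fmt"
)

// node is a hypothetical stand-in for the merkledag object.
type node struct {
	Data []byte
}

// encode sketches the two proposed encodings: the strict one always
// base64-encodes Data (what encoding/json does for []byte anyway),
// while the loose, human-editable one emits it as a plain string.
func encode(n node, humanEditable bool) []byte {
	if humanEditable {
		out, _ := json.Marshal(struct{ Data string }{string(n.Data)})
		return out // only safe when Data is valid UTF-8
	}
	out, _ := json.Marshal(n)
	return out // round-trip safe for arbitrary bytes
}

func main() {
	n := node{Data: []byte("Hello Mars")}
	fmt.Println(string(encode(n, false))) // {"Data":"SGVsbG8gTWFycw=="}
	fmt.Println(string(encode(n, true)))  // {"Data":"Hello Mars"}
}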

While we're on this topic, we should think about how ipld will handle this sort of thing...

davidar commented 8 years ago

While we're on this topic, we should think about how ipld will handle this sort of thing...

CBOR distinguishes between bytestrings and unicode (text) strings, so only the former would need to be base64'd when displayed as JSON.

Edit: but yeah, you'd need some way of indicating which strings are actually encoded bytestrings
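
For concreteness, the distinction is carried in CBOR's initial byte; a hand-encoded sketch (valid only for payloads shorter than 24 bytes):

package main

import "fmt"

func main() {
	payload := []byte("hi")

	// CBOR's initial byte packs the major type (high 3 bits) with the
	// length (for short items). Major type 2 = byte string, 3 = text.
	byteStr := append([]byte{0x40 | byte(len(payload))}, payload...)
	textStr := append([]byte{0x60 | byte(len(payload))}, payload...)

	fmt.Printf("byte string: % x\n", byteStr) // 42 68 69
	fmt.Printf("text string: % x\n", textStr) // 62 68 69

	// Only the byte-string form would need base64 when shown as JSON.
}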

thisconnect commented 7 years ago

Will this be fixed?

ydennisy commented 6 years ago

Any update on this?

magik6k commented 6 years ago

There is now an option to set data field encoding to base64: https://github.com/ipfs/go-ipfs/blob/master/core/commands/object/object.go#L389

magik6k commented 5 years ago

(use ipfs object get --data-encoding=base64 QmThing for blocks with binary data)