@davidar I'll take a look at this. Thanks for reporting!
I will try to provide a test for this, but according to this page it looks like the UTF-8-test.txt file might have the following license:
https://creativecommons.org/licenses/by/4.0/
I will contact the author about this, but maybe it's acceptable to put it into "test/sharness/t0051-object-data" anyway.
The test in #1599 should help fix the issue even if we cannot merge it now.
Looked into this a bit, and it seems to me the problem is not with the JSON marshaller/unmarshaller. I found that the data of the object returned from the call to core.Resolve() (here) has extraneous data wrapping it. It seems like the only reason ipfs get is not affected by this is that it is run through the archiver whether you ask for archiving or not, which probably trims the wrapping off.
The easiest way to replicate the error is as follows (since ipfs object data returns the data and nothing else, through the same mechanism as all other get-commands):
$ echo "hello world" > helloworld.txt
$ ipfs add helloworld.txt
added QmT78zSuBmuS4z925WZfrqQ1qHaJ56DQaTfyMUF7F8ff5o helloworld.txt
$ ipfs object data QmT78zSuBmuS4z925WZfrqQ1qHaJ56DQaTfyMUF7F8ff5o > actual
$ diff <(hexdump helloworld.txt) <(hexdump actual)
1,2c1,3
< 0000000 6568 6c6c 206f 6f77 6c72 0a64
< 000000c
---
> 0000000 0208 0c12 6568 6c6c 206f 6f77 6c72 0a64
> 0000010 0c18
> 0000012
So somewhere along the line those extra bytes are added to the data, or there's some encoding I've missed, and it's just not correctly stripped for the ipfs object {data,get,...} commands.
Looking in the blockstore of the repo:
$ hexdump 122046d4/122046d44814b9c5af141c3aaab7c05dc5e844ead5f91f12858b021eba45768b4c0e.data
0000000 120a 0208 0c12 6568 6c6c 206f 6f77 6c72
0000010 0a64 0c18
0000014
we see that those bytes are already there. So if these bytes are actually a bug and not intended, then they're probably being added by the add call, and not on fetching.
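(A note for reading the dumps: hexdump without -C prints 16-bit little-endian words, so the actual byte order on disk is 0a 12 08 02 12 0c 68 65 ... 18 0c. The leading 0a 12 is itself protobuf framing, the merkledag node's Data field with length 18, assuming Data is field 1 in merkledag.proto.)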
This is my first dive into the code base, however, so I might have missed or misunderstood something.
This is with Go 1.5, by the way; I haven't tested with anything else.
@martingms i think what you're seeing is the unixfs protobuf: https://github.com/ipfs/go-ipfs/blob/master/unixfs/pb/unixfs.proto which is currently stored in the Data field of the ipfs object.
so, more clearly, we basically have something like:
# ipfs add
data := request.body
file := unixfs.NewFile(data)
obj := Node()
obj.Data = proto.Marshal(file)
# ipfs object data
obj := get(request.params.arg[0])
return obj.Data # with the serialized protobuf
# ipfs cat
obj := get(request.params.arg[0])
var file unixfs.File
proto.Unmarshal(obj.Data, &file)
return file.Data() # the raw data
tl;dr: i think this is fine.
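For reference, here's a minimal hand-rolled sketch of what that last step does to the bytes from the hexdump above (illustration only; go-ipfs uses the generated protobuf code rather than decoding by hand):

package main

import "fmt"

// The "extra" bytes around "hello world\n" are protobuf wire framing
// for the unixfs Data message: Type (field 1, 2 = File), Data (field 2),
// and filesize (field 3).
func main() {
	data := []byte{
		0x08, 0x02, // field 1 (Type), varint: 2 = File
		0x12, 0x0c, // field 2 (Data), length-delimited: next 12 bytes
		'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '\n',
		0x18, 0x0c, // field 3 (filesize), varint: 12
	}
	fmt.Printf("type=%d payload=%q filesize=%d\n", data[1], data[4:16], data[17])
	// Output: type=2 payload="hello world\n" filesize=12
}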
Aha, I see. So the core issue then is that the data is not being unmarshalled from protobuf before being put into the Data field of the output json.
The problem is that the json encoder is interpreting the data field as a utf8 string, but it should be base64ing it instead. I don't think protobuf has anything to do with it
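In Go terms, encoding/json already base64-encodes []byte fields automatically; it's string fields that get coerced to UTF-8. A minimal sketch (struct names are illustrative, not the real go-ipfs types):

package main

import (
	"encoding/json"
	"fmt"
)

type nodeAsString struct{ Data string }
type nodeAsBytes struct{ Data []byte }

func main() {
	raw := []byte{0xff, 0xfe, 'h', 'i'} // binary payload, not valid UTF-8

	s, _ := json.Marshal(nodeAsString{Data: string(raw)})
	b, _ := json.Marshal(nodeAsBytes{Data: raw})

	fmt.Println(string(s)) // invalid bytes become U+FFFD, so the data is corrupted
	fmt.Println(string(b)) // {"Data":"//5oaQ=="}, base64 round-trips exactly
}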
Hm okay. Looks like there are similar issues with the XML encoder if that's the case...
$ ipfs object get --enc=xml QmNY5sQeH9ttVCg24sizH71dNbcZTpGd7Yb3YwsKZ4jiFP
<Node><Data>������UTF-8 decoder capability and stress test
<snip>
but ipfs object put with --inputenc=xml is not yet supported, so I cannot verify.
Just to clarify things.
@jbenet wrote:
tl;dr: i think this is fine.
Do you mean that it should be expected that the hashes from ipfs add and from ipfs object put --inputenc=json are not the same?
Or do you just mean that it is fine that ipfs object data is displaying protobuf encoded data?
Or do you just mean that it is fine that ipfs object data is displaying protobuf encoded data?
this o/
@davidar I get the following test failure with your change:
ok 5 - 'ipfs object get' succeeds
expecting success:
test_cmp ../t0051-object-data/expected_getOut actual_getOut
> diff -u ../t0051-object-data/expected_getOut actual_getOut
--- ../t0051-object-data/expected_getOut 2015-06-12 14:04:03.680326279 +0000
+++ actual_getOut 2015-08-28 09:03:04.721395159 +0000
@@ -1,4 +1,4 @@
{
"Links": [],
- "Data": "\u0008\u0002\u0012\nHello Mars\u0018\n"
+ "Data": "CAISCkhlbGxvIE1hcnMYCg=="
}
\ No newline at end of file
@chriscool Then the test is broken; "CAISCkhlbGxvIE1hcnMYCg==" is the correct output.
Edit: the fix for this bug will change the output format of object get, so I would expect tests relying on the old (incorrect) format to break.
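Decoding that base64 string shows exactly the protobuf-framed bytes the old escaped string was representing (using GNU coreutils):

$ echo CAISCkhlbGxvIE1hcnMYCg== | base64 -d | hexdump -C
00000000  08 02 12 0a 48 65 6c 6c  6f 20 4d 61 72 73 18 0a  |....Hello Mars..|
00000010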
If I 'fix' the above test, then there is the following error:
expecting success:
ipfs object put ../t0051-object-data/testPut.json > actual_putOut
Error: illegal base64 data at input byte 4
not ok 9 - 'ipfs object put file.json' succeeds
#
# ipfs object put ../t0051-object-data/testPut.json > actual_putOut
#
@chriscool That test is also broken. The old behaviour was that Data was interpreted as a utf8-encoded unicode string. However, that doesn't match the protobuf schema, so it breaks on binary data that isn't valid utf8. So, any tests that assume "Data":"some unicode string" need to be changed to "Data":"SomeBase64String=="
@jbenet @whyrusleeping jump in if I'm saying anything incorrect
So, any tests that assume "Data":"some unicode string" need to be changed to "Data":"SomeBase64String=="
could swear we had it that way, not sure why it's not... oh i think @whyrusleeping removed it so it would look pretty and be easy to edit. (which is true, it is nicer to edit in plain text). just don't think we can do that in json land
@chriscool any further progress on this?
I think this one falls in my lap. I wanted the data fields to be easy to edit... But previously I thought that it did escapes like \uABCD or whatever.
Hey guys, TrekDev and I have been discussing this issue... what's the consensus here? Is ipfs object put supposed to just accept base64 and decode it internally on receive? Or will it continue to accept plain text?
I think the solution here might be to have different encoding types for these workflows. The json encoding should have a base64-encoded Data field all the time, and then we might want to have another type (not sure on naming) that does json but doesn't base64-encode the data, for ease of use.
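For example, the same "Hello Mars" node from the test above could render in the two styles like this (the second encoding's name is purely hypothetical):

# --enc=json: Data always base64
{"Links":[],"Data":"CAISCkhlbGxvIE1hcnMYCg=="}

# hypothetical plain-text-friendly variant, e.g. --enc=json-text
{"Links":[],"Data":"\u0008\u0002\u0012\nHello Mars\u0018\n"}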
While we're on this topic, we should think about how ipld will handle this sort of thing...
While we're on this topic, we should think about how ipld will handle this sort of thing...
CBOR distinguishes between bytestrings and unicode (text) strings, so only the former would need to be base64'd when displayed as JSON.
Edit: but yeah, you'd need some way of indicating which strings are actually encoded bytestrings
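For reference, the two CBOR types differ only in the major-type bits of the initial byte (RFC 7049): the text string "hi" encodes as 62 68 69 (major type 3, length 2), while a byte string holding the same two bytes encodes as 42 68 69 (major type 2, length 2).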
Will this be fixed?
Any update on this?
There is now an option to set the data field encoding to base64: https://github.com/ipfs/go-ipfs/blob/master/core/commands/object/object.go#L389 (use ipfs object get --data-encoding=base64 QmThing for blocks with binary data).
The Data field of ipfs object get --enc=json tries to decode binary data as UTF-8 into a unicode string (rather than emitting a base64-encoded blob). This causes problems with binary data that isn't valid UTF-8: note that the hashes don't match, as the json encoding has corrupted the data.
The problem seems to be line core/commands/object.go:202. I've tried fixing it myself, but I don't speak Go. It looks like Go's json module is supposed to do base64 conversion automatically, but it isn't working for me.