ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/
Other
16.16k stars 3.01k forks source link

HTTP APIs that accept binary data as query string args #7958

Open achingbrain opened 3 years ago

achingbrain commented 3 years ago

7939 exposed two issues with the HTTP API. A solution for one was proposed, the other was discussed at triage and I wanted to capture the scope of the work here.

Some of our API endpoints accept query string arguments that contain strings which will be interpreted as binary data. This causes problems for browsers and other JS clients in that the built in URL encoding functions accept strings and do UTF8 multi-byte code point conversions for byte values above 127/0x7F, so 0x80 becomes %EF%BF%BD instead of %80.

Instead we should allow sending multibase encoded strings for these arguments removing any ambiguity over their contents. The proposal was to add an additional flag that indicates to the API that the received string should be interpreted as a multibase encoded string and not a byte array.

I've done a quick pass over the HTTP API docs and these endpoints currently interpret an argument as a byte array, the good news is there aren't many:

/api/v0/dht/findpeer - arg needs encoding /api/v0/dht/findprovs - arg needs encoding /api/v0/dht/provide - arg needs encoding /api/v0/dht/put - arg needs encoding /api/v0/dht/query - arg needs encoding /api/v0/pubsub/pub - second arg needs encoding (also accepts payload in message body?)

These endpoints accept arguments that by convention would be B58 encoded multibase strings containing CIDs, but it is not explicit in the docs so could use a note saying the value 'should be an IPFS path or multibase encoded string containing a CID' or similar depending on context.

/api/v0/filestore/ls - arg may need encoding note? /api/v0/filestore/verify - arg may need encoding note? /api/v0/id - arg may need encoding note? /api/v0/name/resolve - arg may need encoding note? /api/v0/object/diff - arg may need encoding note? /api/v0/object/patch/add-link - arg may need encoding note? /api/v0/object/patch/append-data - arg may need encoding note? /api/v0/object/patch/rm-link - arg may need encoding note? /api/v0/object/patch/set-data - arg may need encoding note? /api/v0/ping - arg may need encoding note? /api/v0/stats/bw - peer may need encoding note? /api/v0/swarm/connect - arg may need encoding note? /api/v0/swarm/disconnect - arg may need encoding note?

There may be other affected endpoints.

BigLep commented 3 years ago

@Stebalien : to link to the relevant project proposals that would fix the right way.

Stebalien commented 3 years ago

The tricky part is mixing multibase-encoded arguments along with non-multibase-encoded arguments.

A possible solution is to reserve something like @... to mean "the following is multibase encoded". That would be a breaking change, but might be less terrible than some of the other options.

aschmahmann commented 3 years ago

but might be less terrible than some of the other options.

There are a bunch of options available here. One of these perhaps more terrible options, that's similar to the proposal above but trades off verbosity for transparency is having binary options get two flags --field-name and one of {--field-name-binary, --field-name-multibase, or --field-name-uses-multibase}. We'd need to do something similar (although perhaps not exactly the same) for arguments too.

mikeal commented 3 years ago

Whatever we go with, it would be good to put “multibase” in the name of the option/feature. Implying multibase in options marked “binary” is going to be more confusing for users, if we say “multibase” they’ll know what to Google to figure out what that means.

lidel commented 3 years ago

The tricky part is mixing multibase-encoded arguments along with non-multibase-encoded arguments.

I don't think we should allow mixing. When it comes to /api/v0/pubsub/pub both parameters should be in multibase because people use blobs in topic names as well (pretty sure I've seen it in the wild a while ago).

In the past we solved a variant of this problem in /api/v0/object/get by adding support for &data-encoding=base64, but it was scoped to a specific field in request body (object/get.js).

food for thought

To remove surface for errors like this from the entire HTTP API we need to add support for multibase envelope to arg params AND body in both request AND response. Not sure if we can do less, and avoid the problem coming back in the future.

proposal: single flag

I suggest a single flag &multibase-foobar=base64url (multibase-rpc? multibase-io? multibase-envelope?) that informs the /api/v0/ endpoint that:

Flag would be off by default (to maintain backward-compatibility), but js-ipfs-http-client should use this flag by default on /api/v0/pubsub/pub and similar endponts, just like it already does in object/get.js, quietly solving the problem for everyone who uses the lib. For everyone else experiencing issues when pushing binary instead of text, we would add a section at the top of https://docs.ipfs.io/reference/http/api/ (by updating ipfs/http-api-docs).

ps. Does not need to be base64url, but it is space-efficient and the safest default, so we should use it in examples and docs. There is a potential footgun with base64 when used in URL (and we would use it in arg). – we should detect it and return an error suggesting base64url instead.

achingbrain commented 3 years ago

How about:

proposal: use versioning

Make it /api/v1/pubsub/pub, /api/v1/dht/put etc and demand multibase for all binary sent on the query string or as fields in JSON instead of adding additional arguments which add complexity and the potential for mis-use.

No need to rev the version number for all endpoints at once, just a few at a time as and when they need upgrading.

lidel commented 3 years ago

Versioning via /api/v1 would also work, but introduction of a new API endpoint root not only complicates our code, but also complicates implementation of http client libraries in other languages, and may even require people to update their reverse proxy configs (if they only exposed v0).

At the end of the day we want to provide an easy fix for people, and a new query param feels like less work for the end user of the API: one can add it to the map passed to the API call, in some languages without even modifying the underlying library.

Not feeling strongly about this, but I feel we should pick the least expensive option for the existing ecosystem.

achingbrain commented 3 years ago

introduction of a new API endpoint root not only complicates our code

Kind of the opposite - I think it simplifies our code as with fewer options you have fewer branches to test. The more options you allow, the more execution paths there are in the code, and the more tests you need due to the combinatorial explosion.

We also don't have to keep on top of making options uniform across different APIs - see --datafieldenc, --data-encoding, --inputenc, --encoding etc for examples of this which makes our lives easier and our APIs more pleasant.

but also complicates implementation of http client libraries in other languages, and may even require people to update their reverse proxy configs (if they only exposed v0).

This doesn't seem like a deal breaker to me, just switch to use the new endpoint and encode your input args, no need for the user of the http client to change anything. You might have to add an extra line or a wildcard to your nginx config in the second case.

People using the API without a client will need to change how they use the API anyway, so we can take the opportunity to reduce our maintenance overhead by having fewer code paths to maintain and test. v0 of a given endpoint could even be extracted into a module that's rarely if ever updated.

a new query param feels like less work for the end user of the API

It usually does, and it's a quick hack for the implementer of the API as well, but over time adding options results in increased maintenance overhead for the implementer due to the resulting spaghetti code and conceptual overhead for the user as they now have to understand more things about the API in order to use it.

BigLep commented 3 years ago

This is blocked on side conversations that are being had to determine the way forward. See https://hackmd.io/cwWkQyXrRo-JZGUyAqXFvw

lidel commented 3 years ago

Decision summary after today's call:

Potential low priority tasks here: