RFC: separate API for adding multiple?

alanshaw commented 5 years ago

Here in IPFSX adding content always returns an iterator, which is a nice consistent API, but when adding a single file calling .first() or .last() on the iterator feels a little clunky and it seems to me that the more usual case is actually adding a single file, so I'd like to keep that API as simple and clean as possible.

I've ruled out:

Automatically changing the return type to be a Promise if we detect a single item is being added and an Iterable otherwise. This is too weird, and I think people would get confused.
Always return a Promise that is monkey patched to also be Iterable 😱. This makes me feel weird, and it's not obvious what the promise would resolve to when adding multiple and I think people will accidentially await when they should be iterating i.e. its also confusing
Always return a Promise that resolves to an Iterable if adding multiple - similar confusion ranking 😂. There's also no need to force execution onto the next tick if we're going to resolve immediately to an async iterable.

So, what if this instead:

ipfs.add (single, returns a Promise) & ipfs.addEach (multiple, returns an async iterator)

ipfs.add accepts a Buffer or Iterable that yields buffers and returns a Promise that resolves to a CID e.g.

const cid = await ipfs.add(Buffer.from('hello world'))
// or
const cid = await ipfs.add(fs.createReadStream('/path/to/file'))

ipfs.addEach accepts an iterable (optionally async) where each yielded item is a file to be added. Each item can be a Buffer or Iterable that yields buffers or a { path, content } object. It returns an async iterable of { cid, path } e.g.
```
const adder = ipfs.addEach([
  { path: 'root/file1', content: fs.createReadStream(/*...*/) },
  { path: 'root/file2', content: fs.createReadStream(/*...*/) }
])

for await (const { cid, path } of adder)
  // ...
```

...and pushing this out to other API methods that write DAG nodes:

ipfs.block.put & ipfs.block.putEach
ipfs.dag.put & ipfs.dag.putEach
etc.

I want to keep the API surface area as small as possible so this is a step away from that. In general I believe that functions should be permissive in what they accept and consistent in what they return. However, we're not really talking about that here, what we're talking about is splitting on difference in operation - ipfs.add behaves very differently when you're adding multiple files than it does when adding a single file and combining the two together makes it very difficult to return a consistent value with a "nice" API for both cases.

Interested to hear opinions and alternatives and settle on the most convenient and best API. Adding stuff programmatically to IPFS needs to be way easier. I think what we have here in IPFSX is a good start but I'd like to make it even better!

cc @olizilla @lidel @hacdias @achingbrain @hugomrdias @daviddias @jacobheun @vasco-santos @mikeal @vmx @gozala (plz mention others if I've missed anyone that might be interested...)

jacobheun commented 5 years ago

I like this approach. The added API surface area is entirely reasonable. Overloading .add, .put, etc, as you mentioned just leads to confusion. We currently have a lot of methods that try to overload for simplicity, but it just makes it harder to follow and harder for us to maintain that code. Having clearly defined function contracts makes it easier for users to make informed decisions about which one they need to use. Since js doesn't have proper overloading, I'm a big 👍 to this.

daviddias commented 5 years ago

This is a kind of proposal that once it is on front of us, one goes like "why haven't we realized this sooner!" :D You have my 👍

vmx commented 5 years ago

I copied the original API, where the methods always return an Iterable for the new IPLD API. I was concerned about the same issue as mentioned here. After porting a lot of code, using .first() didn't feel as bad as I thought, which was surprising to me. The consistency is a big win.

For me it is easy to remember that those functions always return an Iterable. If it's a single item only I have the convenient .first() call. I only need to learn one API primitive, I can even forget about .first() if I want to, as long as I remember that the return values are an Iterable. With this change I need to remember that there's also an *each() version of the functions.

Though I can also see that the "single item version" is probably used more often, so it does make sense to make that as easy as possible. As you can see on my reply, I'm a bit torn.

olizilla commented 5 years ago

I share @vmx's hesitation. We're trading off calling .first() on the return value for having 2 api entries for every insert-ish operation. API surface area has a cost in docs, and cognitive load, even if it may simplify the implementation. IPFSx greatly improves that, so perhaps we have some credit in the api-surface-area bank, but I'd like to see the minimal version of the ipfsx api get released into the wild before we explore addding these.

Anecdotally, the code I'm writing has to deal with the requirement "let the user add 1 or more things" so it ends up being more convenient to always pass in an array, rather than handling the special case of a single item. I'm sure there are cases where having a streamlined "add one" api would be nice, but I'm not sure it warrents the extra surface area yet, as I'm not running into that need.

Perhaps this should be decided just on "what is easist to teach?". For the purposes of deciding on this issue, I should clarify that I am essentially an abstain, as I think there are pros and cons to either.

hacdias commented 5 years ago

I personally like the idea. Although I agree with @vmx and @olizilla. I think that it isn't necessarily bad to always return an iterable on .add. If the API for .add is to receive one or more things to add, I wouldn't expect it to return different types depending on the arguments.

Gozala commented 5 years ago

I’m +1 on degenerelizing APIs as in practice it’s a lot easier to reason when reading such code and type checkers (if you use one) are generally more helpful than if you accept arbitrary inputs and produce output depending in the input passed.

Gozala commented 5 years ago

I would even go as far as suggest to ditch add many, unless it does some kind of optimization I fail to see. IMO it fallls in a realm of utility functions that are agnostic of IPFS and it’s fairly useful to have such combinator library independently

To further elaborate my point - I think different strategy may make sense in different instances sometimes prallel puts and other times sequential and everything in between. Backing put many will likely end up choosing one strategy and in consequence make any other startegy a second class

jacobheun commented 5 years ago

I would even go as far as suggest to ditch add many, unless it does some kind of optimization I fail to see. IMO it fallls in a realm of utility functions that are agnostic of IPFS and it’s fairly useful to have such combinator library independently

I like this approach and think it keeps the api simple. If there are methods where we'd recommend parallelization or serialization for arrays, I think that is something that could be added to the docs for that method. jsdoc examples could be a really useful here.

alanshaw commented 5 years ago

I would even go as far as suggest to ditch add many, unless it does some kind of optimization I fail to see

Ok so, adding multiple/many allows you to create a directory structure, i.e. unixfs directory nodes will be created automatically with links pointing to directory contents. So, for example adding something like this:

ipfs.add([
  { path: 'root/file1', content: fs.createReadStream(/*...*/) },
  { path: 'root/dir/nestedfile', content: fs.createReadStream(/*...*/) },
  { path: 'root/file2', content: fs.createReadStream(/*...*/) }
])

Would yield something like:

[
    { path: 'root/file1', hash: 'QmHashFile1' },
    { path: 'root/dir/nestedfile', hash: 'QmHashNestedfile' },
    { path: 'root/dir', hash: 'QmHashDir' },
    { path: 'root/file2', hash: 'QmHashFile2' },
    { path: 'root', hash: 'QmHashRoot' }
]

And then you can use QmHashRoot to address content like:

/ipfs/QmHashRoot/file1
/ipfs/QmHashRoot/dir/nestedfile
/ipfs/QmHashRoot/file2

The only way to achieve this same structure (without the ability to add multiple) would be to manually create the protobuf DAG nodes that will act as the directories and add them to IPFS using the DAG API.

Gozala commented 5 years ago

Let me fist state that I totally understand that some of the API decisions might be irreversible at this point or too costly to be worth it. That being said as I was asked for opinion let me elaborate all on the concerns I have with current API and propose a way they could be addressed even though I understand doing some of that may not be visible for various reasons.

ipfs.add([
  { path: 'root/file1', content: fs.createReadStream(/*...*/) },
  { path: 'root/dir/nestedfile', content: fs.createReadStream(/*...*/) },
  { path: 'root/file2', content: fs.createReadStream(/*...*/) }
])

//=>
AsyncIterable.of([
    { path: 'root/file1', hash: 'QmHashFile1' },
    { path: 'root/dir/nestedfile', hash: 'QmHashNestedfile' },
    { path: 'root/dir', hash: 'QmHashDir' },
    { path: 'root/file2', hash: 'QmHashFile2' },
    { path: 'root', hash: 'QmHashRoot' }
])

As an author of function like foo(suff, ...other) => { .... ipfs.add(stuff) ... } I'm forced to do bunch of runtime checks on stuff / return value to figure out what hash do I need. Separate APIs for batch add and single add does help for single file but not necessarily with batch.
Is batch atomic operation ? What if async iterable fails on the n-th item ? Does the whole operation fail or does prior ones still succeed ? I hope it's atomic otherwise user needs to deal with the fact that it's partially succeeded, which means figuring out what succeeded and what failed etc...
- what if first entry fails but others don't ?
- If it's atomic then asyncIterator makes little sense because you either fail or succeed.
After I did batch add I need to then figure out what CIDs correspond to what path. That assumes invariant that each path will have CID in the resulting async iterator, but if that invariant isn't encoded in thy value itself, which means user needs to verify and deal with that.

Ok so, adding multiple/many allows you to create a directory structure, i.e. unixfs directory nodes will be created automatically with links pointing to directory contents. So, for example adding something like this:

I would argue alternative API might be able to accomplish the same goal without raising all of those concerns e.g.:


const batch = ipfs.batch()
await batch.write("root/file1", stream1);
await batch.write("root/dir/nestedfile", stream2);
await batch.write("root/file2",  stream3);

ipfs.add(batch) // => 'QmHashRoot'

If you do really want to capture all the CIDs it might be better to return instance like following instead:

class CID {
  constructor(cid, links) {
    this.cid = cid
    this.links = links
  }
  toString() {
    return this.cid
  }
  get(path) {
    for (const link of links) {
      if (link.path == path) {
        return link.hash
      }
    }
    return null
  }  
}

Note that above is clearly atomic operation that either succeeds or fails. User does not need to do any mapping between inputs and outputs and most of the time can use return value as is. However there is still a chance that return will not contain all entries.

There is still one use case that I think above API does not address, which is does not allow tracking a progress. If allowing that is a goal, there is yet another alternative that can be considered to address that as well. Which is instead of returning Promise<CID> you could instead return something like this:

interface AddResult {
  result:Promise<CID>
  progress:AsyncIterable<{ path:string, cid:CID }>
}

Gozala commented 5 years ago

I think it's also worth considering that there are web APIs that have batch semantics like FileList and FormData APIs. So technically you don't even need ipfs.batch() you could instead do following:

const bundle = new FormData()
bundle.append("root/text", "HelloWorld")
const response = await fetch(myURL)
const blob = await response.blob()
bundle.append("root/dir/nested", blob)

ipfs.add(bundle)

Unfortunately FormData, File and Blob APIs don't yet support ReadableStreams but I'm confident they will soon enough, furthermore I'm not sure it really matters here either way.

Gozala commented 5 years ago

Consolidated my thoughts on API design here https://gozala.hashbase.io/posts/Constraints%20of%20an%20API%20design/

alanshaw commented 5 years ago

Thanks @Gozala - lets continue the IPLD API design convo here though! https://github.com/ipld/js-ipld/pull/191

ipfs-shipyard / ipfsx

RFC: separate API for adding multiple? #10