RFC: Revise Transform Processing

b5 commented 6 years ago

This one's based on a meeting we had at Qri to deal with questions that have arisen from early user feedback. This addresses some critical design flaws in our current transform implementation.

Apologies, we're on a bit of a time crunch, I'm hoping to get this merged in time for this week's release.

stuartlynn commented 6 years ago

This looks good to me. I think the main option you are presenting is better than the close alternative. It makes more intuitive sense to me and I think you'r correct that it means that if a transform script doesn't need to download anything, then it's easier to ignore qri.results.download than a passed in context object.

Not sure about calling it qri.results.download though as results are intuitively what I would think of as the end of the transformation process, not an intermediary state (which it feels like the downloads are).

Just a quick tangent as well, which is not 100% relevant here, is if you are going to include an IPFS module as well for pulling in other distributed datasets? Having http is a must to interact with the existing web but it seems weird not to be able to grab a distributed dataset in the downloads phase by IPFS hash or IPFSNS, or another named Qri dataset. If you are encouraging people to use Qri as their main way of sharing data, it seems restrictive to not allow them to use those datasets in the transform of another Qri dataset. As I say probably for another RFC but wanted to mention this here.

b5 commented 6 years ago

are going to include an IPFS module as well for pulling in other distributed datasets? Having http is a must to interact with the existing web but it seems weird not to be able to grab a distributed dataset in the downloads phase by IPFS hash or IPFSNS, or another named Qri dataset.

💯 yes. We currently have qri.load_dataset_head("peer/dataset") for getting other Qri datasets. When you import datasets in this way Qri automatically records the versioning info so we can later ship features to help you maintain sync across versions.

As for grabbing raw IPFS data within starlark, absolutely planned, in a few different instantiations. The first module like to rfc-then-ship is a module called cafs that provides an API for doing native interaction with our cafs interface. I have no idea if we'll ever pull this off, but by calling it cafs we leave the door open to something like cafs.fetch('/ipfs/QmFoo...') and maybe cafs.fetch('dat://4cbe2f...'), which would fetch via if the underlying cafs instance had a DAT connection (or some way to resolve dat links). cafs can currently resolve IPNS links, so that should come for free when we ship this cafs module.

All this would happen in the download step of a transform. If other IPFS tech like IPLD stabilizes we can ship 'em as separate Qri modules for starlark.

Ideally starlark has access to both IPFS & regular internets before December 2018.

ramfox commented 6 years ago

Not sure about calling it qri.results.download though as results are intuitively what I would think of as the end of the transformation process, not an intermediary state (which it feels like the downloads are).

This is interesting. And I understand what you are saying, but I kind of like the fact that it seems like qri.results is the end of a process. It signifies to the user, without us having to explain, that you we have already run their download or some_special_function func. However, still brainstorming to think of a better word, although none has presented itself yet.

Just a quick tangent as well, which is not 100% relevant here, is if you are going to include an IPFS module as well for pulling in other distributed datasets? Having http is a must to interact with the existing web but it seems weird not to be able to grab a distributed dataset in the downloads phase by IPFS hash or IPFSNS, or another named Qri dataset. If you are encouraging people to use Qri as their main way of sharing data, it seems restrictive to not allow them to use those datasets in the transform of another Qri dataset. As I say probably for another RFC but wanted to mention this here.

100% agree, opened an issue #26

edit: just read @b5 's comment. 🙌 for cafs rather than ipfs

b5 commented 6 years ago

Very good points raised about qri.results. I have an update on technical implementation what will affect this anyway. I started work on a skytf branch to confirm the plan laid out in this RFC and ran into a critical issue. An example of the issue:

load("qri.sky", "qri")    # after this load call, skylark "freezes" all values in qri.sky 
                          # the "qri" value loaded should now be immutable

def download():
  return ["hello", "world"]

def transform(ds):
  qri.results.download     # <- this implies a value in the qri object is mutated :(

By placing return values in qri.results, we're violating a starlark principle. Because we have full control of how starlark behaves within the Qri environment, we can cheat this, but I think it's a bad idea to go against the grain this early.

Instead, I think we should revisit the "close second" option, passing all functions a context:

def download(context):
  return ["hello", "world"]

def transform(ds, context):
  ds.set_body(context.download)

This is effectively moving the duty of passing state between special functions out of qri.results and into context, which is made an explicit argument. Because context is being passed as an argument, it's no longer considered a "frozen" value by starlark, and can be mutated.

In this example I've dropped the results in favor of assigning return values of special functions directly to context.[special_function_name]. I agree with others on "results" implying completion, which isn't nice. This feels better to me. It also means we get to take a more "planned" approach to how context works. It would be an error if a user tried to set context.download directly, but we could hand them an API like context.set("val", val) and context.get("val") to pass arbitrary data around.

pros	cons
no longer need to `load("qri.sky", "qri")`	always have to accept a context
greater long term flexibility	steeper initial learning curve
less "magical" IMHO	we'll need to clearly document what context values will be available when

Thoughts?

ramfox commented 6 years ago

Womp womp.

Yes I agree this is the best choice given the constraints :)

dustmop commented 6 years ago

How do we handle the case of downloading multiple different webpages? Return a dict with a key pointing to each?

def download(context):
  a = http.get("http://example.com/a")
  b = http.get("http://example.com/b")
  c = http.get("http://example.com/c")
  return {"a": a, "b": b, "c": c}

b5 commented 6 years ago

If's destined to be a dataset component, I think so. I've used that pattern in the past, using surt urls as keys to create a datasets. We also use a similar pattern as a test case.

You didn't ask this, but to add some detail, a justification for passing in context here may also include passing around that might be destined for metadata like, num urls requested:

def download(context):
  a = http.get("http://example.com/a")
  b = http.get("http://example.com/b")
  c = http.get("http://example.com/c")
  context.set("num_urls_requested", 3)
  return {"a": a, "b": b, "c": c}

This frees the return value from having to give back all data, avoiding stuff like:

def download():
  # ...
  return { "num_urls_requested" : 3, "data": { "a", a, "b": b, "c": c }}

which I think is an antipattern if we have context.

b5 commented 6 years ago

lovely. @ramfox any objections to merging this?

b5 commented 6 years ago

🕺 another one down!

qri-io / rfcs

RFC: Revise Transform Processing #24