**Closed** b5 closed this 6 years ago
This looks good to me. I think the main option you are presenting is better than the close alternative. It makes more intuitive sense to me, and I think you're correct that it means that if a transform script doesn't need to download anything, it's easier to ignore `qri.results.download` than a passed-in context object.
Not sure about calling it `qri.results.download` though, as results are intuitively what I would think of as the end of the transformation process, not an intermediary state (which it feels like the downloads are).
Just a quick tangent as well, which is not 100% relevant here: are you going to include an IPFS module as well for pulling in other distributed datasets? Having http is a must to interact with the existing web, but it seems weird not to be able to grab a distributed dataset in the downloads phase by IPFS hash or IPNS, or another named Qri dataset. If you are encouraging people to use Qri as their main way of sharing data, it seems restrictive to not allow them to use those datasets in the transform of another Qri dataset. As I say, probably for another RFC, but I wanted to mention this here.
> are you going to include an IPFS module as well for pulling in other distributed datasets? Having http is a must to interact with the existing web but it seems weird not to be able to grab a distributed dataset in the downloads phase by IPFS hash or IPNS, or another named Qri dataset.
💯 yes. We currently have `qri.load_dataset_head("peer/dataset")` for getting other Qri datasets. When you import datasets in this way, Qri automatically records the versioning info so we can later ship features to help you maintain sync across versions.
As for grabbing raw IPFS data within starlark: absolutely planned, in a few different instantiations. The first module I'd like to rfc-then-ship is a module called `cafs` that provides an API for native interaction with our cafs interface. I have no idea if we'll ever pull this off, but by calling it `cafs` we leave the door open to something like `cafs.fetch('/ipfs/QmFoo...')` and maybe `cafs.fetch('dat://4cbe2f...')`, which would fetch via DAT if the underlying cafs instance had a DAT connection (or some way to resolve dat links). cafs can currently resolve IPNS links, so that should come for free when we ship this `cafs` module.
All this would happen in the `download` step of a transform. If other IPFS tech like IPLD stabilizes, we can ship 'em as separate Qri modules for starlark.
Ideally starlark has access to both IPFS & regular internets before December 2018.
> Not sure about calling it `qri.results.download` though as results are intuitively what I would think of as the end of the transformation process, not an intermediary state (which it feels like the downloads are).
This is interesting. And I understand what you are saying, but I kind of like the fact that it seems like `qri.results` is the end of a process. It signifies to the user, without us having to explain, that we have already run their `download` or `some_special_function` func. However, I'm still brainstorming to think of a better word, although none has presented itself yet.
> Just a quick tangent as well, which is not 100% relevant here, is if you are going to include an IPFS module as well for pulling in other distributed datasets? Having http is a must to interact with the existing web but it seems weird not to be able to grab a distributed dataset in the downloads phase by IPFS hash or IPNS, or another named Qri dataset. If you are encouraging people to use Qri as their main way of sharing data, it seems restrictive to not allow them to use those datasets in the transform of another Qri dataset. As I say probably for another RFC but wanted to mention this here.
100% agree, opened an issue #26
edit: just read @b5's comment. 🙌 for `cafs` rather than `ipfs`
Very good points raised about `qri.results`. I have an update on technical implementation that will affect this anyway. I started work on a skytf branch to confirm the plan laid out in this RFC and ran into a critical issue. An example of the issue:
```python
load("qri.sky", "qri")  # after this load call, skylark "freezes" all values in qri.sky
                        # the "qri" value loaded should now be immutable

def download():
  return ["hello", "world"]

def transform(ds):
  qri.results.download  # <- this implies a value in the qri object is mutated :(
```
By placing return values in `qri.results`, we're violating a starlark principle. Because we have full control of how starlark behaves within the Qri environment, we can cheat this, but I think it's a bad idea to go against the grain this early.
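For anyone who hasn't hit this rule before: once a starlark module finishes loading, every value it exports becomes immutable, so assigning into the loaded `qri` object is exactly the kind of mutation the language forbids. A rough Python analogy of that behavior (the `Frozen` class is invented purely for illustration; it is not part of Qri or starlark):

```python
class Frozen:
    """Rough Python analogy for a starlark module value after load()."""

    def __init__(self, **values):
        # bypass our own __setattr__ while constructing
        object.__setattr__(self, "_frozen", False)
        for name, value in values.items():
            object.__setattr__(self, name, value)
        object.__setattr__(self, "_frozen", True)

    def __setattr__(self, name, value):
        # once frozen, any mutation is an error, like starlark's freeze
        if self._frozen:
            raise RuntimeError("cannot mutate frozen value: " + name)
        object.__setattr__(self, name, value)


qri = Frozen(results=None)   # roughly what load("qri.sky", "qri") hands back
# qri.results = ["hello"]    # would raise RuntimeError, as starlark would fail
```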
Instead, I think we should revisit the "close second" option, passing all functions a context:

```python
def download(context):
  return ["hello", "world"]

def transform(ds, context):
  ds.set_body(context.download)
```
This effectively moves the duty of passing state between special functions out of `qri.results` and into `context`, which is made an explicit argument. Because `context` is being passed as an argument, it's no longer considered a "frozen" value by starlark, and can be mutated.
In this example I've dropped the `results` in favor of assigning return values of special functions directly to `context.[special_function_name]`. I agree with others that "results" implies completion, which isn't nice. This feels better to me. It also means we get to take a more "planned" approach to how context works. It would be an error if a user tried to set `context.download` directly, but we could hand them an API like `context.set("val", val)` and `context.get("val")` to pass arbitrary data around.
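To make the proposed API concrete, here's a minimal Python sketch of how such a context object could behave. The class name, the reserved-key set, and the error type are all hypothetical; Qri's runtime is written in Go and would enforce this on the starlark side:

```python
class Context:
    """Hypothetical sketch of the per-transform context object."""

    # names reserved for special-function return values, written by the runtime
    _reserved = {"download"}

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        # user scripts may not overwrite special-function results directly
        if key in self._reserved:
            raise ValueError("cannot set reserved key: " + key)
        self._values[key] = value

    def get(self, key):
        return self._values[key]

    def _set_result(self, name, value):
        # called by the runtime after a special function returns
        self._values[name] = value

    def __getattr__(self, name):
        # context.download resolves to the stored download result
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name)


ctx = Context()
ctx._set_result("download", ["hello", "world"])  # runtime stores the result
ctx.set("num_urls_requested", 3)                 # user-set arbitrary value
```

The design choice sketched here is that `context.download` is read-only sugar for a runtime-written slot, while `set`/`get` carry user data.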
| pros | cons |
|---|---|
| no longer need to `load("qri.sky", "qri")` | always have to accept a context |
| greater long term flexibility | steeper initial learning curve |
| less "magical" IMHO | we'll need to clearly document what context values will be available when |
Thoughts?
Womp womp.
Yes I agree this is the best choice given the constraints :)
How do we handle the case of downloading multiple different webpages? Return a dict with a key pointing to each?
```python
def download(context):
  a = http.get("http://example.com/a")
  b = http.get("http://example.com/b")
  c = http.get("http://example.com/c")
  return {"a": a, "b": b, "c": c}
```
If it's destined to be a dataset component, I think so. I've used that pattern in the past, using SURT urls as keys to create datasets. We also use a similar pattern as a test case.
You didn't ask this, but to add some detail, a justification for passing in `context` here may also include passing around data that might be destined for metadata, like `num_urls_requested`:
```python
def download(context):
  a = http.get("http://example.com/a")
  b = http.get("http://example.com/b")
  c = http.get("http://example.com/c")
  context.set("num_urls_requested", 3)
  return {"a": a, "b": b, "c": c}
```
This frees the return value from having to give back all data, avoiding stuff like:

```python
def download():
  # ...
  return {"num_urls_requested": 3, "data": {"a": a, "b": b, "c": c}}
```

which I think is an antipattern if we have context.
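Putting the pieces together, the runtime's job would roughly be: run `download`, stash its return value on the context, then hand the same context to `transform`. A hypothetical driver sketch in Python (all names here are invented stand-ins; the real runtime executes starlark from Go):

```python
def run_transform(download_fn, transform_fn, ds):
    """Sketch: thread one context through the special functions in order."""

    class Ctx:
        # dict-backed stand-in for the proposed context object
        def __init__(self):
            self._values = {}

        def set(self, key, value):
            self._values[key] = value

        def get(self, key):
            return self._values[key]

    ctx = Ctx()
    ctx.download = download_fn(ctx)  # runtime stores the special-function result
    transform_fn(ds, ctx)            # same context, result now readable
    return ds


# toy stand-ins for a user script and dataset
def download(context):
    context.set("num_urls_requested", 3)
    return {"a": "body-a", "b": "body-b", "c": "body-c"}

class Dataset:
    def set_body(self, body):
        self.body = body

def transform(ds, context):
    ds.set_body(context.download)

result = run_transform(download, transform, Dataset())
```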
lovely. @ramfox any objections to merging this?
🕺 another one down!
This one's based on a meeting we had at Qri to deal with questions that have arisen from early user feedback. This addresses some critical design flaws in our current transform implementation.
Apologies, we're on a bit of a time crunch, I'm hoping to get this merged in time for this week's release.