grailbio / reflow

A language and runtime for distributed, incremental data processing in the cloud
Apache License 2.0

Cannot get it to run hello world #145

Open masiulaniec opened 1 year ago

masiulaniec commented 1 year ago

I am trying to evaluate reflow but getting stopped in my tracks at Quick Start. I am simply trying to run the hello world example:

$ cat hello.rf 
val Main = exec(image := "ubuntu") (out file) {"
        echo hello world >>{{out}}
"}
$

I initially assumed that local mode (Docker) would be the quickest. So I ran:

$ reflow run -local hello.rf
2022/12/11 15:36:37 localcluster Init requires taskdb.TaskDB: unspecified
$

This smells like an internal error (dependency injection failure). It is surprising that TaskDB is a hard dependency when in -local mode. It contradicts the official documentation, which states that TaskDB is a soft dependency even for cluster mode.

Having given up on Docker, I fell back on the official EC2 quickstart from the README. The setup-ec2 / setup-s3-repository / setup-dynamodb-assoc trio worked fine. Unfortunately, reflow run failed in a surprising way:

$ reflow run hello.rf
reflow: reflow runtime: ===== started =====
reflow: reflow version: 1.27.0 (go1.18.4)
reflow: run ID: 44898ba0
reflow: evaluating program /Users/me/reflow/hello.rf
        (no params)
        (no arguments)
reflow: Trace: none (since nopTracer is in use)
reflow: evaluating with configuration: scheduler *sched.Scheduler snapshotter blob.Mux repository *blobrepo.Repository,url=s3://masiunet-reflow-test/ assoc *dydbassoc.Assoc,TableName=masiunet-reflow-test flags nocache,norecomputeempty,topdown flowconfig hashv2 cachelookuptimeout 20m0s imagemap map[ubuntu:index.docker.io/library/ubuntu@sha256:965fbcae990b0467ed5657caceaec165018ef44a4d2d46c7cdea80a9dff0d1ea] dotwriter(*os.File)
reflow: (flow 3dca1cc0): reviseResources {mem:500.0MiB cpu:1 disk:0B}: resources {mem:500.0MiB cpu:1 disk:0B} are way higher than max {mem:0B cpu:128 disk:250.0GiB intel_avx:128 intel_avx2:128 intel_avx512:128 intel_turbo:128}
reflow:  ->  hello.Main   3dca1cc0 exec   exec ..aec165018ef44a4d2d46c7cdea80a9dff0d1ea echo hello world >>{{out}}
reflow: hello.Main 3dca1cc0 /Users/me/reflow/hello.rf:1:16:
        resources: {mem:500.0MiB cpu:1 disk:0B}
        sha256:143d42326a7796eab8314a0030604c95e7afad1587ce681492f911b501b54db9
        sha256:b5cf39692f785fbbbc9ac03dbc00c2bde0ff2076d0373724293f810b2f1276b3
        sha256:3dca1cc06adb7b4a76dbc5a526c60ebed36ad8793b5a13cc6449c4c7ff329c8e
        index.docker.io/library/ubuntu@sha256:965fbcae990b0467ed5657caceaec165018ef44a4d2d46c7cdea80a9dff0d1ea
        command:
            echo hello world >>{{out}}
        where:
reflow:  <-  hello.Main   3dca1cc0 err    exec 0s ?
        error resources exhausted: requested resources {mem:500.0MiB cpu:1 disk:0B} not satisfiable even by largest available instance type x2iedn.32xlarge with resources {mem:0B cpu:128 disk:250.0GiB intel_avx:128 intel_avx2:128 intel_avx512:128 intel_turbo:128}
        /Users/me/reflow/hello.rf:1:16
        index.docker.io/library/ubuntu@sha256:965fbcae990b0467ed5657caceaec165018ef44a4d2d46c7cdea80a9dff0d1ea
        command:
            echo hello world >>{{out}}
        where:
        profile:
            cpu mean=0.0 max=0.0 (N=0, duration=0s)
            mem mean=0B max=0B (N=0, duration=0s)
            disk mean=0B max=0B (N=0, duration=0s)
            tmp mean=0B max=0B (N=0, duration=0s)
reflow: total n=1 time=0s
        ident      n   ncache runtime(m) cpu mem(GiB) disk(GiB) tmp(GiB) requested
        hello.Main 1   0                                        

reflow: marking run done after nonrecoverable error resources exhausted: requested resources {mem:500.0MiB cpu:1 disk:0B} not satisfiable even by largest available instance type x2iedn.32xlarge with resources {mem:0B cpu:128 disk:250.0GiB intel_avx:128 intel_avx2:128 intel_avx512:128 intel_turbo:128}
reflow: resources exhausted: requested resources {mem:500.0MiB cpu:1 disk:0B} not satisfiable even by largest available instance type x2iedn.32xlarge with resources {mem:0B cpu:128 disk:250.0GiB intel_avx:128 intel_avx2:128 intel_avx512:128 intel_turbo:128}
$

The advertised mem:0B for the largest instance type looks suspicious, since no memory request can fit in 0 B, but I have not looked deeper than that.

I tried a few older release builds but they all fail with the same error. If I go back far enough, I get a different error:

$ ~/Downloads/reflow1.13.0.darwin.amd64 run hello.rf
infra.Init: provider ec2cluster for type *ec2cluster.Cluster: missing AMI parameter
$

I was going to attempt some code fixups but here I encountered yet more trouble: the standard go install workflow does not work:

$ ~/sdk/go1.19.3/bin/go install github.com/grailbio/reflow/cmd/reflow@latest
go: downloading github.com/grailbio/reflow v0.0.0-20221206232358-04b01f719b84
go: finding module for package github.com/grailbio/base/s3util
go: finding module for package github.com/grailbio/base/cloud/spotadvisor
go: finding module for package github.com/grailbio/base/cloud/spotfeed
go/pkg/mod/github.com/grailbio/reflow@v0.0.0-20221206232358-04b01f719b84/ec2cluster/ec2cluster.go:33:2: module github.com/grailbio/base@latest found (v0.0.10), but does not contain package github.com/grailbio/base/cloud/spotadvisor
go/pkg/mod/github.com/grailbio/reflow@v0.0.0-20221206232358-04b01f719b84/tool/cost.go:15:2: module github.com/grailbio/base@latest found (v0.0.10), but does not contain package github.com/grailbio/base/cloud/spotfeed
go/pkg/mod/github.com/grailbio/reflow@v0.0.0-20221206232358-04b01f719b84/blob/s3blob/s3blob.go:27:2: module github.com/grailbio/base@latest found (v0.0.10), but does not contain package github.com/grailbio/base/s3util
$

My guess is that go.mod is not being kept in sync with the internal Bazel repo...

I eventually managed to get it to build after a series of guesses around package upgrades and some local patching, but by that point I had lost any confidence that my local sandbox bears any resemblance to what upstream uses. Belatedly, I realized I could perhaps have extracted an up-to-date go.mod from the buildinfo metadata embedded in the released binaries, but I ran out of the time dedicated to this experiment.
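For reference, a minimal sketch of the buildinfo idea. It assumes the released binary actually embeds Go module information, which was not verified here and may not hold for Bazel-built binaries. `go version -m` is the standard Go command for printing that metadata; the helper name `deps_to_require` is made up for illustration.

```shell
# Convert `go version -m <binary>` output (read on stdin) into
# go.mod-style require lines.
# Assumption: dependency lines look like "\tdep\t<module>\t<version>\t<hash>".
# Replacement lines ("=>") are ignored in this sketch.
deps_to_require() {
    awk '$1 == "dep" { printf "\t%s %s\n", $2, $3 }'
}

# Hypothetical usage, with the binary path from the transcript above:
#   go version -m ~/Downloads/reflow1.13.0.darwin.amd64 | deps_to_require
```

The output would still need to be wrapped in a `require ( ... )` block by hand and checked against what actually builds.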

Overall, a surprisingly poor experience for a project in its 1.x life phase. It's a shame because the technology seems interesting.

swami-m commented 1 year ago

@masiulaniec Not sure if you are still looking at using reflow, but for the original taskdb problem, I think the following solution might work:

> reflow config -marshal > /tmp/reflow_config
> vim /tmp/reflow_config # and add the following line

taskdb: noptaskdb

> reflow run -config /tmp/reflow_config -local hello.rf

Perhaps @fialhopm might be able to confirm.

fialhopm commented 1 year ago

Apologies for the very late response.

Unfortunately, specifying noptaskdb appears to not be sufficient to get hello.rf to work in local mode.

If you're using the us-east-1 region, then the following should solve the resources exhausted error:

> reflow config -marshal > /tmp/reflow_config
> vim /tmp/reflow_config # and remove the following instance types

  - c6a.32xlarge
  - c6a.48xlarge
  - c6id.32xlarge
  - g5.48xlarge
  - i4i.32xlarge
  - m6a.32xlarge
  - m6a.48xlarge
  - m6id.32xlarge
  - r6a.32xlarge
  - r6a.48xlarge
  - r6i.32xlarge
  - r6id.32xlarge
  - trn1.32xlarge
  - x2idn.32xlarge
  - x2iedn.32xlarge

> reflow -config /tmp/reflow_config run -local hello.rf

This will not work for other regions.
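The manual edit above could be scripted roughly as follows. This is only a sketch: it assumes the marshaled config lists each instance type on its own line in the `  - <type>` form shown above, and the function name `drop_instance_types` is made up for illustration.

```shell
# Instance types to remove, copied from the list above.
REMOVE_TYPES="c6a.32xlarge c6a.48xlarge c6id.32xlarge g5.48xlarge i4i.32xlarge"
REMOVE_TYPES="$REMOVE_TYPES m6a.32xlarge m6a.48xlarge m6id.32xlarge r6a.32xlarge r6a.48xlarge"
REMOVE_TYPES="$REMOVE_TYPES r6i.32xlarge r6id.32xlarge trn1.32xlarge x2idn.32xlarge x2iedn.32xlarge"

# Filter a config on stdin, dropping list items that name one of the
# types above. Matching is exact on the item text, so other instance
# types and unrelated lines pass through unchanged.
drop_instance_types() {
    awk -v types="$REMOVE_TYPES" '
        BEGIN { n = split(types, t, " "); for (i = 1; i <= n; i++) drop[t[i]] = 1 }
        NF == 2 && $1 == "-" && ($2 in drop) { next }
        { print }
    '
}

# Hypothetical usage:
#   reflow config -marshal | drop_instance_types > /tmp/reflow_config
```

If the config quotes the type names or lays them out differently, the filter would need adjusting.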

We'll include fixes for both issues in the next release, which will hopefully go out within the next month.