grailbio / reflow

A language and runtime for distributed, incremental data processing in the cloud
Apache License 2.0
965 stars 52 forks source link

reflow/syntax: fix digest computation for comprehension expressions #143

Closed prb2 closed 2 years ago

prb2 commented 2 years ago

Bug Description

Suppose we have two local files, foo/bar/file1 and foo/file2 in testdata/testdir and the following reflow code:

val d = dir("testdata/testdir")
val filelist = list(d)

Filelist is a list of two-element tuples; the first element of each tuple is a string path and the second is a file object. If we wanted to unzip filelist into two separate lists, one with the paths and one with the files, we could use a list comprehension like this:

val (paths, files) = ([p | (p, x) <- filelist], [f | (y, f) <- filelist])

As expected, the result when evaluated by reflow is:

paths = ["foo/bar/file1", "foo/file2"]
files = [file("foo/bar/file1")), file("foo/file2"))]

The above code is correct, but we didn’t actually use the pattern matched x and y variables. We could instead use the ignore pattern to achieve the same result like this:

val (paths, files) = ([p | (p, _) <- filelist], [f | (_, f) <- filelist])

In this version, we would expect the same values for path and files. However, due to a bug we actually get:

paths = ["foo/bar/file1", "foo/file2"]
files = ["foo/bar/file1", "foo/file2"]

The values of files is incorrect; it should not be the same as paths.

Root Cause & Fix

The root cause of this bug is that two different expressions, [p | (p, _) <- filelist] and [f | (_, f) <- filelist], are resolved to the same digest; thus the value computed for the expression that is evaluated first is reused for the expression evaluated second. Note that the list being processed must be a delayed flow in order for this bug to occur.

This diff fixes the bug by modifying the environment that is used to compute the digest for comprehensions.

Before the fix, each identifier in the comprehension's pattern was bound to digestN(j) where j is the index of the identifier in the pattern. Since the ignore pattern _ is ignored, it is not included in the identifiers. As a result, both (p, _) and (_, f) will bind to digestN(0) because there is only 1 non-empty identifier.

With this fix, we no longer bind identifiers to the environment using digestN(j). Instead, we bind them to a digest of their matcher's path. This path is based on the actual value being pattern matched and is unaffected by the use of ignore patterns. A visual example of the fix:

// before
(p, _) -> digestN(0)
(_, f) -> digestN(0)

// after
(p, _) -> digest([tuple(0) value])
(_, f) -> digest([tuple(1) value])
prb2 commented 2 years ago

Fixed by 0c83f90deba26933941b2378dd373fcdca8eac9e