dspinellis commented 4 years ago

Each thread maintains its own pool of plain variables and associative arrays
A reduction operator can be one of the following
- Ordered assignment (see below)
- Accumulation, i.e. the operations x++, ++x, x--, --x, x += value, x -= value
- One of a = min(a, expr), a = max(a, expr), a = and(a, expr), a = or(a, expr), a = xor(a, expr),
Variable refers to a plain variable, e.g. a or an associative array cell, e.g. a["apple"]
Each variable is associated with
- a file id (file ordinal number) and file offset, initially undefined
- an indicator of whether it is used for accumulation or assignment.
When an initialized variable is assigned its type indicator is set to the reduction operator used
Assigning to a initialized variable with a reduction operator different from the existing one results in a fatal error
Reading a variable's value, for an operation other than the use in a reduction operator (accumulation, min, max), results in a fatal error if the variable's associated file id/offset is different from the current file id/offset
Before executing the END block the following reduction takes place:
- Each variable/cell set typed for accumulation is summed across threads into a single value with the same name
- Each variable/cell set typed for min, max, and, or, xor has that function applied to the values across threads resulting in a single value with the same name
- The thread values of assignment reduction variable/array cell are ordered by ordinal file id and offset, and a single variable is created with the value of the last variable in the ordered set.
- If some variables in a set are typed differently from others, then a fatal error occurs.
Fatal errors indicate the reason, e.g. Line 4, record 54: assignment to an accumulator variable

Error examples and test cases

All examples should also work for an array cell, e.g. a[4].

Variable used in different blocks

NR == 1 { a = 1 }
NR == 2 { a = 2 }

Variable used for both assignment and accumulation

NR == 1 { a = 1 }
NR == 2 { a += 1 }

NR == 1 { a += 1 }
NR == 2 { a = 1 }

Reading from a different record

NR == 1 { a++ }
NR == 1000 { if (!a) print "Error" }

NR == 1 { a++ }
NR == 1000 && !a { print "Error" }

Correct examples and test cases

Variable only used for assignment

NR == 1 {a  = 1}
NR == 1000 {a  = 2}
END { if (a != 2) print "Error" }

Variable only used for accumulation

NR == 1 {a += 1}
NR == 1000 {a += 2}
NR == 2000 {a++}
NR == 3000 {a--}
NR == 4000 {++a}
END { if (a != 4) print "Error" }

Variable used only within the same record

{ a = $1; if (a) k++ }

Using a variable on the same record

!a  { a = $2 }
{ a = min(a, $2) }

Motivating example: descriptive statistics

!min_value { min_value = $1}
!max_value { max_value = $1}
{ 
  n++
  sum += $1
  min_value = min(min_value, $1)
  max_value = max(max_value, $1)
}
END { print "count=", n, "min=", min_value, "max=", max_value, "avg=", sum / n }

theosotr commented 4 years ago

From my understanding, some unsafe operations (e.g., assigning to an accumulator) can be detected by examining the syntax of the awk program.

But, I reckon that some checks cannot be performed statically e.g., you need dynamic analysis for the following case, right?

"Reading a variable's value, for an operation other than accumulation, results in a fatal error if the variable's associated file id/offset is different from the current file id/offset".

The problem with dynamic analysis is that depends on the current execution. You cannot guarantee that all executions of the program are error-free. For example, executions with different thread schedules may lead to different results.

I assume that allowing variable reads and writes only within the same block is too restrictive, isn't it?

gthd commented 4 years ago

@theosotr Actually, allowing variables reads and writes only in the same block, although restrictive, is the only way to guarrantee that there are no edge cases that have not been covered through dynamic analysis checking.

dspinellis commented 4 years ago

@theosotr Good point! If an error did not occur in a particular execution, then the result should be correct. But there's no guarantee that another execution will not fail. Static analysis could offer that guarantee. But the case you mention cannot be statically checked. Consider the following (correct) example, which could be checked with sophisticated static analysis.

NR == 1 { a = 1 }
NR + 100 == 101 { if (a) k++ }

Now , consider a predicate that cannot be statically checked, such as a regular expression.

# Should succeed
print 'hello\nworld' |
awk 'NR == 1 {a++} /hello/ && a {b++}'

# Should fail
print 'hello\nworld\nhello' |
awk 'NR == 1 {a++} /hello/ && a {b++}'

@gthd What edge cases do you have in mind? I'm OK with (more restrictive) static checking, because I think that in practice it won't be too restrictive. But I think it will be more difficult to implement.

dspinellis commented 4 years ago

I added two more reduction operators, min and max, and updated the description accordingly.

theosotr commented 4 years ago

@dspinellis Yes, static analysis does not work in this case.

Dynamic analysis can indeed verify whether a violation occurs but has the problem of non-deterministic results, e.g., in your example above the awk program remained the same and what actually changed was the input.

Typically, in many cases such guarantees are given (e.g., all programs are data race free) by design, e.g., by introducing language constructs or a type system.

gthd / pawk

Implement dynamic analysis #13

Error examples and test cases

Variable used in different blocks

Variable used for both assignment and accumulation

Reading from a different record

Correct examples and test cases

Variable only used for assignment

Variable only used for accumulation

Variable used only within the same record

Using a variable on the same record

Motivating example: descriptive statistics