gthd / pawk

A parallel awk-like language
Apache License 2.0
4 stars 1 forks source link

Implement dynamic analysis #13

Open dspinellis opened 4 years ago

dspinellis commented 4 years ago

Error examples and test cases

All examples should also work for an array cell, e.g. a[4].

Variable used in different blocks

NR == 1 { a = 1 }
NR == 2 { a = 2 }

Variable used for both assignment and accumulation

NR == 1 { a = 1 }
NR == 2 { a += 1 }
NR == 1 { a += 1 }
NR == 2 { a = 1 }

Reading from a different record

NR == 1 { a++ }
NR == 1000 { if (!a) print "Error" }
NR == 1 { a++ }
NR == 1000 && !a { print "Error" }

Correct examples and test cases

Variable only used for assignment

NR == 1 {a  = 1}
NR == 1000 {a  = 2}
END { if (a != 2) print "Error" }

Variable only used for accumulation

NR == 1 {a += 1}
NR == 1000 {a += 2}
NR == 2000 {a++}
NR == 3000 {a--}
NR == 4000 {++a}
END { if (a != 4) print "Error" }

Variable used only within the same record

{ a = $1; if (a) k++ }

Using a variable on the same record

!a  { a = $2 }
{ a = min(a, $2) }

Motivating example: descriptive statistics

!min_value { min_value = $1}
!max_value { max_value = $1}
{ 
  n++
  sum += $1
  min_value = min(min_value, $1)
  max_value = max(max_value, $1)
}
END { print "count=", n, "min=", min_value, "max=", max_value, "avg=", sum / n }
theosotr commented 4 years ago

From my understanding, some unsafe operations (e.g., assigning to an accumulator) can be detected by examining the syntax of the awk program.

But, I reckon that some checks cannot be performed statically e.g., you need dynamic analysis for the following case, right?

"Reading a variable's value, for an operation other than accumulation, results in a fatal error if the variable's associated file id/offset is different from the current file id/offset".

The problem with dynamic analysis is that depends on the current execution. You cannot guarantee that all executions of the program are error-free. For example, executions with different thread schedules may lead to different results.

I assume that allowing variable reads and writes only within the same block is too restrictive, isn't it?

gthd commented 4 years ago

@theosotr Actually, allowing variables reads and writes only in the same block, although restrictive, is the only way to guarrantee that there are no edge cases that have not been covered through dynamic analysis checking.

dspinellis commented 4 years ago

@theosotr Good point! If an error did not occur in a particular execution, then the result should be correct. But there's no guarantee that another execution will not fail. Static analysis could offer that guarantee. But the case you mention cannot be statically checked. Consider the following (correct) example, which could be checked with sophisticated static analysis.

NR == 1 { a = 1 }
NR + 100 == 101 { if (a) k++ }

Now , consider a predicate that cannot be statically checked, such as a regular expression.

# Should succeed
print 'hello\nworld' |
awk 'NR == 1 {a++} /hello/ && a {b++}'

# Should fail
print 'hello\nworld\nhello' |
awk 'NR == 1 {a++} /hello/ && a {b++}' 

@gthd What edge cases do you have in mind? I'm OK with (more restrictive) static checking, because I think that in practice it won't be too restrictive. But I think it will be more difficult to implement.

dspinellis commented 4 years ago

I added two more reduction operators, min and max, and updated the description accordingly.

theosotr commented 4 years ago

@dspinellis Yes, static analysis does not work in this case.

Dynamic analysis can indeed verify whether a violation occurs but has the problem of non-deterministic results, e.g., in your example above the awk program remained the same and what actually changed was the input.

Typically, in many cases such guarantees are given (e.g., all programs are data race free) by design, e.g., by introducing language constructs or a type system.