(formerly FloatTracker.jl)
Track NaN
and Inf
generation and propagation in your code.
Available on JuliaHub
# Pull in TrackedFloats
using TrackedFloats
# Wrap inputs in a TrackedFloat* type
num = TrackedFloat64(-42.0)
# Watch as a NaN gets born
should_be_nan = sqrt(num)
# Flush TrackedFloats's logs
tf_flush_logs()
TrackedFloats.jl
is a library that provides three new types: TrackedFloat16
, TrackedFloat32
, and TrackedFloat64
.
These behave just like their FloatN
counterparts except that they detect and log instances of exceptional floating point values. (E.g. NaN
or Inf
)
There are three kinds of events that can happen during the lifetime of an exceptional floating point value:
0.0 / 0.0 → NaN
)NaN + 2.0 → NaN
, 2.0 / 0.0 → Inf
)NaN > 1.0 → false
)These events are then stored in a buffered log and can be written out to a file during or after the execution of a program.
JuliaCon 2023 talk by Ashton:
using TrackedFloats
tf_config_logger(filename="max")
function maximum(lst)
max_seen = 0.0
for x in lst
if ! (x < max_seen)
max_seen = x # swap if new val greater
end
end
max_seen
end
function maximum2(lst)
foldl(max, lst)
end
println("--- With less than ---")
res = maximum(TrackedFloat32.([1, 5, 4, NaN, 4])).val
println("Result: $(res)")
println()
println("--- With builtin max ---")
res2 = maximum2(TrackedFloat32.([1, 5, 4, NaN, 4])).val
println("Result: $(res2)")
tf_flush_logs()
This code shows two different implementations of a max-element function.
One uses the builtin <
operator and the other uses Julia's max
function. When encountering a NaN
the <
will "kill" it (always returning false
) and the max
function will "prop" it (always returning back the NaN
).
Note that the result of this program is wrong: instead of the true maximum value of the list (5.0
) getting returned, the bad version returns 4.0
!
We can see this in the log that produced by TrackedFloats when running this file.
[NaN] check_error at /Users/ashton/.julia/dev/TrackedFloats/src/TrackedFloat.jl:11
< at /Users/ashton/.julia/dev/TrackedFloats/src/TrackedFloat.jl:214
maximum at /Users/ashton/Research/TrackedFloatsExamples/max_example.jl:0
top-level scope at /Users/ashton/Research/TrackedFloatsExamples/max_example.jl:20
eval at ./boot.jl:370
include_string at ./loading.jl:1899
_include at ./loading.jl:1959
include at ./Base.jl:457
exec_options at ./client.jl:307
_start at ./client.jl:522
…
This is an example of a program where two different implementations can result in a different answer when dealing with NaN
in the input. In a larger program, the presence of NaN
can produce incorrect results.
This tool may be useful for debugging those sorts of issues.
using TrackedFloats
; you may want to include functions like tf_enable_nan_injection
or tf_config_logger
or the like. (See below for more details.)TrackedFloatN
as you can.TrackedFloats should take care of the rest!
Digging into step 2, there are two things that you can customize after initialization:
The logger
Determines what and how events are captured and logged.
The injector
Optional—default is to not inject. If you want to try injecting NaNs to fuzz your code, this is where you control when that happens.
# Set log file basename to "whatever"; all log files have the timestamp prepended
tf_config_logger(filename="whatever")
# There are three kinds of events that we log:
# - `:gen` → when a NaN gets created from non-NaN arguments
# - `:prop` → when a NaN argument leads to a NaN result
# - `:kill` → when a NaN argument does *not* lead to a NaN result
#
# If logs are too noisy, we can disable some or all of the logs. For example,
# here we disable everything but NaN generation logging:
tf_exclude_stacktrace([:prop,:kill])
Keyword arguments for tf_config_logger
:
filename::String
Basename of the file to write logs to.
Constructors automatically prefix the timestamp to the beginning of this basename so the logs are grouped together chronologically.
buffersize::Int
Number of logs to buffer in memory before writing to file.
Defaults to 1000. Decrease if you are crashing without getting the logs that you need.
printToStdOut::Bool
Whether or not to write logs to STDOUT; defaults to false
.
cstg::Bool
Write logs in CSTG format.
cstgLineNum::Bool
Include the line number in CSTG output.
cstgArgs::Bool
Include arguments to functions in CSTG output.
maxFrames::Union{Int,Unbounded}
Maximum number of frames to print per even in the logs; defaults to Unbounded
.
maxLogs::Union{Int,Unbounded}
Maximum number of events to log; defaults to Unbounded
.
exclusions::Array{Symbol}
Events to not log; defaults to [:prop]
.
TrackedFloats can inject NaN
s at random points in your program to help you find places where you might not be handling exceptional values properly: this technique can help you find NaN
kills before they happen in a production environment.
# Inject 2 NaNs
tf_enable_nan_injection(2)
# Inject 2 NaNs, except when in the function "nope" in "way.jl"
tf_enable_nan_injection(n_inject=2, functions=[FunctionRef("nope", "way.jl")])
# Enable recording of injections
tf_record_injection("tf_recording") # this is just the file basename; will have timestamp prepended
# Enable recording playback
tf_reply_injection("20230530T145830-tf_recording.txt")
Keyword arguments for tf_config_injector
:
active::Boolean
inject only if true
n_inject::Int
maximum number of NaNs to inject; gets decremented every time
a NaN gets injected
odds::Int
inject a NaN with 1:odds probability—higher value → rarer to
inject
functions::Array{FunctionRef}
if given, only inject NaNs when within these
functions; default is to not discriminate on functions
libraries::Array{String}
if given, only inject NaNs when within this library.
record::String
if given, record injection invents in a way that can be
replayed later with the replay
argument.
replay::String
if given, ignore all previous directives and use this file
for injection replay.
functions
and libraries
work together as a union: i.e. the set of possible NaN
injection points is a union of the places matched by functions
and libraries
.
1.0 ^ NaN → 1.0
NaN ^ 0.0 → 1.0
1.0 < NaN → false
1.0 > NaN → false
1.0 <= NaN → false
1.0 >= NaN → false
Most of the time comparison operators are what kill a NaN. But ^
can kill NaNs too.
TrackedFloats allows you to fuzz code and inject NaNs or Infs wherever a TrackedFloat
type is used. Moreover, you can record these injections to rerun injections.
WARNING: it is critical that inputs to the program be exactly the same for recording and replaying to be consistent. The recordings are sensitive to the number of times a floating point operation is hit.
This diagram shows how we decide at a given point whether or not to inject a NaN:
The checks in the purple region cost the most time, so we do those as late as possible.
Sometimes we want to inject NaNs throughout the program. We can create a "recording session" that will before each injection check if that point has been tried before. If it has, we move on and try again at the next injection point.
We can tell TrackedFloats what we consider to be identical injection points. TODO: how do we tell TrackedFloats what we consider to be the same and not the same? Function boundaries?
During recording and replaying, we increment a counter each time a floating point operation happens. This doesn't add much overhead [citation needed] since we're already intercepting each of the floating point calls anyway—but it explains why we need to make sure our programs are deterministic before recording and replaying.
Injection points are saved to a recording file, where each line denotes an injection point. Example:
42, solve.jl, OrdinaryDiffEq::solve OrdinaryDiffEq::do_it Finch::make_it_so
The first field 42
is the injection point, or the nth time a floating point operation was intercepted by TrackedFloats. The second field solve.jl
acts as a little sanity check: this is the first non-TrackedFloats file off of the stack trace. After that comes a list of module names paired with the function on the call stack.
Get the CSTG code.
Run a program that uses TrackedFloats (e.g. from the example repository).
By default, a file with *error_log*
in its name should appear.
Generate a graph using the error log:
./path/to/tracerSum *-program_error_log.txt output_basename
# prints many lines to stdout
Open output_basename.pdf
to see the CSTG.
For more about CSTG, please see the original paper:
Alan Humphrey, Qingyu Meng, Martin Berzins, Diego Caminha Barbosa De Oliveira, Zvonimir Rakamaric, Ganesh Gopalakrishnan: Systematic Debugging Methods for Large-Scale HPC Computational Frameworks. Comput. Sci. Eng. 16(3): 48-56 (2014)
Examples have been moved from this repository to an example repository—this allows us to keep the dependencies in this repository nice and light.
TrackedFloats works on the CPU. If a Julia function calls a GPU kernel, then you can track exceptions inside the GPU execution using our companion tool GPU-FPX developed by Xinyi Li for her PhD. This will allow you to (1) see the exception flows inside the kernel, (2) whether the exceptions got killed inside the kernel, and if the exceptions were present in the return result of the Julia GPU call, then (3) TrackedFloats will show how that exception further flows through the Julia code. You get this full effect by running your Julia Command under LD_PRELOAD
.
For details of LD_PRELOAD
and to obtain and install GPU-FPX, please visit the GPU-FPX repository and ask its authors for assistance if needed. For help on using TrackedFloats in conjunction with this tool, talk to us.
You can run tests one of two ways:
$ julia --project=. test/runtests.jl
or via the Julia shell:
julia> ] # enter the package shell
pkg> activate .
(TrackedFloats) pkg> test
Please report bugs on the TrackedFloats issue tracker on GitHub. Likewise, if you wish to make a code contribution, please fork the repository and open a pull-request on GitHub.
MIT License
Inspired by Sherlogs.jl.
This repository originally lived in Taylor Allred's repository.
If you use this library in your work, please cite us:
(citation forthcoming)
This work was performed under the auspices of the U.S. Department of Energy by the University of Utah under Award DE-SC0022252 and NSF CISE Awards 1956106, 2124100, and 2217154.