hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io
Other
42.84k stars 9.56k forks source link

Too slow to compete in Advent of Code #30087

Closed sharebear closed 2 years ago

sharebear commented 2 years ago

Terraform Version

Terraform v1.0.10
on linux_amd64

Your version of Terraform is out of date! The latest version
is 1.0.11. You can update by downloading from https://www.terraform.io/downloads.html

Terraform Configuration Files

locals {
  input                   = file("${path.module}/input.txt")
  input_as_string_list    = compact(split("\n", local.input))
  number_of_boards        = (length(local.input_as_string_list) - 1) / 5
  board_start_lines       = [for v in range(local.number_of_boards) : 5 * v + 1]
  board_lines             = [for v in local.board_start_lines : slice(local.input_as_string_list, v, v + 5)]
  board_cells_strings     = [for v in local.board_lines : [for line in v : compact(split(" ", line))]]
  board_cells_numbers     = [for v in local.board_cells_strings : [for line in v : [for cell in line : tonumber(cell)]]]
  draws                   = [for token in split(",", local.input_as_string_list[0]) : tonumber(token)]
  individual_draw_results = [for draw in local.draws : [for board in local.board_cells_numbers : [for row in board : [for cell in row : cell == draw ? 1 : 0]]]]
  cumulative_draw_results = [for draw_idx, draw in local.draws : [for board_idx, board in local.board_cells_numbers : [for row_idx, row in board : [for cell_idx, cell in row : sum(
    [for d2 in slice(local.individual_draw_results, 0, draw_idx + 1) : d2[board_idx][row_idx][cell_idx]]
  )]]]]
  # Each board gets a 1 or 0 depending on wether it has a complete row... need to work out columns later.
  draw_board_status = [for draw in local.cumulative_draw_results : [for board in draw : sum([for row in board : sum(row) == 5 ? 1 : 0]) > 0 ? 1 : 0]]
  # Draw gets a 1 if there is a winning board after completion of the draw
  draw_status              = [for draw in local.draw_board_status : sum(draw) > 0 ? 1 : 0]
  winning_draw_index       = index(local.draw_status, 1)
  winning_board_index      = index(local.draw_board_status[local.winning_draw_index], 1)
  winning_unmarked_numbers = flatten([for row_idx, row in local.cumulative_draw_results[local.winning_draw_index][local.winning_board_index] : [for cell_idx, cell in row : local.board_cells_numbers[local.winning_board_index][row_idx][cell_idx] if cell == 0]])
  score                    = sum(local.winning_unmarked_numbers) * local.draws[local.winning_draw_index]
}

output "result_1" {
  value = local.score
}

Input files can be found here https://gitlab.com/sharebear/adventofcode2021/-/tree/main/Day4

Debug Output

Expected Behavior

Terraform should complete running within sensible amount of time.

Actual Behavior

I killed the process after running > 2 hours.

Steps to Reproduce

  1. terraform init
  2. terraform refresh

Additional Context

I'm fully aware that this is a complete misuse of Terraform but maybe there are performance enhancements here that could be of use for regular usage too.

I've started trying to solve Advent Of Code 2021 using terraform, my code can be found in this repo https://gitlab.com/sharebear/adventofcode2021 . Experience was a fun challenge but I did note that when running with the full input instead of just the small example things took quite some time. For example Day 2 also took just under 10 seconds to complete.

When I got to Day 4, I have a solution that's not quite complete (doesn't detect vertical bingo lines at the time of posting) but works fine with the example.txt input. Unfortunately, when running with the full input it's taking too long. I lost patience after 2 hours and cancelled the process so that I could create this report before I head out for the evening, I will probably leave it to run overnight to see if it can complete. Or have I somehow hit some kind of deadlock?

I might try and find some time tomorrow to comment the code to make it clearer what I was trying to achieve in each step.

Just to be clear, I'm totally prepared for this to be closed as it's a complete misuse of Terraform but personally I do feel like it's an interesting stress-test.

References

sharebear commented 2 years ago

FYI: Running overnight it did complete

Executed in 518,01 mins

apparentlymart commented 2 years ago

Thanks for sharing this, @sharebear!

As you noted, the Terraform language isn't intended for general computation and so raw performance for CPU-bound work is not something we typically prioritize; normal use of Terraform is primarily I/O bound by network requests to remote APIs, and so "normal" computation in a Terraform configuration is typically fast enough to not be the significant bottleneck.

As you say though, it can be interesting to use examples like these to see if there is some low-hanging fruit somewhere in the language runtime. I can't promise that we'll spend a lot of time on this, since of course there's an opportunity cost vs. other work we could be doing, but if a small profiling effort turns up something that is easy to improve without hurting maintainability then hopefully we can improve it.

With that said, I would not expect that such an improvement would come in time to help with continued participation in this year's Advent of Code.

Thanks again!

sharebear commented 2 years ago

Thanks for the response @apparentlymart!

I can't promise that we'll spend a lot of time on this, since of course there's an opportunity cost vs. other work we could be doing

I totally understand, but I can dream that perhaps someone gets bored and curious during the holiday down-time ;)

With that said, I would not expect that such an improvement would come in time to help with continued participation in this year's Advent of Code.

Of course not, but I wouldn't say no to a highly experimental branch I had to build myself.

Quick, completely unscientific, black box observation (I've never looked to the internals of Terraform itself), I see lots of processes being spun up when running and a quick peek with strace revealed lots of mutex stuff that I assume is coordination between the processes. I can understand why a for_each on a resource/datasource would benefit from parallelisation into multiple processes due to the I/O bound nature of the work, but why would a for expression do the same? (Apologies if my assumption here of what is happening is incorrect).

jbardin commented 2 years ago

@sharebear, since this configuration contains no resources, and hence no external providers, the only process you should be seeing is Terraform itself. The CLI will still spawn a child process in this version for presentation purposes, but there is still only one main terraform process. What you may be seeing are the OS threads started by the Go runtime itself.

A high level of concurrency is inherit in the design of Terraform, since in most use cases there are going to be many operations on resources which are relatively slow to complete, and can often be done in parallel. This design is reflected here by each local value being an independent node in the dependency graph, which may be evaluated concurrently if there is no blocking dependency. A trivial example would be:

locals {
  a = [ for i in range(100) : i ]
  b = [ for i in range(100) : i ]
  c = zipmap(local.a, local.b)
}

Where local.a and local.b would be evaluated concurrently, while local.c waits for the two results.

While this is obviously not much of a useful optimization when no resources are involved; locals, variables and outputs may be composed of any number of resources and fed into other resources, so it's still very useful in the overall design to maintain concurrency. The individual expressions are still (normally) evaluated in a single thread of execution (though there's no reason hcl functions can't make use of concurrency internally too). The overhead of any extra concurrency and synchronization is going to be far outweighed by the fact that the fundamental data structures we use are not optimal for large volumes of data and computation.

apparentlymart commented 2 years ago

Hi @sharebear,

Some members of our team were able to spend a little time profiling this in the meantime, and we've reached the conclusion that the main overhead here arises from the fact that the Terraform language uses arbitrary-precision numbers rather than fixed-range types natively supported by typical CPUs (integers and floating point numbers).

While that additional overhead is generally immaterial for intended uses of Terraform -- local calculations are dominated by API requests over the network -- what you tried here is a case where that overhead is both the bound on performance and also happening at a sufficient scale to cause a particularly degenerate runtime.

In the implementation details here, Terraform is ultimately using the Go standard library's big.Float, which is also the basis of the Go compiler's unified numeric constants, with similar underlying design goals.

The Go language has far greater constraints on what sorts of computation you can perform at compile time, so it's harder to find a degenerate case like this one, but the deepest computations in that library are already optimized to the point of having specialized assembly code for each common CPU architecture (including amd64) and so I expect there isn't a lot we could do to improve performance of this program without some drastic changes to the fundamentals of the Terraform language.

Because the Terraform language is not intended for general computation of this sort, and because the generalization of all numbers into a single type is an intentional design tradeoff for usability, I don't think there's any reasonable path forward to a material performance improvement for this unusual Terraform configuration, and so I'm going to close this issue.

However, I do appreciate you raising it because the exercise of understanding the root cause of this caused us to improve our understanding of the practical edges of the Terraform language's capability and to know specifically that the language is not suited to high-volume numeric work, whereas before we only guessed that to be true.

Thanks again for sharing this!

sharebear commented 2 years ago

Thanks again for the detailed response, and thanks for giving it a proper look. It was a fun experiment and I also learnt a few other things about terraform along the way.

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.