tenstorrent / tt-metal

TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Large regression in build times #10360

Open TT-billteng opened 2 months ago

TT-billteng commented 2 months ago

From this commit: https://github.com/tenstorrent/tt-metal/actions/runs/9899996339

image

Previous commit: https://github.com/tenstorrent/tt-metal/actions/runs/9899416070

image

ayerofieiev-tt commented 1 month ago

I am still on it. Definitely want to improve this.

mywoodstock commented 1 month ago

Hello! Is there an ETA on this? It's quite time-consuming, particularly during debug :(

ayerofieiev-tt commented 1 month ago

Working on it now. I expect this to be improved before July 31.

ayerofieiev-tt commented 1 month ago

Analysis

**** Time summary:
Compilation (1050 times):
  Parsing (frontend):        10253.1 s
  Codegen & opts (backend):   5081.1 s

**** Files that took longest to parse (compiler frontend):
140304 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/pybind11/__init__.cpp.o
 89999 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/eltwise/unary_backward/device/unary_backward_op.cpp.o
 71581 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/deprecated/tt_dnn/op_library/composite/composite_ops.cpp.o
 63373 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/eltwise/unary/device/unary_composite_op.cpp.o
 50798 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/eltwise/binary_backward/device/binary_backward_op.cpp.o
 50004 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/eltwise/binary/device/binary_composite_op.cpp.o
 37072 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/deprecated/tt_dnn/op_library/optimizer/optimizer_ops.cpp.o
 36161 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/deprecated/tt_dnn/op_library/complex/complex_ops.cpp.o
 35906 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/eltwise/ternary_backward/device/ternary_backward_op.cpp.o
 35794 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/eltwise/complex_unary/device/complex_unary_op.cpp.o

**** Files that took longest to codegen (compiler backend):
227094 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/pybind11/__init__.cpp.o
162286 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/eltwise/unary_backward/device/unary_backward_op.cpp.o
 91557 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/eltwise/unary/device/unary_composite_op.cpp.o
 79705 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/deprecated/tt_dnn/op_library/composite/composite_ops.cpp.o
 63017 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/eltwise/binary/device/binary_composite_op.cpp.o
 62228 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/eltwise/binary_backward/device/binary_backward_op.cpp.o
 43643 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/tensor/tensor_impl.cpp.o
 31801 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/deprecated/tt_lib/csrc/tt_lib_bindings_tensor.cpp.o
 25349 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/eltwise/ternary/ternary_composite_op.cpp.o
 25329 ms: build_Release/ttnn/CMakeFiles/ttnn.dir/cpp/ttnn/operations/experimental/reduction/argmax/argmax.cpp.o

**** Templates that took longest to instantiate:
376251 ms: nlohmann::basic_json<>::parse<const char *> (484 times, avg 777 ms)
318066 ms: nlohmann::detail::parser<nlohmann::basic_json<>, nlohmann::detail::i... (484 times, avg 657 ms)
304649 ms: tt::tt_metal::operation::launch_op<(lambda at ../ttnn/cpp/ttnn/decor... (1347 times, avg 226 ms)
275091 ms: nlohmann::detail::parser<nlohmann::basic_json<>, nlohmann::detail::i... (484 times, avg 568 ms)
250670 ms: fmt::detail::vformat_to<char> (713 times, avg 351 ms)
235442 ms: std::__function::__func<(lambda at ../ttnn/cpp/ttnn/run_operation_in... (5844 times, avg 40 ms)
230447 ms: std::__function::__func<(lambda at ../ttnn/cpp/ttnn/run_operation_in... (5844 times, avg 39 ms)
228224 ms: tt::stl::reflection::Attribute::Attribute<const tt::tt_metal::Memory... (161 times, avg 1417 ms)
219988 ms: fmt::detail::value<fmt::basic_format_context<fmt::appender, char>>::... (161 times, avg 1366 ms)
219941 ms: fmt::formatter<tt::tt_metal::MemoryConfig>::format (161 times, avg 1366 ms)
219547 ms: tt::stl::reflection::operator<<<tt::tt_metal::MemoryConfig> (161 times, avg 1363 ms)
217971 ms: fmt::format<const tt::tt_metal::MemoryConfig &> (161 times, avg 1353 ms)
205979 ms: std::make_shared<std::function<void (tt::tt_metal::Device *)>, (lamb... (1948 times, avg 105 ms)
205149 ms: std::allocate_shared<std::function<void (tt::tt_metal::Device *)>, s... (1948 times, avg 105 ms)
202809 ms: std::__shared_ptr_emplace<std::function<void (tt::tt_metal::Device *... (1948 times, avg 104 ms)
196015 ms: std::make_shared<std::function<void ()>, (lambda at ../ttnn/cpp/ttnn... (1948 times, avg 100 ms)
195277 ms: std::allocate_shared<std::function<void ()>, std::allocator<std::fun... (1948 times, avg 100 ms)
193894 ms: std::__shared_ptr_emplace<std::function<void ()>, std::allocator<std... (1948 times, avg 99 ms)
189960 ms: std::function<void (tt::tt_metal::Device *)>::function<(lambda at ..... (1948 times, avg 97 ms)
189060 ms: std::__function::__value_func<void (tt::tt_metal::Device *)>::__valu... (1948 times, avg 97 ms)
187463 ms: std::__function::__value_func<void (tt::tt_metal::Device *)>::__valu... (1948 times, avg 96 ms)
186002 ms: std::function<void ()>::function<(lambda at ../ttnn/cpp/ttnn/run_ope... (1948 times, avg 95 ms)
185125 ms: std::__function::__value_func<void ()>::__value_func<(lambda at ../t... (1948 times, avg 95 ms)
184167 ms: nlohmann::basic_json<>::basic_json (2420 times, avg 76 ms)
183495 ms: std::__function::__value_func<void ()>::__value_func<(lambda at ../t... (1948 times, avg 94 ms)
170958 ms: tt::stl::reflection::operator<<<tt::tt_metal::ShardSpec> (323 times, avg 529 ms)
144061 ms: tt::tt_metal::operation::run_with_autoformat<tt::tt_metal::EltwiseBi... (57 times, avg 2527 ms)
144039 ms: tt::tt_metal::operation::DeviceOperation<>::DeviceOperation<tt::tt_m... (57 times, avg 2527 ms)
141425 ms: std::__function::__alloc_func<(lambda at ../ttnn/cpp/ttnn/run_operat... (5844 times, avg 24 ms)
139156 ms: std::__function::__alloc_func<(lambda at ../ttnn/cpp/ttnn/run_operat... (5844 times, avg 23 ms)

**** Template sets that took longest to instantiate:
1398172 ms: std::function<$>::function<$> (14534 times, avg 96 ms)
1391572 ms: std::__function::__value_func<$>::__value_func<$> (14534 times, avg 95 ms)
1180167 ms: std::__function::__func<$>::__func (14534 times, avg 81 ms)
1058643 ms: std::__function::__alloc_func<$>::__alloc_func (43602 times, avg 24 ms)
897001 ms: std::make_shared<$> (10473 times, avg 85 ms)
891343 ms: std::allocate_shared<$> (10473 times, avg 85 ms)
852605 ms: std::__shared_ptr_emplace<$>::__shared_ptr_emplace<$> (10471 times, avg 81 ms)
823484 ms: std::forward_as_tuple<$> (64860 times, avg 12 ms)
703333 ms: tt::tt_metal::operation::DeviceOperation<$>::DeviceOperation<$> (713 times, avg 986 ms)
633144 ms: std::__function::__func<$>::__clone (29068 times, avg 21 ms)
602204 ms: fmt::format<$> (7934 times, avg 75 ms)
550662 ms: fmt::detail::value<$>::format_custom_arg<$> (1510 times, avg 364 ms)
549691 ms: fmt::formatter<$>::format (1485 times, avg 370 ms)
541627 ms: std::tuple<$> (75455 times, avg 7 ms)
532646 ms: std::vector<$>::__swap_out_circular_buffer (9173 times, avg 58 ms)
528390 ms: tt::tt_metal::operation::run<$> (693 times, avg 762 ms)
508982 ms: magic_enum::detail::values<$> (1338 times, avg 380 ms)
504756 ms: magic_enum::detail::valid_count<$> (1338 times, avg 377 ms)
488879 ms: magic_enum::enum_name<$> (1241 times, avg 393 ms)
484959 ms: tt::stl::reflection::get_attributes<$> (715 times, avg 678 ms)
475326 ms: magic_enum::enum_index<$> (1241 times, avg 383 ms)
472541 ms: std::__compressed_pair<$>::__compressed_pair<$> (217047 times, avg 2 ms)
452246 ms: tt::tt_metal::operation::launch_op<$> (1948 times, avg 232 ms)
440426 ms: tt::stl::reflection::Attribute::Attribute<$> (1219 times, avg 361 ms)
421177 ms: magic_enum::detail::is_valid<$> (340212 times, avg 1 ms)
411554 ms: tf::Executor::run_n<$> (1904 times, avg 216 ms)
410384 ms: tf::Executor::run_until<$> (1904 times, avg 215 ms)
376251 ms: nlohmann::basic_json<$>::parse<$> (484 times, avg 777 ms)
371776 ms: tf::Topology::Topology<$> (1904 times, avg 195 ms)
369926 ms: std::__uninitialized_allocator_move_if_noexcept<$> (9159 times, avg 40 ms)

**** Functions that took longest to compile:
  2007 ms: ttnn::all_gather_multi_core_with_workers(tt::tt_metal::Tensor const&... (../ttnn/cpp/ttnn/operations/ccl/all_gather/device/multi_core/all_gather_op_multi_core.cpp)
  1260 ms: reuse_mcast_optimized_helpers::create_program_mcast_in0_in1(tt::tt_m... (../ttnn/cpp/ttnn/operations/matmul/device/multi_core_reuse_mcast_2d_optimized/bmm_op_multi_core_reuse_mcast_2d_optimized.cpp)
  1114 ms: ttnn::operations::normalization::layernorm_multi_core_sharded(tt::tt... (../ttnn/cpp/ttnn/operations/normalization/layernorm/device/multi_core/layernorm_op_multi_core.cpp)
   872 ms: ttnn::operations::normalization::groupnorm_multi_core_sharded(tt::tt... (../ttnn/cpp/ttnn/operations/normalization/groupnorm/device/multi_core/groupnorm_op_multi_core.cpp)
   867 ms: create_program_mcast_in0_in1(tt::tt_metal::Device*, MathFidelity, tt... (../tests/tt_metal/tt_metal/perf_microbenchmark/old/matmul/matmul_global_l1.cpp)
   827 ms: tt::tt_metal::multi_core_optimized_conv_(tt::tt_metal::Tensor const&... (../ttnn/cpp/ttnn/operations/conv2d/device/multi_core_optimized_conv/optimized_conv_op.cpp)
   696 ms: tt::tt_metal::multi_core_optimized_conv_sharded_v2_impl(tt::tt_metal... (../ttnn/cpp/ttnn/operations/conv2d/device/multi_core_optimized_conv_sharded/optimized_conv_op_sharded_v2.cpp)
   672 ms: tt::tt_metal::multi_core_optimized_conv_sharded_(tt::tt_metal::Tenso... (../ttnn/cpp/ttnn/operations/conv2d/device/multi_core_optimized_conv_sharded/optimized_conv_op_sharded.cpp)
   576 ms: tt::tt_metal::detail::convert_numpy_tensor_to_tt_tensor(pybind11::ha... (../ttnn/cpp/ttnn/deprecated/tt_lib/csrc/tt_lib_bindings_tensor_pytensor.cpp)
   557 ms: reuse_mcast_1d_optimized_helpers::create_program_mcast_in0(tt::tt_me... (../ttnn/cpp/ttnn/operations/matmul/device/multi_core_reuse_mcast_1d_optimized/bmm_op_multi_core_reuse_mcast_1d_optimized.cpp)
   543 ms: reuse_dram_sharded_optimized_helpers::create_program_dram_sharded(tt... (../ttnn/cpp/ttnn/operations/matmul/device/multi_core_reuse_mcast_dram_sharded_optimized/bmm_op_multi_core_reuse_dram_sharded_optimized.cpp)
   541 ms: tt::operations::primary::sdpa_decode_multi_core(tt::tt_metal::Tensor... (../ttnn/cpp/ttnn/deprecated/tt_dnn/op_library/sdpa/multi_core/sdpa_decode_op_multi_core.cpp)
   469 ms: ttnn::operations::creation::py_module(pybind11::module_&) (../ttnn/cpp/pybind11/__init__.cpp)
   446 ms: tt::tt_metal::untilize_with_halo_multi_core_s2(tt::tt_metal::Tensor ... (../ttnn/cpp/ttnn/deprecated/tt_dnn/op_library/untilize/untilize_with_halo_op.cpp)
   443 ms: tt::tt_metal::untilize_with_halo_multi_core_s1(tt::tt_metal::Tensor ... (../ttnn/cpp/ttnn/deprecated/tt_dnn/op_library/untilize/untilize_with_halo_op.cpp)
   440 ms: AllGatherUtils_OutputTensorShardAddrGenArgGenerator_ComputeWorkerDes... (../tests/tt_eager/ops/ccl/test_all_gather_utils.cpp)
   434 ms: main (../tests/tt_metal/tt_metal/perf_microbenchmark/routing/test_tunnel_2cq.cpp)
   404 ms: tt::tt_metal::TensorModule(pybind11::module_&) (../ttnn/cpp/ttnn/deprecated/tt_lib/csrc/tt_lib_bindings_tensor.cpp)
   397 ms: ttnn::ccl::reduce_scatter_detail::reduce_scatter_with_workers(std::_... (../ttnn/cpp/ttnn/operations/ccl/reduce_scatter/device/host/reduce_scatter_full_worker_grid.cpp)
   391 ms: tt::tt_metal::Device::setup_tunnel_for_remote_devices() (../tt_metal/impl/device/device.cpp)
   385 ms: reuse_mcast_1d_optimized_helpers::create_program_mcast_in1(tt::tt_me... (../ttnn/cpp/ttnn/operations/matmul/device/multi_core_reuse_mcast_1d_optimized/bmm_op_multi_core_reuse_mcast_1d_optimized.cpp)
   376 ms: tt::tt_metal::EnqueueProgramCommand::assemble_device_commands(bool, ... (../tt_metal/impl/dispatch/command_queue.cpp)
   371 ms: main (../tests/tt_metal/tt_metal/test_bcast.cpp)
   339 ms: create_program(tt::tt_metal::Device*, tt::DataFormat, MathFidelity, ... (../tests/tt_metal/tt_metal/perf_microbenchmark/1_compute_mm/test_compute_mm.cpp)
   327 ms: main (../tests/tt_metal/tt_metal/perf_microbenchmark/routing/test_tunnel_1cq.cpp)
   314 ms: main (../tests/tt_metal/tt_metal/perf_microbenchmark/routing/test_uni_tunnel_single_chip.cpp)
   310 ms: tt_ClusterDescriptor::get_ethernet_link_coord_distance(std::__1::tup... (../tt_metal/third_party/umd/device/tt_cluster_descriptor.cpp)
   291 ms: tt::tt_metal::sliding_window::generate_halo_kernel_config_tensors(st... (../ttnn/cpp/ttnn/deprecated/tt_dnn/op_library/sliding_window_op_infra/sliding_window.cpp)
   286 ms: tt_ClusterDescriptor::load_ethernet_connections_from_connectivity_de... (../tt_metal/third_party/umd/device/tt_cluster_descriptor.cpp)
   277 ms: tt::tt_metal::detail::TensorModuleCompositeOPs(pybind11::module_&) (../ttnn/cpp/ttnn/deprecated/tt_lib/csrc/tt_lib_bindings_tensor_composite_ops.cpp)

**** Function sets that took longest to compile / optimize:
 66733 ms: fmt::v8::appender fmt::v8::detail::write_int_noinline<$>(fmt::v8::ap... (1383 times, avg 48 ms)
 28475 ms: fmt::v8::detail::format_dragon(fmt::v8::detail::fp, bool, int, fmt::... (461 times, avg 61 ms)
 22415 ms: fmt::v8::appender fmt::v8::detail::do_write_float<$>(fmt::v8::append... (922 times, avg 24 ms)
 18649 ms: void fmt::v8::detail::vformat_to<$>(fmt::v8::detail::buffer<$>&, fmt... (461 times, avg 40 ms)
 13723 ms: fmt::v8::detail::dragonbox::decimal_fp<$> fmt::v8::detail::dragonbox... (461 times, avg 29 ms)
 12445 ms: void fmt::v8::detail::vformat_to<$>(fmt::v8::detail::buffer<$>&, fmt... (461 times, avg 26 ms)
 12202 ms: int fmt::v8::detail::format_float<$>(double, int, fmt::v8::detail::f... (461 times, avg 26 ms)
 11317 ms: fmt::v8::appender fmt::v8::detail::write_significand<$>(fmt::v8::app... (461 times, avg 24 ms)
 11114 ms: fmt::v8::appender fmt::v8::detail::do_write_float<$>(fmt::v8::append... (461 times, avg 24 ms)
 10770 ms: char fmt::v8::detail::write_padded<$>(char, fmt::v8::basic_format_sp... (922 times, avg 11 ms)
 10092 ms: char const* fmt::v8::detail::parse_replacement_field<$>(char const*,... (461 times, avg 21 ms)
  9133 ms: fmt::v8::basic_memory_buffer<$>::grow(unsigned long) (1383 times, avg 6 ms)
  8534 ms: int fmt::v8::detail::snprintf_float<$>(long double, int, fmt::v8::de... (461 times, avg 18 ms)
  8518 ms: char const* fmt::v8::detail::do_parse_arg_id<$>(char const*, char co... (936 times, avg 9 ms)
  8142 ms: int fmt::v8::detail::snprintf_float<$>(double, int, fmt::v8::detail:... (461 times, avg 17 ms)
  7829 ms: fmt::v8::detail::dragonbox::decimal_fp<$> fmt::v8::detail::dragonbox... (461 times, avg 16 ms)
  7590 ms: void std::__1::__hash_table<$>::__do_rehash<$>(unsigned long) (752 times, avg 10 ms)
  7397 ms: fmt::v8::appender fmt::v8::detail::write<$>(fmt::v8::appender, __int... (461 times, avg 16 ms)
  6512 ms: fmt::v8::appender fmt::v8::detail::write<$>(fmt::v8::appender, unsig... (461 times, avg 14 ms)
  6410 ms: fmt::v8::appender fmt::v8::detail::do_write_float<fmt::v8::appender,... (461 times, avg 13 ms)
  6381 ms: void fmt::v8::detail::vformat_to<$>(fmt::v8::detail::buffer<$>&, fmt... (461 times, avg 13 ms)
  6153 ms: fmt::v8::appender fmt::v8::detail::write_padded<$>(fmt::v8::appender... (461 times, avg 13 ms)
  5891 ms: tt::assert::backtrace_to_string(int, int, std::__1::basic_string<$> ... (440 times, avg 13 ms)
  5588 ms: fmt::v8::appender fmt::v8::detail::write_int_localized<$>(fmt::v8::a... (461 times, avg 12 ms)
  5526 ms: fmt::v8::appender fmt::v8::detail::fill<$>(fmt::v8::appender, unsign... (461 times, avg 11 ms)
  5320 ms: fmt::v8::appender fmt::v8::detail::write<$>(fmt::v8::appender, long ... (461 times, avg 11 ms)
  5289 ms: fmt::v8::appender fmt::v8::detail::write_ptr<$>(fmt::v8::appender, u... (461 times, avg 11 ms)
  5263 ms: fmt::v8::appender fmt::v8::detail::write_padded<$>(fmt::v8::appender... (461 times, avg 11 ms)
  5015 ms: fmt::v8::detail::dragonbox::decimal_fp<$> fmt::v8::detail::write_pad... (922 times, avg 5 ms)
  5015 ms: char const* fmt::v8::detail::do_parse_arg_id<$>(char const*, char co... (461 times, avg 10 ms)

**** Expensive headers:
1325140 ms: ../tt_metal/host_api.hpp (included 196 times, avg 6760 ms), included via:
  60x: <direct include>
  34x: tt_metal.hpp command_queue.hpp 
  14x: device_fixture.hpp 
  11x: command_queue_fixture.hpp tt_metal.hpp command_queue.hpp 
  9x: common_fixture.hpp tt_metal.hpp command_queue.hpp 
  7x: basic_fixture.hpp 
  ...

1038242 ms: ../tt_metal/impl/program/program.hpp (included 199 times, avg 5217 ms), included via:
  60x: host_api.hpp 
  33x: tt_metal.hpp command_queue.hpp host_api.hpp 
  14x: device_fixture.hpp host_api.hpp 
  11x: command_queue_fixture.hpp tt_metal.hpp command_queue.hpp host_api.hpp 
  9x: common_fixture.hpp tt_metal.hpp command_queue.hpp host_api.hpp 
  7x: core_coord_fixture.hpp host_api.hpp 
  ...

1037984 ms: ../tt_metal/impl/device/device.hpp (included 205 times, avg 5063 ms), included via:
  58x: host_api.hpp program.hpp circular_buffer.hpp 
  28x: tt_metal.hpp command_queue.hpp host_api.hpp program.hpp circular_buffer.hpp 
  14x: device_fixture.hpp host_api.hpp program.hpp circular_buffer.hpp 
  11x: command_queue_fixture.hpp tt_metal.hpp command_queue.hpp host_api.hpp program.hpp circular_buffer.hpp 
  9x: common_fixture.hpp tt_metal.hpp command_queue.hpp host_api.hpp program.hpp circular_buffer.hpp 
  7x: core_coord_fixture.hpp host_api.hpp program.hpp circular_buffer.hpp 
  ...

898947 ms: ../tt_metal/detail/tt_metal.hpp (included 428 times, avg 2100 ms), included via:
  103x: <direct include>
  47x: run_operation.hpp run_operation_inl.hpp 
  18x: device_fixture.hpp 
  11x: command_queue_fixture.hpp 
  10x: common_fixture.hpp 
  10x: moreh_sum_op.hpp run_operation.hpp run_operation_inl.hpp 
  ...

824542 ms: ../tt_metal/impl/buffers/circular_buffer.hpp (included 199 times, avg 4143 ms), included via:
  59x: host_api.hpp program.hpp 
  33x: tt_metal.hpp command_queue.hpp host_api.hpp program.hpp 
  14x: device_fixture.hpp host_api.hpp program.hpp 
  11x: command_queue_fixture.hpp tt_metal.hpp command_queue.hpp host_api.hpp program.hpp 
  9x: common_fixture.hpp tt_metal.hpp command_queue.hpp host_api.hpp program.hpp 
  7x: core_coord_fixture.hpp host_api.hpp program.hpp 
  ...

642419 ms: ../tt_metal/jit_build/build.hpp (included 209 times, avg 3073 ms), included via:
  58x: host_api.hpp program.hpp circular_buffer.hpp device.hpp 
  27x: tt_metal.hpp 
  14x: device_fixture.hpp host_api.hpp program.hpp circular_buffer.hpp device.hpp 
  11x: command_queue_fixture.hpp tt_metal.hpp 
  9x: common_fixture.hpp tt_metal.hpp 
  7x: dprint_fixture.hpp common_fixture.hpp tt_metal.hpp 
  ...

548012 ms: ../tt_metal/common/core_coord.h (included 217 times, avg 2525 ms), included via:
  57x: host_api.hpp 
  25x: tt_metal.hpp build.hpp 
  14x: device_fixture.hpp host_api.hpp 
  11x: <direct include>
  11x: command_queue_fixture.hpp tt_metal.hpp build.hpp 
  9x: common_fixture.hpp tt_metal.hpp build.hpp 
  ...

505059 ms: ../tt_metal/impl/dispatch/command_queue.hpp (included 430 times, avg 1174 ms), included via:
  92x: tt_metal.hpp 
  47x: run_operation.hpp run_operation_inl.hpp tt_metal.hpp 
  18x: device_fixture.hpp tt_metal.hpp 
  11x: command_queue_fixture.hpp tt_metal.hpp 
  10x: common_fixture.hpp tt_metal.hpp 
  10x: moreh_sum_op.hpp run_operation.hpp run_operation_inl.hpp tt_metal.hpp 
  ...

461518 ms: ../ttnn/cpp/ttnn/run_operation.hpp (included 263 times, avg 1754 ms), included via:
  48x: <direct include>
  10x: moreh_sum_op.hpp 
  10x: bcast_op.hpp 
  10x: binary.hpp binary_device_operation.hpp unary_op.hpp 
  9x: nlp_tms.hpp 
  7x: matmul_op.hpp unary_op.hpp 
  ...

460025 ms: ../tt_metal/llrt/tt_cluster.hpp (included 208 times, avg 2211 ms), included via:
  58x: host_api.hpp program.hpp circular_buffer.hpp device.hpp 
  27x: tt_metal.hpp command_queue.hpp host_api.hpp program.hpp circular_buffer.hpp device.hpp 
  14x: device_fixture.hpp host_api.hpp program.hpp circular_buffer.hpp device.hpp 
  11x: command_queue_fixture.hpp tt_metal.hpp command_queue.hpp host_api.hpp program.hpp circular_buffer.hpp device.hpp 
  9x: common_fixture.hpp tt_metal.hpp command_queue.hpp host_api.hpp program.hpp circular_buffer.hpp device.hpp 
  7x: core_coord_fixture.hpp host_api.hpp program.hpp circular_buffer.hpp device.hpp 
  ...

  done in 18.5s.

ayerofieiev-tt commented 1 month ago

Current state https://github.com/tenstorrent/tt-metal/actions/runs/10202228369

Screenshot 2024-08-01 at 10 30 36 AM

ayerofieiev-tt commented 1 month ago

Over the last few weeks, build time increased from 5-6 minutes to 8-9 minutes and then incrementally to 12-13 minutes.

Why did this happen?

The bump from 5 to 8 happened when we merged tt_eager and ttnn into a single library. I think the main factors are the removal of some headers from the PCH and the linker now having to crunch through far more data.

The bump from 8 to 12 is related to the continuous migration of operations to ttnn. Two things about it make matters worse:

This is the result of -ftime-trace for a single object file, and you can see:

Screenshot 2024-07-29 at 3 41 05 PM (1)

At the same time, we have a small number of object files, which means that some processes can't run in parallel.

What's next?

At a minimum, we aim to get back to 8m (CI time). I attempted to add some headers to the PCH, but it did not improve the situation. I have a feeling that I missed something, so this might be low-hanging fruit.

There is an ongoing effort to reduce the amount of code in headers.
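As an illustration of what reducing code in headers can look like, here is a minimal sketch of two common techniques: forward-declaring a type instead of including its header, and hiding members behind a pointer so that definitions move into a single .cpp file. The names `Device`, `Program`, `Impl`, and `describe` are hypothetical stand-ins and not the real tt-metal classes, and the three "files" are shown in one listing only to keep the example self-contained.

```cpp
// Sketch only: all types here are hypothetical, not the tt-metal API.
// In a real project the three sections would be separate files.
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// ---- program.hpp: the part every includer has to parse ----
class Device;  // forward declaration instead of #include "device.hpp"

class Program {
public:
    explicit Program(int id);
    ~Program();  // defined in the .cpp, where Impl is a complete type
    std::string describe(const Device& device) const;  // declaration only

private:
    struct Impl;                  // heavy members stay out of the header
    std::unique_ptr<Impl> impl_;
};

// ---- device.hpp: included only by the .cpp files that need it ----
class Device {
public:
    explicit Device(std::string name) : name_(std::move(name)) {}
    const std::string& name() const { return name_; }

private:
    std::string name_;
};

// ---- program.cpp: compiled once instead of once per includer ----
struct Program::Impl {
    int id;
};

Program::Program(int id) : impl_(std::make_unique<Impl>(Impl{id})) {
    assert(id >= 0);  // illustrative precondition
}

Program::~Program() = default;

std::string Program::describe(const Device& device) const {
    return "program " + std::to_string(impl_->id) + " on " + device.name();
}
```

With this layout, a translation unit that only passes `Program` around no longer pulls in `device.hpp` transitively. Whether this pattern fits a specific tt-metal header is a judgment call that the thread does not settle.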

There are more opportunities on the TT-NN infra side that are still TBD, such as a review of templates.
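One standard C++ tool that a templates review might consider is explicit instantiation: an `extern template` declaration in the header tells includers not to instantiate a given specialization themselves, and a single .cpp file provides the one instantiation. The sketch below assumes a hypothetical `mean` function template. It is not taken from the tt-metal code base, and the thread does not say this technique was used.

```cpp
// Sketch only: `mean` is a hypothetical template used for illustration.
// The two sections would normally be separate files.
#include <cassert>
#include <vector>

// ---- mean.hpp ----
template <typename T>
double mean(const std::vector<T>& values) {
    assert(!values.empty());  // precondition: at least one element
    double sum = 0.0;
    for (const T& v : values) {
        sum += static_cast<double>(v);
    }
    return sum / static_cast<double>(values.size());
}

// Explicit instantiation declaration: translation units that include this
// header rely on the instantiation of mean<int> provided elsewhere.
extern template double mean<int>(const std::vector<int>&);

// ---- mean.cpp ----
// Explicit instantiation definition: mean<int> is instantiated here, once.
template double mean<int>(const std::vector<int>&);
```

Callers use `mean(std::vector<int>{1, 2, 3})` exactly as before. The saving comes from not repeating instantiation and code generation in every translation unit, although each includer still has to parse the template itself.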

ayerofieiev-tt commented 1 month ago

Got to 8-9 minutes now and pushing more changes.

https://github.com/tenstorrent/tt-metal/actions/runs/10221925271

Screenshot 2024-08-02 at 3 39 47 PM