numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

2023-12-13 00:27:07,476 INFO [Executor task launch worker for task 577 #205

Open torvalds-dev-testbot[bot] opened 10 months ago

torvalds-dev-testbot[bot] commented 10 months ago

2023-12-13 00:27:07,477 INFO [Executor task launch worker for task 577.0 in stage 45181.0 (TID 16233208)] org.apache.hudi.common.util.collection.ExternalSpillableMap:Estimated Payload size => 2504
2023-12-13 00:27:07,478 INFO [Executor task launch worker for task 577.0 in stage 45181.0 (TID 16233208)] org.apache.hudi.common.util.collection.ExternalSpillableMap:New Estimated Payload size => 2845
2023-12-13 00:27:09,814 INFO [producer-thread-1] org.apache.hudi.common.util.queue.IteratorBasedQueueProducer:starting to buffer records
2023-12-13 00:27:09,821 INFO [consumer-thread-1] org.apache.hudi.common.util.queue.BoundedInMemoryExecutor:starting consumer thread
2023-12-13 00:27:09,855 INFO [producer-thread-1] org.apache.hudi.common.util.queue.IteratorBasedQueueProducer:starting to buffer records
2023-12-13 00:27:09,918 INFO [consumer-thread-1] org.apache.hudi.common.util.queue.BoundedInMemoryExecutor:starting consumer thread
2023-12-13 00:27:19,120 INFO [Executor task launch worker for task 577.0 in stage 45181.0 (TID 16233208)] org.apache.hudi.io.HoodieMergeHandle:Number of entries in MemoryBasedMap => 920285, Total size in bytes of MemoryBasedMap => 2618210917, Number of entries in BitCaskDiskMap => 0, Size of file spilled to disk => 0
2023-12-13 00:27:19,120 INFO [Executor task launch worker for task 577.0 in stage 45181.0 (TID 16233208)] org.apache.hudi.io.HoodieMergeHandle:partitionPath:tenant=aaaaaa/date=20231213, fileId to be merged:3d4538da-9810-445e-84ef-63b03719092b-0
2023-12-13 00:27:19,134 INFO [Executor task launch worker for task 577.0 in stage 45181.0 (TID 16233208)] org.apache.hudi.io.HoodieMergeHandle:Merging new data into oldPath <s3://some-s3-bucket/hudi/visibility=private/schema=scwx.process/tenant=aaaaaa/date=20231213/3d4538da-9810-445e-84ef-63b03719092b-0_616-45168-16228077_20231213002302278.parquet>, as newPath <s3://some-s3-bucket/hudi/visibility=private/schema=scwx.process/tenant=aaaaaa/date=20231213/3d4538da-9810-445e-84ef-63b03719092b-0_577-45181-16233208_20231213002634231.parquet>
2023-12-13 00:27:19,326 INFO [producer-thread-1] org.apache.hudi.common.util.queue.IteratorBasedQueueProducer:finished buffering records
2023-12-13 00:27:19,330 INFO [consumer-thread-1] org.apache.hudi.common.util.queue.BoundedInMemoryExecutor:Queue Consumption is done; notifying producer threads
2023-12-13 00:27:19,457 INFO [Executor task launch worker for task 577.0 in stage 45181.0 (TID 16233208)] org.apache.hudi.table.marker.DirectWriteMarkers:Creating Marker Path=<s3://some-s3-bucket/hudi/visibility=private/schema=scwx.process/.hoodie/.temp/20231213002634231/tenant=aaaaaa/date=20231213/3d4538da-9810-445e-84ef-63b03719092b-0_577-45181-16233208_20231213002634231.parquet.marker.MERGE>
2023-12-13 00:27:19,524 INFO [Executor task launch worker for task 577.0 in stage 45181.0 (TID 16233208)] org.apache.hudi.table.marker.DirectWriteMarkers:[direct] Created marker file <s3://some-s3-bucket/hudi/visibility=private/schema=scwx.process/.hoodie/.temp/20231213002634231/tenant=aaaaaa/date=20231213/3d4538da-9810-445e-84ef-63b03719092b-0_577-45181-16233208_20231213002634231.parquet.marker.MERGE> in 389 ms
2023-12-13 00:27:19,615 INFO [producer-thread-1] org.apache.hudi.common.util.queue.IteratorBasedQueueProducer:starting to buffer records
2023-12-13 00:27:19,616 INFO [consumer-thread-1] org.apache.hudi.common.util.queue.BoundedInMemoryExecutor:starting consumer thread
2023-12-13 00:27:25,880 INFO [producer-thread-1] org.apache.hudi.common.util.queue.IteratorBasedQueueProducer:finished buffering records
2023-12-13 00:27:25,880 INFO [consumer-thread-1] org.apache.hudi.common.util.queue.BoundedInMemoryExecutor:Queue Consumption is done; notifying producer threads
2023-12-13 00:27:32,378 INFO [producer-thread-1] org.apache.hudi.common.util.queue.IteratorBasedQueueProducer:finished buffering records
2023-12-13 00:27:32,381 INFO [consumer-thread-1] org.apache.hudi.common.util.queue.BoundedInMemoryExecutor:Queue Consumption is done; notifying producer threads
2023-12-13 00:29:13,455 INFO [Executor task launch worker for task 577.0 in stage 45181.0 (TID 16233208)] org.apache.hudi.io.HoodieMergeHandle:MergeHandle for partitionPath tenant=aaaaaa/date=20231213 fileID 3d4538da-9810-445e-84ef-63b03719092b-0, took 125979 ms.
torvalds-dev[bot] commented 10 months ago

torvalds.dev is analyzing the ticket

torvalds-dev[bot] commented 10 months ago

The log provided is from Apache Hudi, a data management framework used to manage large analytical datasets on top of distributed storage systems like Hadoop HDFS or cloud object stores.

The log shows the process of merging new data into an existing dataset. Here's a breakdown of the process:

  1. The system starts by setting the maximum memory per partition merge to 4294967296 bytes (4 GiB).
  2. It then estimates the payload size to be 2504 bytes, which is later updated to 2845 bytes.
  3. The producer thread starts buffering records, and the consumer thread starts consuming them. The log shows three such start pairs (at 00:27:09,814, 00:27:09,855, and 00:27:19,615) and three completion pairs (at 00:27:19,326, 00:27:25,880, and 00:27:32,378).
  4. The system then logs the state of its record maps: the MemoryBasedMap holds 920285 entries totaling 2618210917 bytes (about 2.44 GiB), while the BitCaskDiskMap holds 0 entries and 0 bytes have been spilled to disk.
  5. The system identifies the partition path and the file ID to be merged.
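  6. The system merges the new data with the existing file (oldPath) and writes the result to a new Parquet file (newPath) in the same partition.
  7. The system creates a MERGE marker file for the new file under .hoodie/.temp/20231213002634231/, which takes 389 ms.
  8. The merge handle for the partition path and file ID takes 125979 ms (about 126 seconds) to complete.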
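
The memory figures in steps 1, 2, and 4 can be cross-checked with a little arithmetic. The following is a minimal Python sketch, not Hudi code: the constant names and the `summarize` helper are hypothetical, and the numbers are copied from the log and from the limit quoted in step 1.

```python
# Values copied from the log and the analysis above.
MAX_MERGE_BYTES = 4_294_967_296   # per-partition merge limit quoted in step 1
MAP_BYTES = 2_618_210_917         # "Total size in bytes of MemoryBasedMap"
MAP_ENTRIES = 920_285             # "Number of entries in MemoryBasedMap"
ESTIMATED_PAYLOAD_BYTES = 2845    # "New Estimated Payload size"

GIB = 1024 ** 3  # bytes per gibibyte


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def summarize() -> dict:
    """Derive a few ratios from the logged values (hypothetical helper)."""
    return {
        "limit_gib": to_gib(MAX_MERGE_BYTES),
        "map_gib": to_gib(MAP_BYTES),
        "fraction_of_limit": MAP_BYTES / MAX_MERGE_BYTES,
        "avg_bytes_per_entry": MAP_BYTES / MAP_ENTRIES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.4f}")
```

Under these numbers the in-memory map uses roughly 61% of the quoted limit, and the average size per entry lands within a byte of the 2845-byte payload estimate. That is consistent with the map size being tracked as entries times estimated payload size, although the log alone does not confirm how the total is computed.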

The files relevant to this issue are:

  1. IOUtils.java: This file contains the IOUtils class, whose methods calculate the maximum memory allowed for merge and compaction tasks, including the per-partition limits for each.

  2. TestSparkIOUtils.java: This file contains the TestSparkIOUtils class, which includes tests for the methods in the IOUtils class.

Suggested actions:

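  1. Check the memory configuration: The log shows that the system is setting the maximum memory per partition merge to 4 GiB. The MemoryBasedMap reports about 2.44 GiB in use and nothing was spilled to disk, so the in-memory map stayed within its limit during this merge. If there are memory-related issues, you might want to adjust this setting.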

  2. Monitor the producer and consumer threads: The log shows three producer/consumer start pairs and three completion pairs between 00:27:09 and 00:27:32. If there are performance issues, you might want to monitor these threads to see if they are running as expected.

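  3. Check the merge process: The log shows that the merge handle for a specific partition path and file ID takes 125979 ms (just over 2 minutes) to complete. About 101 seconds of that falls between the last "Queue Consumption is done" message at 00:27:32,381 and the final MergeHandle message at 00:29:13,455, with no log output in between. If the merge is slower than expected, that interval is the first place to investigate.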
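
To see where the two minutes go, the log timestamps can be turned into per-phase durations. The sketch below is plain Python, not part of Hudi; `parse_ts`, `elapsed_ms`, `phase_durations`, and the event labels are hypothetical names chosen for illustration. It assumes that the timestamp in the issue title (00:27:07,476) marks the start of the merge handle, which the log does not state explicitly.

```python
from datetime import datetime, timedelta

# Timestamp layout of the log lines in this issue, e.g. "2023-12-13 00:27:07,476".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# (label, timestamp) pairs copied from the log; labels are illustrative only.
EVENTS = [
    ("first line (issue title)", "2023-12-13 00:27:07,476"),
    ("in-memory map stats logged", "2023-12-13 00:27:19,120"),
    ("last queue consumption done", "2023-12-13 00:27:32,381"),
    ("MergeHandle finished", "2023-12-13 00:29:13,455"),
]


def parse_ts(text: str) -> datetime:
    """Parse a log timestamp with comma-separated milliseconds."""
    return datetime.strptime(text, LOG_TS_FORMAT)


def elapsed_ms(start: str, end: str) -> int:
    """Whole milliseconds between two log timestamps."""
    return (parse_ts(end) - parse_ts(start)) // timedelta(milliseconds=1)


def phase_durations(events):
    """Duration in ms between each pair of consecutive events."""
    return [
        (f"{a_label} -> {b_label}", elapsed_ms(a_ts, b_ts))
        for (a_label, a_ts), (b_label, b_ts) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for label, ms in phase_durations(EVENTS):
        print(f"{ms:>7} ms  {label}")
    print(f"{elapsed_ms(EVENTS[0][1], EVENTS[-1][1]):>7} ms  total")
```

With the timestamps above, the three phases sum to the 125979 ms that HoodieMergeHandle reports, which supports the assumed start time. The final phase, after the last queue is drained, accounts for most of the total.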