grc-iit / ChronoLog

ChronoLog: A High-Performance Storage Infrastructure for Activity and Log Workloads
https://chronolog.dev
BSD 2-Clause "Simplified" License

157 chunk stuck orphan queue #162

Closed ibrodkin closed 1 month ago

ibrodkin commented 1 month ago

This PR contains changes for several issues that were uncovered while investigating the stuck StoryChunks:

  1. The original issue, an ingested StoryChunk stuck forever on the ChronoGrapher's orphanQueue instead of being collected by the appropriate StoryPipeline, turned out to be a timing issue. When the story-producing client is short-lived and sends the Story Release request as soon as it finishes generating events, the ChronoGrapher retires the corresponding StoryPipeline before it receives the partial StoryChunk. Extending the ChronoGrapher acceptance window fixes this. The default acceptance window for ChronoGrapher is 300 secs, hardcoded for now; it will be made configurable as part of Issue #155.
  2. Once (1) was handled, a StoryChunk memory corruption issue was exposed. GrapherRecordingServiceRDMA was creating the deserialized StoryChunk object on the stack (as a local variable in the recording function) and then passing a pointer to that local StoryChunk to the IngestionQueue for further processing. The local StoryChunk was not guaranteed to still be alive by the time the DataStore operated on this pointer. The RecordingService should instead create the StoryChunk on the heap and release ownership of the partial StoryChunk pointer to the IngestionQueue; ownership then passes to the StoryPipeline, which merges the partial StoryChunk into the pipeline and frees the memory afterwards.
  3. I've added uniform debug messages throughout the code to track the accumulation, merging, and progress of individual Stories and StoryChunks through the ChronoGrapher DataStore.
  4. The StoryChunk merging logic needed plenty of changes. They are in the PR for issue #125 and are also included in this PR.
  5. CSVFileExtractor was mangling the csv filename, so it needed a tweak as well.
  6. The last piece in this PR is a fix for the uint64_t to uint16_t truncation of the acceptance time in StoryPipeline.h.

With all these changes, I can run multiple chrono_keepers and a chrono_grapher in my local environment and observe stories accumulating through the chrono_keepers and chrono_grapher as expected.