microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

ResourceMonitorService does not show Load Information #384

Open vdvarlamov opened 5 years ago

vdvarlamov commented 5 years ago

I have a standalone cluster, version 6.5.664.9590, on Windows Server 2019. I enabled ResourceMonitorService in the ClusterManifest:

<Section Name="ResourceMonitorService">
  <Parameter Name="InstanceCount" Value="-1" />
  <Parameter Name="IsEnabled" Value="True" />
</Section>

The stateless service runs in a Windows Docker container as an ExclusiveProcess.

I need this for two reasons:

1. Auto scaling policies do not work without it.
2. I want to view Load Information, for example (this is not my screenshot): 48866043-196c9500-edda-11e8-9e1c-f65a2a778009

SF config and app package: https://github.com/vdvarlamov/SF-test

I noticed the warning "GetResourceUsageAsyncOperation Operation returned SerializationError" and, for every container, "Application Service with service Id 0bfceb73-ab70-4f11-9b3d-023249c3ff40 ContainerStatsResponse error SerializationError" in Event Viewer under Microsoft-Service Fabric/Admin.

mimckitt commented 5 years ago

@dkkapur @athinanthny @masnider any idea on this one? Is there something we might be missing to get this to show up for @vdvarlamov ?

OlegKarasik commented 5 years ago

Hi @vdvarlamov,

You probably don't see Load Information when deploying to a local cluster because this information is sent to the Cluster Manager periodically by an agent (the default interval is 5 minutes).

You can change the interval in ClusterManifest.xml, inside the ReconfigurationAgent section:

<Section Name="ReconfigurationAgent">
  <Parameter Name="SendLoadReportInterval" Value="60" />
</Section>

Please see this answer on Stack Overflow for more details.

Here is what I can see for an empty stateless service:

image

Hope this helps.

vdvarlamov commented 5 years ago

I added the "SendLoadReportInterval" parameter, but it made no difference. Scr_2 Scr_3

There is always a warning: "GetResourceUsageAsyncOperation Operation returned SerializationError"

Any idea on this one?

OlegKarasik commented 5 years ago

Hi @vdvarlamov,

I did a small walk through the Service Fabric source code and here is what I think is happening.

The error you've mentioned:

"GetResourceUsageAsyncOperation Operation returned SerializationError"

... comes from the ProcessActivationManager::ProcessGetResourceUsage method as the result of GetResourceUsageAsyncOperation asynchronous operation (here is where it starts and here is where the error message is).

Unwinding the call chain from ProcessActivationManager::ProcessGetResourceUsage leads us to GetResourceUsageAsyncOperation::OnStart then to ApplicationService::BeginMeasureResourceUsage and then to MeasureResourceUsageAsyncOperation::OnStart.

Inside MeasureResourceUsageAsyncOperation::OnStart we can see which API call Service Fabric makes to the container host:

if (owner_.IsContainerHost)
{
  auto operation = owner_.ActivationManager.containerActivator_->BeginInvokeContainerApi(
    owner_.ContainerDescriptionObj,
    L"GET",
    L"/containers/{id}/stats?stream=false",
    L"application/json",
    L"",
    HostingConfig::GetConfig().ContainerStatsTimeout,
    [this](AsyncOperationSPtr const & operation)
    {
        this->OnContainerApiStatsCompleted(operation, false);
    },
    thisSPtr);
  this->OnContainerApiStatsCompleted(operation, true);
}

The OnContainerApiStatsCompleted method handles the operation completion:

if (!error.IsSuccess())
{
  // ...
}
else
{
  ContainerApiResponse containerApiResponse;
  error = JsonHelper::Deserialize(containerApiResponse, result);

  if (error.IsSuccess())
  {
    ContainerApiResult const & containerApiResult = containerApiResponse.Result();

    if (containerApiResult.Status() == 200)
    {
      ContainerStatsResponse containerStatsResponse;
      error = JsonHelper::Deserialize(containerStatsResponse, containerApiResult.Body());
      if (error.IsSuccess())
      {
        resourceMeasurement_.MemoryUsage = containerStatsResponse.MemoryStats_.MemoryUsage_;
        resourceMeasurement_.TotalCpuTime = containerStatsResponse.CpuStats_.CpuUsage_.TotalUsage_;
        resourceMeasurement_.TimeRead = containerStatsResponse.Read_;
      }
      else
      {
        WriteWarning(
          TraceType_ActivationManager,
          owner_.parentId_,
          "Application Service with service Id {0} ContainerStatsResponse error {1}",
          owner_.appServiceId_,
          error);
      }
    }
    else
    {
      // ...
    }
  }
  else
  {
    // ...
  }
  TryComplete(operation->Parent, error);
  return;
}

In the email thread you've mentioned one more error:

"Application Service with service Id 0bfceb73-ab70-4f11-9b3d-023249c3ff40 ContainerStatsResponse error SerializationError"

I think this is the key to what is happening. In the code above you can see that this error message is written only when the API call to the container host has succeeded with status 200 but the response body can't be deserialized into ContainerStatsResponse.

I think the problem might be a version incompatibility between the Docker version on your machine and the version of the Service Fabric cluster.

Can you try installing the latest versions of both?

vdvarlamov commented 5 years ago

I updated all components (Scr_3, Scr_4), but the errors persisted!

OlegKarasik commented 5 years ago

Hi @vdvarlamov,

I have also tried to manipulate / reinstall / etc. but the issue still persisted. It looks like a bug in Service Fabric serialization contracts.

@MicahMcKittrick-MSFT @dkkapur @athinanthny @masnider can you please help with this one?

To put it simply, the major issue is that Load Information isn't displayed because the container host always reports the following errors:

"Application Service with service Id 0bfceb73-ab70-4f11-9b3d-023249c3ff40 ContainerStatsResponse error SerializationError"

"GetResourceUsageAsyncOperation Operation returned SerializationError"

I have done a small investigation (you can see my comment above) but this just confirmed that the problem is in deserialization of container host response.

mimckitt commented 5 years ago

Thanks for that. I will start an offline thread to see if I can get someone to look into it

mimckitt commented 5 years ago

Just FYI, engineers are engaged in the offline thread.

The SerializationError occurs because Docker changed the DateTime format: the timestamp used to end with 'Z' and now ends with a time zone offset.

Old: "read": "2015-01-08T22:57:31.547920715Z" (still shown in the Docker docs: https://docs.docker.com/engine/api/v1.40/#operation/ContainerStats)
New: "read": "2019-10-18T17:44:36.5599007-07:00"

Our TryParse will return false because the last char is not 'Z':

//to support docker format
//2018-02-23T11:22:12.1630849Z
if (str.size() >= 24)
{
  if (str[10] != L'T' || str[str.size() - 1] != L'Z')
  {
    return false;
  }
}

They are working out the best way to correct this.

KennethDalgleish commented 5 years ago

Hi @MicahMcKittrick-MSFT , Just wanted to confirm we have the same issue, resulting in autoscaling of services not working. Eventlogs are riddled with ContainerStatsResponse error SerializationError events.

Any workaround until a fix is in place would be appreciated.

Edit: Forgot to mention, this is an Azure Service Fabric cluster, not on-premises. It runs on Windows 2019-1809.

Versions:
Service Fabric: 6.5.664.9590
Docker: 19.03.2, build c92ab06ed9

vdvarlamov commented 4 years ago

The Moby and Docker teams believe that everything is fine: https://github.com/moby/moby/issues/40975

They are working out the best way to correct this.