microsoft / service-fabric-services-and-actors-dotnet

Reliable Services and Reliable Actors are Service Fabric application frameworks for building highly-scalable distributed cloud applications.

Disable the actor/service etw logs #179

Open aL3891 opened 5 years ago

aL3891 commented 5 years ago

In our system, a very significant amount of IO and disk space usage is taken up by the ETW files written for each actor activation and service call.

Is there a way to prevent these from being written to disk/Azure Storage? The fabric events are fine, but the app model events just swamp our system and are never looked at anyway. Currently we're writing about 50 GB of data to storage every three days.

Ideally we'd like to use Azure Log Analytics instead, since that allows enabling and disabling listeners as well as perf counters, but then we'd like to disable the built-in ETW listeners (FabricDCA?).
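
For reference, one documented knob in the cluster's fabricSettings is the DCA disk quota; it caps what the collection agent keeps on the node rather than stopping the events from being produced or uploaded. A sketch, assuming the ARM-template fabricSettings syntax and the Diagnostics/MaxDiskQuotaInMB names from the cluster settings docs (the value is only an example):

```json
"fabricSettings": [
  {
    "name": "Diagnostics",
    "parameters": [
      { "name": "MaxDiskQuotaInMB", "value": "10240" }
    ]
  }
]
```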

juho-hanhimaki commented 5 years ago

Yep, I feel there should be an option to reduce logging at the app/service level while still keeping the cluster in a supported state.

I believe the following parameter helps with the issue, but changing it also makes the cluster unsupported, which is a shame: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-fabric-settings#traceetw
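
For reference, the setting behind that link is the Level parameter in the TraceEtw section. In an ARM template the change would look roughly like the sketch below (the value shown is only an example; per the doc, anything other than the default of 4 leaves the cluster unsupported):

```json
"fabricSettings": [
  {
    "name": "TraceEtw",
    "parameters": [
      { "name": "Level", "value": "3" }
    ]
  }
]
```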

aL3891 commented 5 years ago

I've actually set that as well, but it doesn't seem to have an effect, or there is some other parameter that also needs to be set. There are basically no docs for FabricDCA or the logging in general, but from what I've been able to guess there are "producers" of logs and "consumers".

In the settings there is a section called ServiceFabricEtlFile that is referenced by what I guess are the sections configuring the file/Azure logging, but I've found zero docs on this.. :(
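
For reference, on a local dev cluster the generated cluster manifest appears to wire this up as a DCA producer: the Diagnostics section lists ServiceFabricEtlFile under ProducerInstances, and the ServiceFabricEtlFile section configures the ETL file producer itself. Expressed in fabricSettings form it looks roughly like the sketch below; the parameter names and values are taken from a dev-cluster manifest and may vary by runtime version, so treat it as illustrative only:

```json
"fabricSettings": [
  {
    "name": "Diagnostics",
    "parameters": [
      { "name": "ProducerInstances", "value": "ServiceFabricEtlFile" }
    ]
  },
  {
    "name": "ServiceFabricEtlFile",
    "parameters": [
      { "name": "ProducerType", "value": "EtlFileProducer" },
      { "name": "IsEnabled", "value": "true" },
      { "name": "DataDeletionAgeInDays", "value": "3" }
    ]
  }
]
```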

amanbha commented 5 years ago

@aL3891 These events are required for supporting the framework, as this is our primary diagnostics story for Service Fabric. These traces are looked at by our support team on a daily basis to provide support to our customers at both the cluster level and the application level (especially when actors and service remoting are used). Changing this needs considerable redesign of the overall diagnostics story of Service Fabric, which spans Azure and on-premise (Windows and Linux), and is not planned at the moment. If you want to reduce the tracing, the cluster would be unsupported, as mentioned in the documentation. @romendmsft can provide more info about setting the trace value.

juho-hanhimaki commented 5 years ago

@amanbha That is very understandable, but I feel application-level logging should be controllable at the application level.

The cluster being healthy and supported is, in my opinion, a whole other thing than the applications that are run in it.

For us there are some applications/services that are more performance critical (and that still use the built-in remoting and reliable collections with custom serialization). I feel it's very important to be able to make these decisions at the app/service level instead of for the whole cluster. Right now it's pretty much an all-or-nothing kind of deal if one wishes to use the app model offered by SF.

aL3891 commented 5 years ago

I don't have a problem with cluster-level logging, and I'm even fine with having app model traces on when diagnosing a problem, but requiring them to be on at such a high level at all times for all services seems very unreasonable. Again, what I'd like to reduce tracing for is specifically Reliable Actors and Services, and ideally only for some services as well.

For example, we use actors for things like worker processes and session management; these have absolutely no need for that kind of diagnostics and don't even save their state. They have millions of actor activation/deactivation/call events, and you're saying that we need to log Every. Single. One. of those to get any support? Again, this seems very strange.

The reality is that we're looking at scaling back our use of Reliable Actors and Services because of this issue, in favor of ASP.NET (which manages to be supported by Microsoft without logging every request to disk and Azure Storage, while being far more widely adopted than Service Fabric).

This, coupled with the complete lack of support in the next evolution of your own product, SF Mesh, makes it very hard to recommend Reliable Services/Actors to anyone.

I am sorry, but this is quite frustrating. Especially since we have not once been asked about ETW files while interacting with the SF team or support in all the years we've used SF (starting when SF was in private preview).

amanbha commented 5 years ago

@aL3891 , @juho-hanhimaki For remoting, many customers don't add enough logging to their API calls, so it's very difficult to see what happened in a production environment at any particular time, and we do get incidents where we have to root cause such issues. The same is true for Reliable Collections and Actors. Although you are writing your own services, the SF product team provides support for the components that are part of your service process. I understand your concern that there should be a way to control this at the app level, but that would make it difficult to provide support for those components. It's good feedback, and I will pass it on to the diagnostics team: provide control of logging for the app process at the app layer (but it would mean no support from the product team if the app developer chooses to turn off the events), with logging on by default.

@aL3891 These etl files, although never asked for directly from customers, are uploaded to Azure Storage (for clusters in Azure) for access by our team when support issues are filed, so these files are used for support. For on-premise clusters we do ask for these files. I understand your concerns about SF Mesh, but it's a different model altogether in which you deploy containers; Actors and Reliable Services interact with the SF runtime environment, and the SF runtime environment is not exposed inside the containers in Mesh, as it will be a multi-tenant service. There will be some sort of actor model and persistent storage (details to be sorted out). If you have concerns or feedback about the SF Mesh programming models, I would encourage you to raise them in the Service Fabric Mesh GitHub repo; this is the relevant thread: https://github.com/Azure/service-fabric-mesh-preview/issues/295

juho-hanhimaki commented 5 years ago

Yep, all I am hoping for is more control. I might be fine with some performance-critical parts of our applications not receiving Microsoft support.

Generally I hope that Service Fabric reaches such maturity that support is not (at least often) required to root cause application-level issues (bad, buggy code). But maybe that's not realistic, as Service Fabric adds some complexity and the documentation often doesn't really cover everything.

aL3891 commented 5 years ago

@amanbha But surely you ask before accessing customers' files, even on Azure, correct? We have never been asked about this. The point about Mesh I've made elsewhere already, since the first private preview of Mesh that we were part of. You could absolutely make the app model work, or at least have a migration path or even a roadmap, but that has not been a priority.

I think it's completely unreasonable to not even have a way to disable app model logging just because some customers don't have logging in their apps. I have it, so why am I impacted by the actions of some other customer? Are you aware of any other app framework that does this? Again I'd draw the parallel to ASP.NET: it is supported by Microsoft, it's frankly far larger in both scope and adoption than Service Fabric, and it has much more flexible logging. I don't even mind having logging, I just want to do it in a more efficient manner, like with Azure Log Analytics.

I'm completely fine with some services not being supported; what I think is really unfair, though, is to designate the entire cluster as unsupported because I don't log every actor activation. You don't have those logs if I'm using another app model or another actor framework, yet those clusters are supported. There needs to be a separation there.

I'd also note that dev/test clusters, as well as my own local dev box, have these restrictions in place even though they are not supported anyway, yet the app model logs can't be disabled there either as far as I can tell. This really inflates the cost of dev clusters, since you have to use much larger VMs to get the temp storage needed to store all those logs. We've rewritten the template to use managed disks, but still, a lot of people won't have the time and resources to do that.

amanbha commented 5 years ago

Since Service Fabric provides both the data and communication layers, customers expect us to support issues which happened in production in the past. I hear your feedback, but unfortunately this cannot be disabled at the application level at the moment; as mentioned above, I will share it with the diagnostics team. Regarding Mesh, I would encourage you to please voice your concerns in the issue I linked above or in the Mesh repo.

aL3891 commented 5 years ago

Alright, then we know. Thank you for the information.

juho-hanhimaki commented 5 years ago

This is still relevant to us. We run a highly optimized IoT service (optimized for IoT device data volume). We still use the built-in remoting to move data between services, but because we have no control over the diagnostics volume generated, we might need to replace the remoting as a whole (which seems like a waste of resources, because otherwise remoting works well).

We have millions of remoting calls per day, caused by IoT data sent from devices, and it makes little sense to log each and every one of those calls. Yet we need to keep the telemetry enabled at level 4 for the whole of Service Fabric (including the remoting) to receive any support. It makes no sense from a money and resource usage perspective.