pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0
4.06k stars 824 forks source link

Add a new option to enable appending -javaagent options #2717

Open thatsdone opened 8 months ago

thatsdone commented 8 months ago

🚀 The feature

A new option to torchserve command to allow specifying javaagent to be passed to JVM.

The below is an example how the option could be used. Please note that ","(comma) is used as separator to allow specifying multiple javaagents . (Thus, the suggested new option is '-javaagents' not '-javaagent'

--javaagents PATH_TO/opentelemetry-javaagent.jar,PATH_TO/jmx_prometheus_javaagent.jar=8090:PATH_TO/prometheus.yaml

In case of the above example, 2 javaagents, OpenTelemetry Java instrumentation and JMX exporter, are specified.

Motivation, pitch

I'm working on End-to-End system observability, and I think by having the option, we we can improve observability by using existing distributed tracing ecosystem such as OpenTelemetry. In case of OpenTelemetry Java Instrumentation, it's automatic instrumentation and there are no needs to modify TorchServe code.

Furthermore, as '-javaagent' is a generic option of JVM, the option can be utilized for other purposes.

Alternatives

Without this option, it's possible, but we need to modify TorchServe image.

Additional context

I think that the implementation would be simple. Adding code to append '-javaagent' (and values) around L147 of ts/torch_server.py (and corresponding option definition in ts/arg_parsers.py).

https://github.com/pytorch/serve/blob/master/ts/model_server.py#L147

144
145         cmd.append("-cp")
146         cmd.append(class_path)
147          # HERE, '--javaagents' arguments can be parsed and appended.
148         cmd.append("org.pytorch.serve.ModelServer")
149
150         # model-server.jar command line parameters
msaroufim commented 8 months ago

I'd be open to that, cc @lxning what do you think?

msaroufim commented 8 months ago

@thatsdone would you be open to making this PR and showing us what's possible with it?

thatsdone commented 8 months ago

Hi @msaroufim Yes, I can work on this feature. Please assign this issue to me (and give me some time as I'm a bit busy now...haha). Best regards, Masanori

lxning commented 8 months ago

@thatsdone Thank you for the proposal. OpenTelemetry is a very useful feature to TorchServe customers. I have been thinking of adding this feature for a while. It would be great if you want to contribute.

However, I'm not clear your use case which requests to enable OpenTelemetry or Prometheus agent in TorchServe. These agents usually run in separate daemon or even remote.

thatsdone commented 8 months ago

@lxning The high level use case that I'm working on now is a kind of image processing system that receives (movie or still) image data and does inferencing using several ML models. It runs on top of k8s cluster, and TorchServe offers inference function at the very bottom layer of the system. My motivation is observability of the entire end-to-end workflow including the most bottom layer.

However, I'm not clear your use case which requests to enable OpenTelemetry or Prometheus agent in TorchServe. These agents usually run in separate daemon or even remote.

Ah, in case of add-on functions provided by so called 'javaagent', JVM loads the additional jar (specified by the '-javaagent' option) into its address space, and the codes run within the same address space with the JVM.

I guess you are thinking about so called OpenTelemetry Collector. This is intended for kind of tiering solution for data collection of distributed tracing.

Process #1                                                Process #2                    Process #3
[(Java Application + OTEL Java Instrumentation)/JVM ] ==> [OpenTelemetry Collector] ==> [Trace server (e.g. Jaeger)]

Under k8s environment, OpenTelemetry Operator is often used to inject the Collector (process#2 above) into each pod as a sidecar container, and normally Trace server runs remotely.

I hope I'm enough clear.

P.S. Please note the structure above is language runtime dependent. In case of Go, for example, golang auto-instrumentation uses eBPF and runs as a separate process.