triton-inference-server / model_analyzer

Triton Model Analyzer is a CLI tool to help with better understanding of the compute and memory requirements of the Triton Inference Server models.
Apache License 2.0

Latency report file from perf analyzer is not found #937

Open shreyassks opened 1 month ago

shreyassks commented 1 month ago

While profiling, the checkpoints are stored correctly, but the intermediate latency report files generated by perf analyzer are not found, which causes the error below.

This is my Job.yaml file

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-model-analyzer
  labels:
    app: {{ .Release.Name }}-model-analyzer
    chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    release: {{ .Release.Name }}
spec:
  backoffLimit: 5
  activeDeadlineSeconds: {{ .Values.jobTimeout }}
  template:
    spec:
      shareProcessNamespace: true
      restartPolicy: OnFailure
      volumes:
        - hostPath:
            path: /s/k3/
            type: Directory
          name: scratch
        - configMap:
            name: analyzer-config
          name: config
      terminationGracePeriodSeconds: 1800
      containers:
      - name: analyzer
        image: {{ .Values.images.analyzer.image }}
        imagePullPolicy: IfNotPresent
        securityContext:
            privileged: true
        command: ["/bin/bash", "-c"]
        args: [
          "model-analyzer profile
          --model-repository /models 
          --output-model-repository /output_models/output 
          --checkpoint-directory /checkpoints/
          --triton-launch-mode local -f /config/config.yaml 
          && model-analyzer analyze -e /results/ 
          --checkpoint-directory /checkpoints/ 
          -f /config/config.yaml 
          && model-analyzer report -e /results/
          --checkpoint-directory /checkpoints/ 
          -f /config/config.yaml"]
        volumeMounts:
            - name: scratch
              mountPath: /results
              subPath: results
            - name: scratch
              mountPath: /models
              subPath: models
            - name: scratch
              mountPath: /output_models
              subPath: output-models
            - name: scratch
              mountPath: /checkpoints
              subPath: checkpoints
            - name: config
              mountPath: /config
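
Since every directory the job writes to is a hostPath subPath mount, one quick sanity check is whether those mounts are actually writable from inside the container. A minimal probe script (a hypothetical helper, not part of Model Analyzer) could look like this:

```shell
#!/bin/bash
# Touch a probe file in each directory given as an argument and report
# whether it is writable from this container.
check_writable() {
  local dir="$1"
  if touch "${dir}/.ma_probe" 2>/dev/null; then
    rm -f "${dir}/.ma_probe"
    echo "${dir}: writable"
  else
    echo "${dir}: NOT writable"
  fi
}

for d in "$@"; do
  check_writable "$d"
done
```

For example, `kubectl exec <pod> -- bash mount_check.sh /results /checkpoints /output_models "$PWD"` would show whether the mounts and the current working directory accept writes.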

The error log is attached below


Matplotlib created a temporary cache directory at /tmp/matplotlib-2qa7cu2g because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[Model Analyzer] Initializing GPUDevice handles
[Model Analyzer] Using GPU 0 Tesla T4 with UUID GPU-d3e4feb1-072a-c3ab-86b7-fb577104ef77
[Model Analyzer] WARNING: Overriding the output model repo path "/output_models/output"
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Loaded checkpoint from file /checkpoints/2.ckpt
[Model Analyzer] GPU devices match checkpoint - skipping server metric acquisition
[Model Analyzer]
[Model Analyzer] Starting Optuna mode search to find optimal configs
[Model Analyzer] [I 2024-10-08 14:29:18,355] A new study created in memory with name: bge-small-en-v1.5-onnx
[Model Analyzer] Measuring default configuration to establish a baseline measurement
[Model Analyzer] Creating model config: bge-small-en-v1.5-onnx_config_default
[Model Analyzer]
[Model Analyzer] Profiling bge-small-en-v1.5-onnx_config_default: concurrency=16
[Model Analyzer] Saved checkpoint to /checkpoints/3.ckpt
Traceback (most recent call last):
  File "/usr/local/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/entrypoint.py", line 278, in main
    analyzer.profile(
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 131, in profile
    self._profile_models()
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 251, in _profile_models
    self._model_manager.run_models(models=[model])
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/model_manager.py", line 154, in run_models
    measurement = self._metrics_manager.execute_run_config(run_config)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/record/metrics_manager.py", line 245, in execute_run_config
    measurement = self.profile_models(run_config)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/record/metrics_manager.py", line 278, in profile_models
    perf_analyzer_metrics, model_gpu_metrics = self._run_perf_analyzer(
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/record/metrics_manager.py", line 618, in _run_perf_analyzer
    status = perf_analyzer.run(metrics_to_gather, env=perf_analyzer_env)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/perf_analyzer/perf_analyzer.py", line 235, in run
    self._parse_outputs(metrics)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/perf_analyzer/perf_analyzer.py", line 535, in _parse_outputs
    self._parse_generic_outputs(metrics)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/perf_analyzer/perf_analyzer.py", line 551, in _parse_generic_outputs
    with open(perf_config["latency-report-file"], mode="r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'bge-small-en-v1.5-onnx-results.csv'

nv-braf commented 1 month ago

These files are written when PA terminates the profile, then read and deleted by MA. If the file doesn't exist, that indicates PA had some sort of issue/failure. Can you please include the log from the PA run? Thank you.
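
One way to capture that log is to exec into the analyzer pod and run perf_analyzer by hand against the locally launched server. The flags below are assumptions reconstructed from the Model Analyzer output (model name, concurrency=16, local endpoint); adjust the URL and protocol to your setup:

```shell
# Hypothetical manual PA run to capture its log; the flag values are
# assumptions based on the Model Analyzer output in this issue.
MODEL="bge-small-en-v1.5-onnx"
REPORT="/tmp/${MODEL}-results.csv"
PA_CMD="perf_analyzer -m ${MODEL} -u localhost:8001 -i grpc --concurrency-range 16:16 -f ${REPORT}"
echo "run inside the pod: ${PA_CMD}"
# If PA also fails to write ${REPORT} there, its stdout/stderr is the
# log worth attaching to this issue.
```

If the manual run produces the CSV but the MA-driven run does not, the problem is likely in how MA launches PA (environment, working directory) rather than in PA itself.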

shreyassks commented 1 month ago

I enabled the flags below for perf analyzer, but no logs are written to this file

    perf_output: true
    perf_output_path: /results/output.txt
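
One detail worth noting: the missing file in the traceback is a relative path (`bge-small-en-v1.5-onnx-results.csv`), so PA is trying to create it in the process's current working directory, and the Matplotlib warning about `/.config` not being writable hints that the container's default directories may be read-only. A sketch of a possible workaround, under the assumption that a non-writable working directory is the cause (`/results` stands in for your writable scratch mount; `WORKDIR` is a stand-in used here so the sketch runs outside the pod):

```shell
# Sketch: switch to a writable working directory before launching
# model-analyzer, so relative output files (and the Matplotlib cache)
# can be created. In the Job, WORKDIR would be /results.
WORKDIR="${WORKDIR:-/tmp/ma-debug}"
mkdir -p "${WORKDIR}/.mpl"
cd "${WORKDIR}"
export MPLCONFIGDIR="${WORKDIR}/.mpl"  # also silences the Matplotlib warning
touch probe.csv                        # verify relative paths are creatable here
if command -v model-analyzer >/dev/null; then
  model-analyzer profile --model-repository /models \
    --output-model-repository /output_models/output \
    --checkpoint-directory /checkpoints/ \
    --triton-launch-mode local -f /config/config.yaml
fi
```

In the Job.yaml above this would amount to prefixing the `args` string with `cd /results && export MPLCONFIGDIR=/results/.mpl && ...` before the first `model-analyzer profile` invocation.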