google / ml-metadata

For recording and retrieving metadata associated with ML developer and data scientist workflows.
https://www.tensorflow.org/tfx/guide/mlmd
Apache License 2.0

Extremely slow performance using remote mlmd instance #157

Closed. htahir1 closed this issue 1 year ago.

htahir1 commented 2 years ago

Hi! Our team has been trying to use ml-metadata to run pipelines using a Cloud SQL backend. However, we have run into performance issues. Let's say my metadata connection config looks like this:

connection_config {
  mysql {
    host: '34.79.128.231'
    port: 3306
    database: 'my_database'
    user: 'root'
    password: '***'
    ssl_options {
      key: 'client-key.pem'
      cert: 'client-cert.pem'
      ca: 'server-ca.pem'
      capath: '/'
      verify_server_cert: false
    }
    skip_db_creation: false
  }
}

When I run a TFX pipeline with the above config, there is nearly a 60 second wait between components. This seems to be related to the connection to the remote database.
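
For reference, here is roughly the same connection config built from Python with the MLMD client (a minimal sketch; the values are placeholders taken from the config above, and the field names are assumed from metadata_store_pb2):

# Minimal sketch of constructing the MySQL connection config in Python.
# All values are placeholders copied from the config shown above.
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.mysql.host = '34.79.128.231'
connection_config.mysql.port = 3306
connection_config.mysql.database = 'my_database'
connection_config.mysql.user = 'root'
connection_config.mysql.password = '***'
connection_config.mysql.ssl_options.key = 'client-key.pem'
connection_config.mysql.ssl_options.cert = 'client-cert.pem'
connection_config.mysql.ssl_options.ca = 'server-ca.pem'
connection_config.mysql.ssl_options.capath = '/'
connection_config.mysql.ssl_options.verify_server_cert = False
connection_config.mysql.skip_db_creation = False

# Each MetadataStore call issues queries against the remote MySQL instance,
# so network latency to the public IP applies to every call.
store = metadata_store.MetadataStore(connection_config)
store.get_artifacts()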

We have also tried running through the gRPC server rather than connecting to the DB directly, but with the same result. On the other hand, when using the internal Kubeflow MLMD, or Kubeflow (Vertex) with Cloud SQL in the same VPC, performance is fast as expected. The problem only appears when running the pipeline locally and connecting to a public IP like 34.79.128.231.
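
For completeness, the gRPC variant we tried looked roughly like this (host and port are placeholders for the metadata gRPC server's address):

# Sketch of connecting through the MLMD gRPC server instead of MySQL directly.
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

client_config = metadata_store_pb2.MetadataStoreClientConfig()
client_config.host = '34.79.128.231'  # placeholder address of the gRPC metadata server
client_config.port = 8080             # placeholder port
store = metadata_store.MetadataStore(client_config)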

Is there a way to solve this problem? We would like to use ml-metadata independently of Kubeflow or Vertex.

BrianSong commented 2 years ago

I am not sure whether the "near 60 second waiting time between running components" is caused by ml-metadata, considering that ml-metadata works as expected in Vertex.

I saw you also raised an issue in TFX saying that this is more of an issue with the launcher.Launch logic in TFX. Maybe let's wait to see how the TFX folks respond to this; I am happy to help if needed.

htahir1 commented 2 years ago

Yes @BrianSong, sorry about my rushed report here. I think the problem is more about the way MLMD is utilized rather than a performance problem in the library itself. I am also happy to close this issue and keep the other issue open to track progress.