opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means, or linear regression, to help developers build ML related features within OpenSearch.
Apache License 2.0
99 stars 136 forks source link

[Feature/multi_tenancy] Implement client.bulk in SDK Client #3192

Closed dbwiddis closed 4 days ago

dbwiddis commented 3 weeks ago

Description

Adds an SDKClient equivalent to client.bulk().

Summary of changes (helpful for reviewers):

  1. Added a new DataObjectRequest parent class. I originally started with a tagging interface just to help method signatures, but realized that four classes duplicated the same code for the index, id, and tenantId fields, so refactored those (and the associated builder code) into the superclass. Existing unit tests cover this change. https://github.com/opensearch-project/ml-commons/pull/3192/commits/caeea219cef12a70d344f2639f17f13ac9a62956
  2. Added a new DataObjectResponse parent class similar to above, refactoring the id and parser fields from the subclass. Existing unit tests cover this change. https://github.com/opensearch-project/ml-commons/pull/3192/commits/445936bcfa23410778cc024a485b69b307a44c62 (part 1)
  3. Added new BulkDataObjectRequest and BulkDataObjectResponse classes and unit tests. Added a failed field to the response superclass to track failures/errors. https://github.com/opensearch-project/ml-commons/pull/3192/commits/445936bcfa23410778cc024a485b69b307a44c62 (part 2)
  4. Added the interface and stubbed implementations for the SdkClient .bulk() method, and unit tests https://github.com/opensearch-project/ml-commons/pull/3192/commits/22dccf4af0c41f07625ffe6857fd17df80b0bf79
  5. Implemented the new interface method in all 3 clients. The local and remote implementations use the OpenSearch bulk request natively. The DynamoDB implementation simply iterates the bulk request and calls existing code for the individual requests; this is not the most efficient but necessary to preserve ordering. If all the bulk requests are put and delete, it's possible to optimize this with DDB BulkWrite, which is a good future optimization; however since our primary use case uses update I'm deferring that complexity. https://github.com/opensearch-project/ml-commons/pull/3192/commits/ee30d210818a84b89027860e5c3a0ac7762a950b
  6. Additional refactoring in response to code review: move the details of ingest_took to a lower level https://github.com/opensearch-project/ml-commons/pull/3192/commits/7f190785408ccbe60899eb6dc45fa78bcafd1033
  7. More parameters in the sdkclient request/response to handle the needed implementation https://github.com/opensearch-project/ml-commons/pull/3192/commits/05a64d1553bc8b60be278d16b59bad488344a972
  8. Implementation and tests https://github.com/opensearch-project/ml-commons/pull/3192/commits/e2a5ec638ef83265bc751bd788af77c78d72b746

In addition there are several whitespace / import reordering changes due to spotless not having been run on the common module.

Will be handled integ tests in #2818

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

dbwiddis commented 2 weeks ago

@dhrubo-os @arjunkumargiri This is ready for (re-) review.