opensearch-project / opensearch-hadoop

Apache License 2.0

Performance Improvement compared to Elasticsearch #500

Open susasidharan opened 3 months ago

susasidharan commented 3 months ago

What is the bug?

Performed A/B testing comparing OpenSearch index data ingestion from Databricks using elasticsearch-spark-30_2.12-8.6.0.jar vs. opensearch-spark-30_2.12-1.0.1.jar. The test using the OpenSearch Spark connector took 2-3 times as long as the test using the Elasticsearch Spark connector.

How can one reproduce the bug?

Test 1: Create 10 separate OpenSearch indices (same schema) with parent/child records. Run the insert or update operations into the 10 indices in parallel from Databricks using the Elasticsearch Spark connector first and record the timings. Then repeat with the OpenSearch Spark connector and record the timings.

Test 2: Create one OpenSearch index. Run insert/update operations from Databricks using the Elasticsearch Spark connector and note the timings. Then repeat with the OpenSearch Spark connector and note the timings.
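For anyone trying to reproduce this, here is a minimal sketch of how the timings for the two connectors could be recorded and compared. It is not the harness used for the attached results. The names `time_runs`, `compare`, and `slowdown` are hypothetical helpers, and each `write_fn` is assumed to be a zero-argument callable that wraps one complete insert/update job for a single connector.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def time_runs(write_fn: Callable[[], None], runs: int = 3) -> List[float]:
    """Call write_fn `runs` times and return the wall-clock seconds of each call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        write_fn()  # assumed to block until the whole write job has finished
        timings.append(time.perf_counter() - start)
    return timings


def compare(connectors: Dict[str, Callable[[], None]], runs: int = 3) -> Dict[str, float]:
    """Return the median wall-clock seconds per connector name."""
    return {name: median(time_runs(fn, runs)) for name, fn in connectors.items()}


def slowdown(medians: Dict[str, float], baseline: str, candidate: str) -> float:
    """Return how many times slower `candidate` is than `baseline`."""
    return medians[candidate] / medians[baseline]
```

In a Databricks notebook, the two callables would each wrap the same DataFrame write, once with the Elasticsearch jar attached and once with the OpenSearch jar attached. Taking the median over several runs reduces the effect of cluster warm-up and one-off network delays on the reported ratio.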

What is the expected behavior?

The insert/update timings should match or be similar.

What is your host/environment?

OpenSearch 2.11, Databricks 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12). Both jars below are hosted in S3 buckets:

elasticsearch-spark-30_2.12-8.6.0.jar
opensearch-spark-30_2.12-1.0.1.jar

Do you have any screenshots?

Yes: Test Timings and configs.docx

dblock commented 2 months ago

Catch All Triage - 1, 2, 3

Pallavi-AWS commented 2 months ago

@anirudha will you be able to help out on this? Thanks.