opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[BUG] (?) Training IVF model takes infinite time on larger dataset #1227

Closed · opened by tomekcejner-cloudinary · closed 2 weeks ago

tomekcejner-cloudinary commented 1 year ago

What is the bug? Training a model on a dataset larger than a certain size results in the model being stuck in the training state indefinitely. Experimental evidence shows:

Also, training the HNSW model finishes in a reasonable time even on the 20k dataset.

How can one reproduce the bug? POST /_plugins/_knn/models/ivfpq/_train

training-set is an index with 20k vectors.

{
    "training_index": "training-set",
    "training_field": "embedding_vector",
    "dimension": 512,
    "description": "IVF PQ model trained on 20k set",
    "method": {
        "name": "ivf",
        "engine": "faiss",
        "space_type": "l2",
        "parameters": {
            "encoder": {
                "name": "pq",
                "parameters": {
                    "code_size": 8,
                    "m": 8
                }
            },
            "nlist": 128,
            "nprobes": 128
        }
    }
}
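(For readers reproducing this: the training status can be polled with the model GET API; a minimal sketch using the model id from the request above.)

GET /_plugins/_knn/models/ivfpq

The response's state field should eventually move from training to created; in the scenario described here it stays at training indefinitely.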

What is the expected behavior? The training finishes.

What is your host/environment?

Logs

Perhaps I caused the OOM; the timestamp of the model as returned by the API is "timestamp": "2023-10-06T15:18:01.408890830Z"

[2023-10-06T15:18:13,904][INFO ][o.o.m.j.JvmGcMonitorService] [opensearch-cluster-master-2] [gc][283830] overhead, spent [461ms] collecting in the last [1s]
[2023-10-06T15:18:14,982][WARN ][o.o.m.j.JvmGcMonitorService] [opensearch-cluster-master-2] [gc][283831] overhead, spent [1s] collecting in the last [1s]
[2023-10-06T15:18:15,613][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-cluster-master-2] attempting to trigger G1GC due to high heap usage [529364792]
[2023-10-06T15:18:15,856][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-cluster-master-2] GC did not bring memory usage down, before [529364792], after [530482504], allocations [1], duration [244]
[2023-10-06T15:18:15,988][WARN ][o.o.m.j.JvmGcMonitorService] [opensearch-cluster-master-2] [gc][283832] overhead, spent [986ms] collecting in the last [1s]
[2023-10-06T15:18:17,041][WARN ][o.o.m.j.JvmGcMonitorService] [opensearch-cluster-master-2] [gc][283833] overhead, spent [1s] collecting in the last [1s]
2023-10-06T15:18:17.961  java.lang.OutOfMemoryError: Java heap space
2023-10-06T15:18:17.961  Dumping heap to data/java_pid48.hprof ...
2023-10-06T15:18:20.655  Heap dump file created [717475399 bytes in 2.694 secs]

jmazanec15 commented 1 year ago

@tomekcejner-cloudinary we have an open bug around the model getting stuck in the training state after a crash: #837. How much memory does the machine you are running on have?

tomekcejner-cloudinary commented 1 year ago

It's a 3-node cluster, 30 GB each. It is used for a PoC, so the indices are currently almost empty and barely loaded (up to 13% utilized).

jmazanec15 commented 1 year ago

@tomekcejner-cloudinary what is the heap set at?

tomekcejner-cloudinary commented 1 year ago

@jmazanec15 Heap is set at 512MB if I read correctly:

[Three screenshots of the node JVM settings, taken 2023-10-19]

My localhost instance for development, where training succeeded, has the heap set to 1 GB (running straight from the public Docker image). Do you suggest that the heap may be the culprit?
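(As an aside, the configured heap per node can also be confirmed from the cluster itself rather than from dashboards; a sketch using the standard _cat nodes API, assuming the usual heap.max/heap.percent column names:)

GET _cat/nodes?v&h=name,heap.max,heap.percent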

tomekcejner-cloudinary commented 1 year ago

Update: indeed, the low heap apparently was the culprit. I increased the -Xmx and -Xms to 1g and training ran fine.
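(For anyone else hitting this: where the heap is raised depends on the deployment. A sketch assuming the stock config file or the official Docker image; in a Helm deployment like the one suggested by the node names above, the chart's opensearchJavaOpts value plays the same role.)

# config/jvm.options on each node
-Xms1g
-Xmx1g

# or, for the official Docker image / docker-compose
OPENSEARCH_JAVA_OPTS=-Xms1g -Xmx1g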

In that case, I would say that the expected behavior would be to:

  • stop training gracefully,
  • remove the untrained model,
  • whenever possible, report the real reason in the error message rather than making the user guess.

jmazanec15 commented 1 year ago

@tomekcejner-cloudinary yes, these make sense. We will work to get this prioritized. cc @vamshin

UsenkoArtem commented 4 months ago

Hello! How can I stop training gracefully? I can’t find it in the documentation.

ryanbogan commented 4 months ago

@UsenkoArtem I don’t believe there is currently a way to stop training gracefully, since the majority of the training process happens in the JNI layer. From my understanding, we prevent users from deleting models that are currently training, because the process would still be happening behind the scenes despite the model being deleted.
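(For reference, the model delete API itself does exist; per the comment above it is specifically the model-in-training case that is rejected. A sketch, expected to fail while the model state is still training:)

DELETE /_plugins/_knn/models/ivfpq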

jmazanec15 commented 2 weeks ago

Closing as no plan to address stopping training gracefully for now.