@tomekcejner-cloudinary we have an open bug around the model getting stuck in training after a crash: #837. How much memory does the machine you are running on have?
It's a 3-node cluster with 30 GB each. It is used for a PoC, so the indices are currently almost empty and barely loaded (up to 13% utilized).
@tomekcejner-cloudinary what is the heap set at?
@jmazanec15 Heap is set at 512MB if I read correctly:
My localhost instance for development, where training succeeded, has the heap set to 1 GB (running straight from the public Docker image). Do you suggest that the heap may be the culprit?
Update: indeed, the low heap apparently was the culprit. I increased -Xmx and -Xms to 1g and training ran fine.
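For anyone who hits the same wall, a minimal sketch of the two usual ways to raise the heap, assuming a standard install or the official Docker image (the 1g value mirrors what worked above; pick a value that fits your hardware, and keep -Xms and -Xmx equal):

```
# Option 1: plain install – edit config/jvm.options
-Xms1g
-Xmx1g

# Option 2: Docker – pass the same flags through the environment
docker run -e OPENSEARCH_JAVA_OPTS="-Xms1g -Xmx1g" opensearchproject/opensearch:2.10.0
```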
In that case, I would say that the expected behavior would be:
- stop training gracefully,
- remove the untrained model (a delete call is sketched below),
- whenever possible, tell the real reason in the error message; do not make the user guess.
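For what it's worth, the second point already has an API shape once a model is out of the training state; a minimal sketch, using the model ID from the reproduction below:

```
DELETE /_plugins/_knn/models/ivfpq
```

As of today this call appears to be rejected while the model is still in the training state, which is exactly what makes a stuck model hard to recover from.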
@tomekcejner-cloudinary yes, these make sense. Will work to get this prioritized. cc @vamshin
Hello! How can I stop training gracefully? I can’t find it in the documentation.
@UsenkoArtem I don’t believe there is currently a way to stop training gracefully, since the majority of the training process happens in the JNI layer. From my understanding, we prevent users from deleting models that are currently training, because the process would still be happening behind the scenes despite the model being deleted.
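In the meantime, you can at least watch the model's state with the Get Model API; a minimal sketch using the model ID from the reproduction below (the exact set of returned fields may vary by version):

```
GET /_plugins/_knn/models/ivfpq
```

The response contains a state field that should move from training to created on success (or failed on error); in the bug described here it stays at training indefinitely.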
Closing as no plan to address stopping training gracefully for now.
What is the bug?
Training a model on a dataset larger than a certain size results in the model being stuck in the training state indefinitely. Experimental evidence shows:
Also, training the HNSW model finishes in a reasonable time even on a 20k dataset.
How can one reproduce the bug?
POST /_plugins/_knn/models/ivfpq/_train
training-set is an index with 20k vectors.
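The request body isn't preserved in the report; purely as an illustration, a train call for an IVF-PQ model over such an index could look like the following (the training_field name, dimension, and all parameter values here are made-up placeholders, not the ones from the original report):

```
POST /_plugins/_knn/models/ivfpq/_train
{
  "training_index": "training-set",
  "training_field": "example_vector_field",
  "dimension": 128,
  "method": {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "l2",
    "parameters": {
      "nlist": 128,
      "encoder": {
        "name": "pq",
        "parameters": { "m": 16, "code_size": 8 }
      }
    }
  }
}
```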
What is the expected behavior?
The training finishes.
What is your host/environment?
OpenSearch 2.9, 2.10, with the k-NN plugin.
Logs
Perhaps I caused the OOM; the timestamp of the model as returned by the API is:
"timestamp": "2023-10-06T15:18:01.408890830Z"