The above code also has some serious performance issues ...
Result: the loop prints output like the following for a very long time ...:
Training work is needed but training has not yet entered processing state. Total number of Samples: 4604
Collection Details: {'collection_id': '...', 'name': '...', 'configuration_id': '...', 'language': 'en',
  'status': 'active', 'description': 'Collection for answer database',
  'created': '2017-12-26T20:40:00.578Z', 'updated': '2017-12-26T20:40:00.578Z',
  'document_counts': {'available': 247, 'processing': 0, 'failed': 0},
  'disk_usage': {'used_bytes': 539700},
  'training_status': {'data_updated': '2017-12-26T21:09:21.380Z', 'total_examples': 4604,
    'sufficient_label_diversity': True, 'processing': False, 'minimum_examples_added': True,
    'successfully_trained': '2017-12-26T20:57:21.321Z', 'available': True, 'notices': 87,
    'minimum_queries_added': True}}
Until finally the ranker model begins updating and the output changes to:
Training updates still processing. Total number of Samples: 4604
Collection Details: {'collection_id': '...', 'name': '...',
  'configuration_id': '6d76be66-0cd1-44fd-9ae6-bdfbbac24d1d', 'language': 'en',
  'status': 'active', 'description': 'Collection for answer database',
  'created': '2017-12-26T20:40:00.578Z', 'updated': '2017-12-26T20:40:00.578Z',
  'document_counts': {'available': 247, 'processing': 0, 'failed': 0},
  'disk_usage': {'used_bytes': 539700},
  'training_status': {'data_updated': '2017-12-26T21:09:21.380Z', 'total_examples': 4604,
    'sufficient_label_diversity': True, 'processing': True, 'minimum_examples_added': True,
    'successfully_trained': '2017-12-26T20:57:21.321Z', 'available': True, 'notices': 87,
    'minimum_queries_added': True}}
And then eventually the ranker training finishes.
So with the convoluted logic above I can detect that training samples were added, that the model is out of date, and that it will eventually be updated -- but am I correct in observing that there doesn't seem to be a timely way to kick off training and bring the model and training data into alignment ...?
It's unclear to me what the conditions are under which the model will update ...? With the above logic it seems to be possible to distinguish between these three states:

1. training is needed (training_status.data_updated is newer than training_status.successfully_trained) but processing has not yet started
2. training is in progress (training_status.processing is True)
3. the model is up to date (training_status.successfully_trained is at least as new as training_status.data_updated)

But transitioning from state 1 to state 2 will always take an unknown/arbitrary amount of time ...? And I'm not sure if there exists a fourth (training-error) state ...?
Hi @breathe, thanks for the detailed issue. We approached the Discovery service team, and they asked if you could open an issue on the WDS ideas board (https://ibm-watson.ideas.aha.io/?project=WDS) where the problem can be addressed, since nothing can be done in the SDK.
As recommended by the service team, I created an idea for WDS: https://ibm-watson.ideas.aha.io/ideas/WDS-I-106
Since it is a service feature, I'm closing the issue.
@breathe training will re-run any time the data is updated. The time it takes for training to complete is a function of your data, your plan, and the number of other customers currently training. Moving forward, you should generally expect training to complete within about an hour after the training data has been updated, though we don't make guarantees about this, and there are numerous factors that determine training performance.
For the "hidden fourth state" you can use the boolean indicators and /notices
to determine the extent to which training is currently possible after data has been updated.
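
A rough sketch of that kind of check (not code from this thread; it assumes the watson_developer_cloud DiscoveryV1 Python client, whose query_notices() wraps the /notices endpoint, and the notice result field names here are assumptions):

```python
def training_prerequisites(discovery, environment_id, collection_id):
    """Check the boolean training indicators and fetch any
    error-severity notices for the collection."""
    details = discovery.get_collection(environment_id, collection_id)
    status = details['training_status']
    ready = (status.get('minimum_examples_added')
             and status.get('minimum_queries_added')
             and status.get('sufficient_label_diversity'))
    # Assumed shape: a query-style response whose 'results' entries
    # each carry a 'severity' field ('warning' or 'error').
    notices = discovery.query_notices(environment_id, collection_id)
    errors = [n for n in notices.get('results', [])
              if n.get('severity') == 'error']
    return ready, errors
```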
Hi @michaelkeeting — could you clarify what you mean? I don't see how the Boolean indicators (training and available) can be used to distinguish between any of the above three states (or the hidden fourth). You have to compare the timestamps, as in the pseudocode above, to determine whether training might be scheduled in the future, and even then you wouldn't be able to distinguish a training_failure state from a training_failure + data_updated + waiting_for_training_to_begin_again state ...
The notices API, to my understanding, also won't update until after training has started/finished.
I am adding training samples to Discovery Service from our training set in batches, then evaluating query performance after each batch is added, to collect data on how relevancy performance improves/changes as samples from our training set are added to the collection.
After adding a batch of training samples, the best way I could figure out to determine when the ranking model has updated involves using a method like the one below to poll the collection details API. The method relies on non-obvious logic and on the fact that training_status.successfully_trained and training_status.data_updated return an empty string when the model has never been trained or training data has never been added (the wrap_run_query method handles timeouts/connection errors and is included below just for reference).
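
A minimal sketch of that polling approach (not the original code; it assumes the watson_developer_cloud DiscoveryV1 Python client, where get_collection() returns a plain dict, and uses a simplified stand-in for the wrap_run_query retry helper mentioned above):

```python
import time
from datetime import datetime


def wrap_run_query(fn, *args, retries=3, delay=5, **kwargs):
    """Illustrative retry wrapper for timeouts/connection errors
    (a stand-in for the wrap_run_query helper described above)."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)


def parse_ts(value):
    """Both timestamp fields are the empty string when the model has
    never been trained / no training data has ever been added."""
    if not value:
        return None
    return datetime.strptime(value, '%Y-%m-%dT%H:%M:%S.%fZ')


def training_state(details):
    """Classify collection details into the three observable states."""
    status = details['training_status']
    trained = parse_ts(status.get('successfully_trained'))
    updated = parse_ts(status.get('data_updated'))
    if status.get('processing'):
        return 'training_processing'
    if updated is not None and (trained is None or updated > trained):
        # Data is newer than the last successful training run but
        # processing hasn't started yet -- training should be pending.
        return 'training_scheduled'
    return 'no_training_needed'


def wait_until_trained(discovery, environment_id, collection_id,
                       poll_seconds=30):
    """Poll collection details until the ranker has retrained."""
    while True:
        details = wrap_run_query(discovery.get_collection,
                                 environment_id, collection_id)
        state = training_state(details)
        total = details['training_status'].get('total_examples')
        print('%s -- Total number of Samples: %s' % (state, total))
        if state == 'no_training_needed':
            return details
        time.sleep(poll_seconds)
```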
I'm observing that the training_status.processing field returned by the collection details API doesn't change from False to True until some indeterminate amount of time after a sufficient set of training samples has been added ... With this behavior, the client-side logic needed to evaluate the system's processing state in a stateless manner is kind of ugly ... I think there should be a tri- or quad-state training_status.processing_state value ["no_training_needed", "training_scheduled", "training_processing"], or, if an error case exists, ["no_training_needed", "training_scheduled", "training_processing", "training_error"]. This would give clients a simpler way to determine when the training process has converged after training data has been added to the collection.
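
For illustration, with such a field (processing_state is hypothetical; it does not exist in the current API), client-side convergence detection would reduce to a trivial poll:

```python
import time

# 'training_error' is included on the assumption that an error case exists.
TERMINAL_STATES = {'no_training_needed', 'training_error'}


def wait_for_convergence(discovery, environment_id, collection_id,
                         poll_seconds=30):
    """Poll the hypothetical processing_state field until training
    either converges or fails."""
    while True:
        details = discovery.get_collection(environment_id, collection_id)
        state = details['training_status']['processing_state']  # hypothetical
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
```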