watson-developer-cloud / python-sdk


Very difficult to track state of Discovery Service model training process after adding training samples #339

Closed breathe closed 6 years ago

breathe commented 6 years ago

I am adding training samples from our training set to the Discovery Service in batches, then evaluating query performance after each batch is added, to collect data on how relevancy improves/changes as samples are added to the collection.

After adding a batch of training samples, the best way I could figure out to determine when the ranking model has updated is to poll the collection details API with a method like the one below. The method relies on non-obvious logic and on the fact that training_status.successfully_trained and training_status.data_updated return an empty string when the model has never been trained or training data has never been added. (The wrap_run_query method handles timeouts/connection errors and is included below just for reference.)

import time

import aniso8601
import requests
import urllib3

from watson_developer_cloud import DiscoveryV1, WatsonException


class CustomDiscovery(DiscoveryV1):

    def wrap_run_query(self, run_query, max_failures=10):
        """Wrap a query with error-handling/retry logic"""
        def wrapped():
            num_failures = 0
            timeout = 1
            while True:
                try:
                    num_failures += 1
                    return run_query()
                except (WatsonException,  # pylint: disable=W0703
                        requests.Timeout,
                        requests.exceptions.ReadTimeout,
                        urllib3.exceptions.NewConnectionError,
                        urllib3.exceptions.ConnectionError,
                        urllib3.exceptions.ReadTimeoutError,
                        # the trailing Exception is a deliberate catch-all and
                        # subsumes the specific entries above
                        Exception
                        ) as err:
                    if num_failures > max_failures:
                        print("Watson API failure too many times in a row.  Quitting.")
                        raise err
                    error_message = str(err)
                    if "exceeded the rate limit" in error_message or "Query timed out" in error_message:
                        print("Exceeded rate limit")
                    elif "busy processing" in error_message:
                        print("Hit Update Service Limit")
                    elif 'Query failed' in error_message:
                        print("Hit Query Failed")
                    elif "ConnectTimeoutError" in error_message:
                        print("Hit Connect Timeout")
                    elif "Max retries exceeded" in error_message:
                        print("Failure when connecting")
                    else:
                        print("HIT UNKNOWN EXCEPTION: ", error_message)
                    # self.options is set by the surrounding application (not shown here)
                    if self.options.VERBOSE:
                        print(err)
                    print("Number of Failures: {0} Will retry in {1} seconds".format(num_failures, timeout))
                    time.sleep(timeout)
                    timeout *= 1.5
        return wrapped

    def poll_collection(self, environment_id: str, collection_id: str):
        """poll collection details until finished processing documents/training data"""
        def run_query():
            while True:
                details = self.get_collection(environment_id=environment_id, collection_id=collection_id)
                document_counts = details["document_counts"]
                training = details["training_status"]

                # returns empty string if never trained
                current_model_date = training["successfully_trained"]
                # returns empty string if training data never added
                data_update_date = training["data_updated"]
                if current_model_date:
                    current_model_date = aniso8601.parse_datetime(current_model_date)
                if data_update_date:
                    data_update_date = aniso8601.parse_datetime(data_update_date)

                if document_counts["processing"] > 0:
                    print("Document updates still processing.  {0} documents in processing queue".format(document_counts["processing"]))
                    time.sleep(2)
                elif training["processing"]:
                    print("Training updates still processing.  Total number of Samples: {0}".format(training["total_examples"]))
                    if self.options.VERBOSE:
                        print("Collection Details: ", details)
                    time.sleep(4)
                elif ((current_model_date and data_update_date and current_model_date < data_update_date)
                      or (data_update_date and not current_model_date)):
                    print(
                        "Training work is needed but training has not yet entered processing state.  Total number of Samples: {0}".format(training["total_examples"]))
                    if self.options.VERBOSE:
                        print("Collection Details: ", details)
                    time.sleep(4)
                else:
                    print("Number of documents available state after applying updates: {0}".format(document_counts["available"]))
                    print("Number of documents in processing state after applying updates: {0}".format(document_counts["processing"]))
                    print("Number of documents in failed state after applying updates: {0}".format(document_counts["failed"]))
                    if training:
                        print("Trained ranker available? {0}".format(training["available"]))
                        if training["available"]:
                            print("Model creation date: {0}".format(training["successfully_trained"]))

                        print("Number of training examples: {0}".format(training["total_examples"]))
                        print("Minimum Queries Added: {0}".format(training["minimum_queries_added"]))
                        print("Minimum Examples Added: {0}".format(training["minimum_examples_added"]))
                        print("Sufficient label diversity: {0}".format(training["sufficient_label_diversity"]))
                        print("Number of Notices: {0}".format(training["notices"]))
                    break
            return details
        return self.wrap_run_query(run_query)()
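
For context, this is roughly how I invoke it (a sketch; credentials, version string, and IDs elided):

disco = CustomDiscovery(username='...', password='...', version='...')
details = disco.poll_collection(environment_id='...', collection_id='...')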

I'm observing that the training_status.processing field returned by the collection details API doesn't change from False to True until some indeterminate amount of time after a sufficient set of training samples has been added. With this behavior, the client-side logic needed to evaluate the system's processing state in a stateless manner is kind of ugly. I think there should be a tri- or quad-state training_status.processing_state value -- ["no_training_needed", "training_scheduled", "training_processing"], or, if an error case exists, ["no_training_needed", "training_scheduled", "training_processing", "training_error"]. This would give clients a simpler way to determine when the training process has converged after training data has been added to the collection.
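
For illustration, if such a field existed, the client-side polling would collapse to something like this (a sketch only; processing_state is the proposed field, not part of the current API):

import time

# Hypothetical polling loop against the proposed training_status.processing_state
# field (this field does not exist in the current API).
TERMINAL_STATES = {"no_training_needed", "training_error"}

def poll_training_state(disco, environment_id, collection_id):
    while True:
        details = disco.get_collection(environment_id=environment_id,
                                       collection_id=collection_id)
        state = details["training_status"]["processing_state"]  # proposed field
        if state in TERMINAL_STATES:
            return state
        time.sleep(4)  # still "training_scheduled" or "training_processing"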

breathe commented 6 years ago

The above approach also has a serious performance issue ...

  1. add a batch of training data
  2. poll using poll_collection method above until model has finished building
  3. add more training data
  4. poll again
  5. method detects that training data is newer than the model, but the model update doesn't trigger for a long time (see the sketch below) ...
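
In code, the harness looks roughly like this (a sketch; training_batches, add_training_batch, and evaluate_relevancy are hypothetical stand-ins for my own data and helpers):

# Sketch of the evaluation harness; add_training_batch and evaluate_relevancy
# are hypothetical stand-ins for my own helpers.
disco = CustomDiscovery(username='...', password='...', version='...')
for batch in training_batches:
    add_training_batch(disco, env_id, coll_id, batch)  # steps 1/3: add samples
    disco.poll_collection(env_id, coll_id)             # steps 2/4: wait for model rebuild
    evaluate_relevancy(disco, env_id, coll_id)         # measure relevancy after this batch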

Result: the loop prints this output for a very long time:

Training work is needed but training has not yet entered processing state.  Total number of Samples: 4604
Collection Details: {'collection_id': '...', 'name': '...', 'configuration_id': '...',
  'language': 'en', 'status': 'active', 'description': 'Collection for answer database',
  'created': '2017-12-26T20:40:00.578Z', 'updated': '2017-12-26T20:40:00.578Z',
  'document_counts': {'available': 247, 'processing': 0, 'failed': 0},
  'disk_usage': {'used_bytes': 539700},
  'training_status': {'data_updated': '2017-12-26T21:09:21.380Z', 'total_examples': 4604,
    'sufficient_label_diversity': True, 'processing': False, 'minimum_examples_added': True,
    'successfully_trained': '2017-12-26T20:57:21.321Z', 'available': True, 'notices': 87,
    'minimum_queries_added': True}}

Until finally the ranker model decides to update and the output changes:

Training updates still processing.  Total number of Samples: 4604
Collection Details: {'collection_id': '...', 'name': '...',
  'configuration_id': '6d76be66-0cd1-44fd-9ae6-bdfbbac24d1d',
  'language': 'en', 'status': 'active', 'description': 'Collection for answer database',
  'created': '2017-12-26T20:40:00.578Z', 'updated': '2017-12-26T20:40:00.578Z',
  'document_counts': {'available': 247, 'processing': 0, 'failed': 0},
  'disk_usage': {'used_bytes': 539700},
  'training_status': {'data_updated': '2017-12-26T21:09:21.380Z', 'total_examples': 4604,
    'sufficient_label_diversity': True, 'processing': True, 'minimum_examples_added': True,
    'successfully_trained': '2017-12-26T20:57:21.321Z', 'available': True, 'notices': 87,
    'minimum_queries_added': True}}

And then eventually the ranker training finishes.

So with the convoluted logic above I can detect that training samples were added such that the model is out of date and will eventually be updated -- but am I correct in observing that there doesn't seem to be a timely way to kick off the training to bring the model and training data into alignment ...?

It's unclear to me under what conditions the model will update. With the above logic it seems to be possible to distinguish between these three states:

  1. training_data is newer than model and model will be updated but training hasn't started yet
  2. training_data has been updated and model will be updated and training has started
  3. model is newer than training_data

But transitioning from state 1 to state 2 will always take an unknown/arbitrary amount of time ...? And I'm not sure if there exists a fourth state ... ?

  4. training_data is newer than model and model will never be updated
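
For reference, here is roughly how the state classification falls out of the timestamp comparison in my polling code (a sketch using the same fields and conventions as the code above; the hidden fourth state is indistinguishable from state 1):

import aniso8601

# Sketch: classify the observable training states from a collection-details
# response, using the same empty-string/timestamp conventions as above.
def classify_training_state(details):
    training = details["training_status"]
    trained = training["successfully_trained"]
    updated = training["data_updated"]
    trained = aniso8601.parse_datetime(trained) if trained else None
    updated = aniso8601.parse_datetime(updated) if updated else None

    if training["processing"]:
        return "state 2: training has started"
    if updated and (trained is None or trained < updated):
        return "state 1: training pending (or hidden state 4: never?)"
    return "state 3: model is newer than training data"
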
ehdsouza commented 6 years ago

Hi @breathe, thanks for the detailed issue. We approached the Discovery service team, and they asked if you could open an issue on the WDS ideas board (https://ibm-watson.ideas.aha.io/?project=WDS), where the problem can be addressed, as nothing can be done in the SDK.

ehdsouza commented 6 years ago

As recommended by the service team, I created an idea for WDS: https://ibm-watson.ideas.aha.io/ideas/WDS-I-106

As it is a service feature, I'm closing this issue.

michaelkeeling commented 6 years ago

@breathe training will re-run any time the data is updated. The time it takes training to complete is a function of your data, your plan, and the number of other customers training at the same time. Moving forward you should generally expect training to complete around an hour or less after the training data has been updated, though we don't make guarantees about this and there are numerous factors that determine training performance.

For the "hidden fourth state" you can use the boolean indicators and /notices to determine the extent to which training is currently possible after data has been updated.

breathe commented 6 years ago

Hi @michaelkeeling, could you clarify what you mean? I don't see how you can use the Boolean indicators (processing and available) to distinguish between any of the above 3 states (and the hidden 4th state). You have to compare the timestamps as in the pseudocode above to determine whether training might be scheduled in the future, and you wouldn't be able to distinguish between a training_failure state and a training_failure+data_updated+waiting_for_training_to_begin_again state...

The notices API, to my understanding, also won't update until after training has started/finished.