googleapis / python-aiplatform

A Python SDK for Vertex AI, a fully managed, end-to-end platform for data science and machine learning.
Apache License 2.0

Batch prediction results are not guaranteed to be in the right order #1166

Open ageron opened 2 years ago

ageron commented 2 years ago

Environment details

Steps to reproduce

  1. Run a BatchPredictionJob that outputs multiple prediction-results-xxxxx-to-xxxxx files.
  2. Observe that the order of the predictions does not always match the order of the inputs.

Code example

This happens with official code examples such as sdk-custom-image-classification-batch.ipynb. The relevant part of the code is this:

# Get downloaded results in directory
results_files = []
for dirpath, subdirs, files in os.walk(latest_directory):
    for file in files:
        if file.startswith("prediction.results"):
            results_files.append(os.path.join(dirpath, file))

# Consolidate all the results into a list
results = []
for results_file in results_files:
    # Download each result
    with open(results_file, "r") as file:
        results.extend([json.loads(line) for line in file.readlines()])

Firstly, os.walk() does not guarantee the order in which files are listed. In practice it seems to return them in order, but it is brittle to rely on this.
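A minimal sketch of how the file listing could be made deterministic, assuming the shard file names sort lexicographically in shard order (an assumption, not something documented in this thread). The helper name list_result_files is hypothetical. This only removes the dependence on os.walk() ordering; it does not make the predictions inside the files line up with the inputs.

```python
import os


def list_result_files(root, prefix="prediction.results"):
    """Return paths of result files under `root`, sorted by full path.

    `list_result_files` is a hypothetical helper; the default prefix is the
    one used in the notebook snippet above.
    """
    paths = []
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            if name.startswith(prefix):
                paths.append(os.path.join(dirpath, name))
    # Sorting makes the result independent of the order os.walk() yields names.
    return sorted(paths)
```

Calling list_result_files(latest_directory) in place of the first loop gives the same set of files in a repeatable order.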

Secondly, and more importantly, I've run into cases where the files were not in the same order as the inputs. I would get 7% accuracy on MNIST, then by just reversing the order of the prediction files, I would get 100%.

Thirdly, I haven't tested it but I suspect that the order would also be wrong if there's any error on any instance.

Lastly, the inputs may sometimes be large, and it's not efficient to include them in the predictions. I would much rather have an input identifier, such as its source file and its line index.

weichungw commented 2 years ago

Hi ageron,

Thanks for the feedback.

Yes, batch prediction does not guarantee order. Currently we expect customers to join the outputs to the inputs themselves. We totally agree that indexed inputs are a better solution. The feature to support that will be opened to a limited set of customers next week, and we plan to release it publicly in the coming weeks. Please let us know if you are willing to try it out.
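For illustration, a minimal sketch of such a join, assuming each output line is a JSON object that echoes the input under an "instance" key next to a "prediction" key. That layout is an assumption here, and the names canonical_key and align_predictions are hypothetical. Instances are matched on a canonical JSON string, so the match fails if the service re-serializes values differently (for example 1.0 versus 1).

```python
import json


def canonical_key(instance):
    """Serialize an instance to a stable string usable as a dict key."""
    return json.dumps(instance, sort_keys=True, separators=(",", ":"))


def align_predictions(input_lines, output_lines):
    """Return predictions reordered to match the order of the input lines.

    Assumes every output line looks like {"instance": ..., "prediction": ...}.
    Inputs with no matching output (e.g. failed instances) map to None.
    """
    by_instance = {}
    for line in output_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        by_instance[canonical_key(record["instance"])] = record.get("prediction")

    aligned = []
    for line in input_lines:
        if not line.strip():
            continue
        aligned.append(by_instance.get(canonical_key(json.loads(line))))
    return aligned
```

Because the lookup is keyed on the instance content, the result does not depend on the order of the output files or of the lines within them. Duplicate inputs share one entry, which is fine as long as identical inputs yield identical predictions.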

ageron commented 2 years ago

Thanks @weichungw , yes I'd love to try it out.

david-dirring commented 2 years ago

This is still an issue, right? I don't think this is a feature request. It feels like a bug to me, and it makes the product unusable.

Any updates on passing in an index/key field?

Thanks

tfriedel commented 1 week ago

I still experience this issue with Gemini.

I make requests like this:

aiplatform.BatchPredictionJob.create(
    job_display_name=f"call_analysis_batch_{timestamp}",
    model_name=f"projects/{voice_project_id}/locations/{voice_region}/publishers/google/models/{voice_model_name}",
    instances_format="jsonl",
    predictions_format="jsonl",
    gcs_source=input_uri,
    gcs_destination_prefix=output_uri,
    sync=True,
)

The lines in the output JSONL don't match the order of the lines in the input JSONL.