googleapis / python-automl

This library has moved to https://github.com/googleapis/google-cloud-python/tree/main/packages/google-cloud-automl
Apache License 2.0

Capability to remove labels before training #33

Closed happyhuman closed 3 years ago

happyhuman commented 4 years ago

I am trying to train a text classifier with AutoML. The imported CSV file contains some labels that occur rarely (e.g. only 7 cases). If I import the dataset and try to train on it in the UI, I get an error message in the TRAIN tab saying that some labels do not have enough training samples. The fix is to go to the ITEMS tab in the UI and manually remove each such label from the dataset entirely, one by one. After that, it is possible to go back to the TRAIN tab and start the training job.

However, this is not possible using the client library: I cannot find such rare labels or remove them. So, as a feature request, I think the client library needs the following two capabilities to match what is possible in the UI: 1- Get a list of labels, along with the number of samples/occurrences for each label. 2- Remove labels from the dataset, given their name or ID.

Without these two capabilities, it is impossible to get to the training step unless the dataset is cleaned before it reaches AutoML. That is not always trivial, for instance when the CSV file is generated by Dataflow. In that case (which is the one I am dealing with), we need to implement another step that imports the generated CSV and cleans it beforehand. It would be great if the AutoML library could handle this step.
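For reference, the pre-cleaning step described above could look roughly like the sketch below. It uses only the Python standard library, not the AutoML client library. It assumes a simple single-label CSV where the label is the last column of each row, and the function names (`count_labels`, `drop_rare_labels`, `clean_csv`) and the `min_count` threshold are hypothetical; the actual minimum number of samples per label should be checked against the AutoML documentation.

```python
import csv
import io
from collections import Counter


def count_labels(rows, label_col=-1):
    """Return a Counter mapping each label to the number of rows that use it."""
    # Empty rows (e.g. blank lines in the CSV) are ignored.
    return Counter(row[label_col] for row in rows if row)


def drop_rare_labels(rows, min_count, label_col=-1):
    """Keep only the rows whose label occurs at least min_count times."""
    rows = [row for row in rows if row]
    counts = count_labels(rows, label_col)
    return [row for row in rows if counts[row[label_col]] >= min_count]


def clean_csv(src, dst, min_count, label_col=-1):
    """Copy CSV rows from src to dst, skipping rows with rare labels.

    Returns a dict of the dropped labels and their occurrence counts.
    Assumption: one label per row, stored in column label_col.
    """
    rows = list(csv.reader(src))
    counts = count_labels(rows, label_col)
    csv.writer(dst).writerows(drop_rare_labels(rows, min_count, label_col))
    return {label: n for label, n in counts.items() if n < min_count}


if __name__ == "__main__":
    # Tiny in-memory example; real input would be the generated CSV file.
    sample = "good product,positive\nbad product,negative\nlove it,positive\n"
    cleaned = io.StringIO()
    dropped = clean_csv(io.StringIO(sample), cleaned, min_count=2)
    print("dropped labels:", dropped)
    print(cleaned.getvalue())
```

The first function covers the "list labels with their counts" half of the request, and the second covers the "remove labels" half, but only on the CSV before import. A multi-label file (several label columns per row) would need a different counting rule.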

followthemoney1 commented 3 years ago

Any updates on this?

telpirion commented 3 years ago

I'm afraid this isn't possible with the AutoML client library. I'm going to close this issue for now.

If you need to manipulate AutoML NL labels before training, I suggest using a tool like doccano.