I am trying to train a text classifier with AutoML. The imported CSV file contains some labels that occur only rarely (e.g. only 7 examples). If I import the dataset and try to train it in the UI, I get an error message in the TRAIN tab saying that some labels do not have enough training samples. The fix is to go to the ITEMS tab in the UI and manually remove each such label from the dataset completely. After that, I can go back to the TRAIN tab and start the training job.
However, this is not possible with the client library: I cannot find a way to identify such rare labels or to remove them. So, as a feature request, I think the client library needs the following two features to match what is possible in the UI:
1- Get a list of labels, along with the number of samples/occurrences for each label.
2- Remove some labels from the dataset given their name or ID.
Without these two steps, it is impossible to get to the training step unless the dataset is cleaned before it reaches AutoML. That is not always trivial, for instance when the CSV file is generated by Dataflow. In that case (which is the one I am dealing with), we need to implement an extra step that reads the generated CSV and cleans it beforehand. It would be great if the AutoML library could handle this step.
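As an illustration of the pre-cleaning workaround described above, here is a minimal sketch in plain Python. It makes no AutoML calls and is not a substitute for the requested client library features. It assumes each CSV row carries exactly one label in its last column, and that 10 is a suitable minimum number of samples per label; both are assumptions, so adjust them to the actual file layout and to whatever threshold the TRAIN tab error reports. The function names are hypothetical.

```python
import csv
from collections import Counter


def count_labels(rows, label_col=-1):
    """Return a Counter mapping each label to the number of rows that use it."""
    return Counter(row[label_col] for row in rows if row)


def drop_rare_labels(rows, min_count=10, label_col=-1):
    """Split rows into those to keep and a report of the rare labels dropped.

    min_count=10 is an assumed threshold, not a value taken from the library.
    """
    rows = [row for row in rows if row]  # ignore blank lines
    counts = count_labels(rows, label_col)
    kept = [row for row in rows if counts[row[label_col]] >= min_count]
    rare = {label: n for label, n in counts.items() if n < min_count}
    return kept, rare


def clean_csv(src_path, dst_path, min_count=10):
    """Read src_path, drop rows with rare labels, write the rest to dst_path."""
    with open(src_path, newline="", encoding="utf-8") as src:
        rows = list(csv.reader(src))
    kept, rare = drop_rare_labels(rows, min_count)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(kept)
    return rare  # label -> count for every label that was removed
```

Calling `clean_csv("generated.csv", "cleaned.csv")` (hypothetical file names) would write the filtered rows and return the labels that were dropped along with their counts, which covers both steps of the request locally, before the import.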