inoueakimitsu opened this issue 3 years ago
It might be better to let the user control the learning schedule freely by allowing an already trained learner to be passed as an argument.
Should the subsampling be for bags or instances?
The advantage of the bag-level method is its simplicity. Because of that simplicity, it combines well with the option of passing a trained model as an argument. However, if the number of bags is not large, there is a concern that accuracy will deteriorate.
The instance-level method is more complicated in that the values of lower_threshold and upper_threshold need to be changed. For example, to subsample by a factor of 10, multiply lower_threshold and upper_threshold by 0.1. If this makes lower_threshold and upper_threshold too small, the target instances may not be included in any of the subsamples. Therefore, the subsample ratio may need to be changed for each bag.
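As a rough illustration of this per-bag scaling, here is a minimal sketch; the variable names `bags`, `lower_threshold`, and `upper_threshold` follow the snippets later in this thread, and the ratio handling is an assumption, not part of the library:

```python
import numpy as np

rng = np.random.default_rng(0)
subsample_ratio = 0.1  # keep roughly one instance in ten

sub_bags, sub_lower, sub_upper = [], [], []
for i, bag in enumerate(bags):
    # keep at least one instance so small bags do not vanish
    n_keep = max(1, int(round(len(bag) * subsample_ratio)))
    idx = rng.choice(len(bag), size=n_keep, replace=False)
    ratio = n_keep / len(bag)  # the realized per-bag ratio, not the nominal one
    sub_bags.append(bag[idx])
    # naive scaling of the count limits; the next paragraphs discuss a
    # posterior-based widening of these limits
    sub_lower.append(np.floor(lower_threshold[i] * ratio))
    sub_upper.append(np.ceil(upper_threshold[i] * ratio))
```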
Subsampling at the instance level changes the lower and upper limits on the number of target instances. The actual number of target instances in a subsampled bag is then described by a probability distribution. As a prior, one could assume that the bag contains a number of target instances uniformly distributed between the specified lower and upper limits.
Rather than simply scaling the upper and lower limits by the subsampling ratio, it would be better to raise the upper limit and lower the lower limit to take this uncertainty into account.
We can compute the posterior probability that the number of target instances falls short of the lower limit and, conversely, the posterior probability that it exceeds the upper limit. Let's use these.
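Here is a rough sketch of that uncertainty-aware adjustment, assuming a uniform prior over the true target count and sampling without replacement (so the retained count is hypergeometric); the function name and the quantile level `alpha` are assumptions for illustration:

```python
import numpy as np
from scipy.stats import hypergeom

def adjusted_limits(n_bag, n_keep, lower, upper, alpha=0.05):
    """Widened target-count limits for a subsampled bag (illustrative sketch).

    n_bag: original bag size, n_keep: number of retained instances,
    lower/upper: original limits on the number of target instances.
    A uniform prior over the true count K in [lower, upper] is assumed;
    given K, the retained count k follows Hypergeom(n_bag, K, n_keep).
    """
    ks = np.arange(lower, upper + 1)        # candidate true counts K
    support = np.arange(0, n_keep + 1)      # possible retained counts k
    # marginal P(k) under the uniform prior on K
    pk = np.mean([hypergeom.pmf(support, n_bag, K, n_keep) for K in ks], axis=0)
    cdf = np.cumsum(pk)
    new_lower = support[np.searchsorted(cdf, alpha)]        # low quantile
    new_upper = support[np.searchsorted(cdf, 1.0 - alpha)]  # high quantile
    return new_lower, new_upper

# e.g. a bag of 200 instances with 20-60 targets, keeping 20 instances
print(adjusted_limits(200, 20, 20, 60))
```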
In addition to subsampling, the following approaches can be taken:
One way to implement a distributed algorithm is as follows:
When the instances of a bag are subsampled, the minority classes in that bag are very likely to disappear. If the class imbalance is strong, instance-level subsampling is not practical.
I came up with a fast way to find the initial labels without subsampling. The proposed method uses clustering and bag-wise regression. The specific steps are as follows:
1. Flatten all the bags.
2. Cluster all instances.
3. Express the clustering results as one-hot vectors.
4. Find the within-bag sum of the one-hot vectors of the instances in each bag.
5. Regress the result of step 4 on the bag's label (the lower or upper limit on the number of instances of each class). A constraint on the L0 norm may be necessary. The intercept term is set to 0.

   n_class1 ~ 0 + a1 * n_cluster1 + a2 * n_cluster2 + ...   (Eq. 1)

6. In the above equation, `a1` represents the proportion of instances in cluster 1 that belong to class 1. Based on this result, determine the initial labels of the instances.
minimal code:
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

# `bags` is a list of (n_instances, n_features) arrays; `lower_threshold` is
# the per-bag lower limit on the number of instances of each class.

# Steps 1-2: flatten all the bags and cluster all instances.
flatten_features = np.vstack(bags)
n_clusters = 15
cluster_generator = KMeans(n_clusters=n_clusters, random_state=0)
cluster_generator.fit(flatten_features)

# Step 3: express the cluster assignments as one-hot vectors.
# (Note: `sparse` was renamed to `sparse_output` in scikit-learn >= 1.2.)
cluster_encoder = OneHotEncoder(sparse=False)
cluster_encoder.fit(np.arange(n_clusters).reshape(-1, 1))
one_hot_encoded = [
    cluster_encoder.transform(
        cluster_generator.predict(b).reshape((-1, 1))) for b in bags]

# Step 4: within-bag sums of the one-hot vectors, one row of cluster counts per bag.
bag_cluster_table = np.array([np.sum(x, axis=0) for x in one_hot_encoded])
print(bag_cluster_table[:8])
print("---")
print(lower_threshold[:8])

# Step 5: relate the count thresholds to the per-bag cluster counts, e.g.
# lower_threshold [bag x class] * W_lower [class x cluster] <= bag_cluster_table [bag x cluster]
# upper_threshold [bag x class] * W_upper [class x cluster] >= bag_cluster_table [bag x cluster]
```
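To complete the regression step of Eq. (1), one possibility is a non-negative least squares fit with no intercept, continuing from the snippet above; treating `lower_threshold` as the regression target and assigning each cluster its dominant class are assumptions for illustration:

```python
from sklearn.linear_model import LinearRegression

# Sketch of Eq. (1): regress the per-bag class count limits on the per-bag
# cluster counts, with non-negative coefficients and no intercept.
reg = LinearRegression(fit_intercept=False, positive=True)
reg.fit(bag_cluster_table, lower_threshold)

# reg.coef_ has shape (n_classes, n_clusters); coef_[c, j] roughly estimates
# the fraction of cluster j's instances that belong to class c.
cluster_to_class = reg.coef_.argmax(axis=0)

# Initial instance labels: assign each instance its cluster's dominant class.
initial_labels = [cluster_to_class[cluster_generator.predict(b)] for b in bags]
```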
I have implemented this idea: clustermil.
It required some non-trivial improvements, but I basically followed this idea. The accuracy of this method may not be good, but it converges quickly, which makes it a good candidate for generating initial labels.
How about kernel approximation combined with an SGDClassifier-based linear SVM?
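For reference, a minimal sketch of that combination with scikit-learn (Nystroem for the kernel approximation); the hyperparameters and the use of `flatten_features` with flattened initial labels are assumptions:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier

# Approximate an RBF kernel explicitly, then train a linear SVM with SGD.
# Training stays roughly linear in the number of instances, which suits the
# large-instance-count setting discussed in this issue.
approx_svm = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0),
    SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=0),
)
# hypothetical usage with the flattened features and flattened initial labels:
# approx_svm.fit(flatten_features, np.concatenate(initial_labels))
```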
The following policy seems like a good one.
When the total number of instances is large, the computation time of most scikit-learn estimators becomes unacceptably long. In the early iterations of multiple instance learning, it may not be necessary to include all instances in the training data. Therefore, we will add a function that adjusts the number of subsampled instances used for training according to the progress of convergence of the multiple instance learning.
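A rough sketch of one such schedule, where the subsample fraction grows as the fraction of instance labels that changed in the previous iteration shrinks; the function name, thresholds, and rates are assumptions rather than part of the library:

```python
import numpy as np

def subsample_fraction(label_change_rate, min_frac=0.1, max_frac=1.0):
    """Fraction of instances to use in the next MIL iteration (illustrative).

    label_change_rate: fraction of instance labels that changed in the
    previous iteration (close to 1.0 at the start, near 0.0 at convergence).
    """
    frac = min_frac + (max_frac - min_frac) * (1.0 - label_change_rate)
    return float(np.clip(frac, min_frac, max_frac))

# few instances while labels are still moving, all instances near convergence
print(subsample_fraction(0.9), subsample_fraction(0.05))
```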