inoueakimitsu / milwrap

Wrapping single instance learning algorithms for fitting them to data for multiple instance learning
MIT License

Allow for subsampling when inputting large training data #1

Open inoueakimitsu opened 2 years ago

inoueakimitsu commented 2 years ago

When the total number of instances is large, the training time of most scikit-learn estimators becomes unacceptably long. In the early iterations of multiple instance learning, it may not be necessary to include all instances in the training data. Therefore, we will add a feature that adjusts the number of subsampled instances used for training according to how far the multiple instance learning has converged.

inoueakimitsu commented 2 years ago

It might be better to let the user set the learning schedule freely by allowing a pre-trained learner to be passed as an argument.

inoueakimitsu commented 2 years ago

Should the subsampling be for bags or instances?

The advantage of subsampling bags is its simplicity. Because of that simplicity, it combines well with the idea of passing a pre-trained model as an argument. However, if the number of bags is not large, accuracy may deteriorate.

Subsampling instances is more complicated because the values of lower_threshold and upper_threshold must be changed as well. For example, to subsample instances by a factor of 10, lower_threshold and upper_threshold should be multiplied by 0.1. If this makes them too small, the target instances may not be included in any subsample, so the subsampling ratio may have to be adjusted per bag, as in the sketch below.
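
A minimal sketch of this per-bag scaling, assuming bags is a list of (n_i, d) arrays and lower_threshold / upper_threshold hold the per-bag count limits (the variable names follow the snippets later in this thread; the floor of one kept instance per bag is an assumption):

import numpy as np

rng = np.random.default_rng(0)
ratio = 0.1  # keep roughly 10% of the instances in each bag

subsampled_bags, scaled_lower, scaled_upper = [], [], []
for bag, lo, up in zip(bags, lower_threshold, upper_threshold):
    n = len(bag)
    r = max(ratio, 1.0 / n)  # per-bag ratio: never drop below one instance
    keep = rng.choice(n, size=max(1, int(round(r * n))), replace=False)
    subsampled_bags.append(bag[keep])
    # scale the count limits by the same ratio; rounding the lower limit down
    # and the upper limit up keeps the interval slightly wider than naive scaling
    scaled_lower.append(np.floor(lo * r))
    scaled_upper.append(np.ceil(up * r))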

inoueakimitsu commented 2 years ago

Subsampling instances changes the lower and upper limits on the number of target instances. The actual number of target instances in a subsampled bag then follows some probability distribution. As a prior, we could assume that the number of target instances in the bag is uniformly distributed between the specified lower and upper limits.

Rather than simply scaling the upper and lower limits by the subsampling ratio, it would be better to raise the upper limit and lower the lower limit to account for this uncertainty.

inoueakimitsu commented 2 years ago

We can compute the posterior probability that the number of target instances in the subsample does not reach the lower limit, and conversely the posterior probability that it exceeds the upper limit. Let's use these probabilities, as in the sketch below.
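
A minimal sketch of how these probabilities could be computed, assuming the uniform prior above and simple random subsampling within a bag (the hypergeometric model and the function below are illustrations, not part of milwrap):

import numpy as np
from scipy.stats import hypergeom

def out_of_range_probability(n_instances, lower, upper,
                             subsample_size, sub_lower, sub_upper):
    # Uniform prior on the true number of target instances k in [lower, upper].
    # Given k, the number of targets in a random subsample of size subsample_size
    # is hypergeometric; return the probability that this count falls below
    # sub_lower or exceeds sub_upper.
    ks = np.arange(lower, upper + 1)
    prob = 0.0
    for k in ks:
        rv = hypergeom(M=n_instances, n=k, N=subsample_size)
        prob += (rv.cdf(sub_lower - 1) + rv.sf(sub_upper)) / len(ks)
    return prob

# Example: a bag of 1000 instances with 50-100 targets, subsampled to 100
# instances and checked against the naively scaled limits 5 and 10.
print(out_of_range_probability(1000, 50, 100, 100, 5, 10))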

inoueakimitsu commented 2 years ago

In addition to subsampling, the following approach can be taken.

One way to implement a distributed algorithm is as follows:

  1. Randomly divide the bags into subsets.
  2. Train a classifier on each subset.
  3. Combine the trained classifiers with sklearn.ensemble.VotingClassifier (see the sketch below).
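
A minimal sketch of this scheme, assuming bags, lower_threshold and upper_threshold as used elsewhere in this thread, and with train_mil_classifier as a placeholder for running the milwrap training procedure on one subset (not an actual function name). Since VotingClassifier refits its estimators in fit(), the pre-fitted subset classifiers are combined here by the same soft-vote averaging that VotingClassifier(voting="soft") performs:

import numpy as np

rng = np.random.default_rng(0)
n_subsets = 4
subset_ids = rng.integers(n_subsets, size=len(bags))  # step 1: random split of the bags

classifiers = []
for s in range(n_subsets):
    idx = np.where(subset_ids == s)[0]
    subset_bags = [bags[i] for i in idx]
    # step 2: train one instance-level classifier per subset (placeholder call)
    classifiers.append(
        train_mil_classifier(subset_bags, lower_threshold[idx], upper_threshold[idx]))

# step 3: soft voting over the subset classifiers
def predict_soft_vote(X):
    probas = np.mean([clf.predict_proba(X) for clf in classifiers], axis=0)
    return probas.argmax(axis=1)
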
inoueakimitsu commented 2 years ago

When the instances of a bag are subsampled, the rarest classes in that bag are likely to disappear. If the class imbalance is strong, instance-level subsampling is therefore not practical.

inoueakimitsu commented 2 years ago

I came up with a fast way to find the initial labels without subsampling. The proposed method uses clustering and bag-wise regression. The specific steps are as follows:

  1. Flatten all the bags.
  2. Cluster all instances.
  3. Express the clustering results as one-hot vectors.
  4. For each bag, take the within-bag sum of the one-hot vectors of its instances.
  5. Regress the result of step 4 onto the bag-level labels (the lower or upper limits on the instance counts), as in Eq. (1). An L0-norm constraint may be necessary. The intercept term is set to 0.

n_class1 ~ 0 + a1 * n_cluster1 + a2 * n_cluster2 + ...    (Eq. 1)

  6. In Eq. (1), a1 represents the fraction of instances in n_cluster1 that belong to n_class1. Based on this result, determine the initial labels of the instances.

inoueakimitsu commented 2 years ago

Minimal code for the above idea:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

# Steps 1-2: flatten all bags and cluster all instances.
flatten_features = np.vstack(bags)
n_clusters = 15
cluster_generator = KMeans(n_clusters=n_clusters, random_state=0)
cluster_generator.fit(flatten_features)

# Step 3: one-hot encode the cluster indices.
# (On scikit-learn >= 1.2, use sparse_output=False instead of sparse=False.)
cluster_encoder = OneHotEncoder(sparse=False)
cluster_encoder.fit(np.arange(n_clusters).reshape(-1, 1))

one_hot_encoded = [
    cluster_encoder.transform(
        cluster_generator.predict(b).reshape((-1, 1))) for b in bags]

# Step 4: per-bag cluster counts (within-bag sums of the one-hot vectors).
bag_cluster_table = np.array([np.sum(x, axis=0) for x in one_hot_encoded])
print(bag_cluster_table[:8])
print("---")
print(lower_threshold[:8])  # per-bag lower limits on the class counts

# Step 5 (cf. Eq. 1): find W [cluster x class] such that
# lower_threshold [bag x class] <= bag_cluster_table [bag x cluster] @ W <= upper_threshold [bag x class]
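
A sketch of the regression step itself, on top of the snippet above. It assumes lower_threshold is an array of shape [n_bags, n_classes] and uses a non-negativity constraint in place of the L0 constraint mentioned earlier:

from sklearn.linear_model import LinearRegression

# Eq. (1) with zero intercept: regress the per-bag cluster counts onto the
# per-bag class count limits, with non-negative coefficients.
regressor = LinearRegression(fit_intercept=False, positive=True)
regressor.fit(bag_cluster_table, lower_threshold)

# coef_ has shape [n_classes, n_clusters]; entry (c, k) estimates the share of
# cluster k's instances that belong to class c.
cluster_to_class = regressor.coef_.argmax(axis=0)

# Initial instance labels: each instance takes the dominant class of its cluster.
initial_labels = [cluster_to_class[cluster_generator.predict(b)] for b in bags]
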
inoueakimitsu commented 2 years ago

I have implemented this idea: clustermil.

It required some non-trivial improvements, but I basically followed this idea. The accuracy of this method may not be high, but it converges quickly, which makes it a good candidate for generating initial labels.

inoueakimitsu commented 2 years ago

How about a kernel approximation combined with an SGDClassifier-based linear SVM?
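
A minimal sketch of such a base learner, using Nystroem as the kernel approximation (this is only a candidate configuration; loss="hinge" gives a linear SVM, and loss="log_loss" can be used instead if the MIL wrapper needs probability estimates):

from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Approximate RBF kernel map followed by a linear model trained with SGD.
base_learner = make_pipeline(
    Nystroem(kernel="rbf", n_components=100, random_state=0),
    SGDClassifier(loss="hinge", max_iter=1000, random_state=0),
)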

inoueakimitsu commented 2 years ago

The following policy is a good one.