ntucllab / libact

Pool-based active learning in Python
http://libact.readthedocs.org/
BSD 2-Clause "Simplified" License

kl_divergence inside QBC does not work #134

Open terry07 opened 6 years ago

terry07 commented 6 years ago

I am trying to get sensible results from the kl_divergence disagreement method, but every run fails with the error: "mtrand.pyx", line 1121, in mtrand.RandomState.choice - ValueError: a must be non-empty

Any ideas? Thanks in advance.

yangarbiter commented 6 years ago

It seems like the avg_kl is empty in this case. https://github.com/ntucllab/libact/blob/master/libact/query_strategies/query_by_committee.py#L208

Can you make sure that the unlabeled pool is not empty?

terry07 commented 6 years ago

Thanks for the pointer. I added some debug flags and noticed that the avg_kl ndarray consists of nan values, with only one exception. Is this the expected behavior?
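A minimal sketch of why an avg_kl array that is mostly nan can end in "a must be non-empty". It assumes the strategy picks at random among the indices whose score is close to the maximum; that selection rule is an assumption about the linked code, not a quote of it, and the array values are made up for illustration.

```python
import numpy as np

# Hypothetical scores: every entry is nan except one, as described above.
avg_kl = np.array([np.nan, np.nan, 0.12, np.nan])

# np.max propagates nan, so the "best" score is nan as well.
best = np.max(avg_kl)

# nan never compares as close to anything (equal_nan defaults to False),
# so no index qualifies and the candidate array is empty.
candidates = np.where(np.isclose(avg_kl, best))[0]

rng = np.random.RandomState(0)
raised = False
try:
    rng.choice(candidates)  # choosing from an empty array raises ValueError
except ValueError:
    raised = True

# A nan-aware alternative that ignores the nan entries.
safe_index = int(np.nanargmax(avg_kl))
```

Here candidates is empty, so the random choice fails even though the unlabeled pool itself is not empty. The nan-aware lookup only hides the symptom; the nan values still need to be traced to their source.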

yangarbiter commented 6 years ago

I don't think this is the expected behavior.

I guess these nan values are produced by the log function at L156.
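To illustrate how a log call can produce nan, here is a small sketch of a KL divergence computed term by term with no guard against zeros. The function naive_kl and the example vectors are hypothetical and are not taken from libact.

```python
import numpy as np

def naive_kl(p, q):
    """KL(p || q) computed term by term, with no guard against zeros."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Silence the divide-by-zero and invalid-value warnings so the
    # nan result can be inspected directly.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.sum(p * np.log(p / q))

consensus = np.array([0.5, 0.3, 0.2])
soft_vote = np.array([0.6, 0.3, 0.1])   # no zero entries
hard_vote = np.array([1.0, 0.0, 0.0])   # contains exact zeros

kl_soft = naive_kl(soft_vote, consensus)  # finite
kl_hard = naive_kl(hard_vote, consensus)  # nan, because 0 * log(0) = 0 * -inf
```

A single zero probability is enough: log(0) is -inf, and multiplying it by 0 gives nan, which then propagates through the sum.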

One thing to check is the probability output of the students at L204. Which model are you using for the students?

terry07 commented 6 years ago

I am using ExtraTrees and SVC. I also tried LogisticRegression, as in the corresponding example script, but I got the same error.

yangarbiter commented 6 years ago

Can you use a Python debugger to inspect the variable proba at L205 and check that every value in it is a valid probability (0 < p < 1, and each row sums to 1)?
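The check can also be scripted instead of done by hand in a debugger. The sketch below assumes proba is an array whose last axis holds the class probabilities; the helper name check_proba and the sample array are hypothetical.

```python
import numpy as np

def check_proba(proba, atol=1e-8):
    """Report common problems in an array of class probabilities.

    The last axis is assumed to hold the per-class probabilities.
    """
    proba = np.asarray(proba, dtype=float)
    return {
        "has_nan": bool(np.isnan(proba).any()),
        "out_of_range": bool(((proba < 0) | (proba > 1)).any()),
        "has_zero": bool((proba == 0).any()),
        "rows_sum_to_one": bool(
            np.allclose(proba.sum(axis=-1), 1.0, atol=atol)
        ),
    }

# Made-up example with shape (n_samples=2, n_students=2, n_classes=3).
proba = np.array([
    [[0.7, 0.3, 0.0], [0.2, 0.5, 0.3]],
    [[0.1, 0.1, 0.8], [1.0, 0.0, 0.0]],
])
report = check_proba(proba)
```

In this example every row sums to 1 and all values lie in [0, 1], yet has_zero is True. A check on the sums alone would not reveal the exact zeros that break the logarithm.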

Thanks.

terry07 commented 6 years ago

The dimensions of the exported proba are: (935, 3, 8) -> (number of unlabeled instances, number of students, number of classes)

The result of print np.sum(proba[:,0]), np.sum(proba[:,1]), np.sum(proba[:,2]) is 935.0 935.0 935.0, and none of the values violates the probability constraints.

yangarbiter commented 6 years ago

The probabilities in proba and consensus should also not be 0: https://github.com/ntucllab/libact/blob/master/libact/query_strategies/query_by_committee.py#L153. Maybe a small epsilon should be added to every probability output.
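A minimal sketch of the epsilon idea, assuming proba has shape (n_samples, n_students, n_classes). The helpers smooth_proba and avg_kl_divergence are hypothetical; the second is the textbook committee disagreement (the mean over students of KL(student || consensus)), not libact's own code.

```python
import numpy as np

def smooth_proba(proba, eps=1e-10):
    """Clip probabilities away from 0 and renormalize each row."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, None)
    return proba / proba.sum(axis=-1, keepdims=True)

def avg_kl_divergence(proba):
    """Mean over students of KL(student || consensus), per sample."""
    # Consensus is the average distribution over the committee.
    consensus = proba.mean(axis=1, keepdims=True)
    kl = np.sum(proba * np.log(proba / consensus), axis=-1)
    return kl.mean(axis=1)

# Made-up hard votes from two students on two samples, three classes.
raw = np.array([
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # students agree
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # students disagree
])
scores = avg_kl_divergence(smooth_proba(raw))
```

With smoothing, both scores are finite: the sample where the students agree scores 0, and the sample where they disagree scores close to log(2). The value of eps is a judgment call; it should be small enough not to distort genuine probabilities.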

I would suggest using https://github.com/gotcha/ipdb to trace the code and find out where the nan values first appear.