maria-yampolskaya / ProjectOAK

Group project for PY 895 Machine Learning for Physicists (at Boston University)

Add one-hot vector compatibility #4

Closed maria-yampolskaya closed 3 years ago

maria-yampolskaya commented 3 years ago

I wrote a small function that converts type names to a one-hot vector (for multi-label encoding, so we can use categorical cross-entropy and scikit-multilearn):

def make_onehot(type1, type2):
  ''' given strings type1 and type2 (type2 can be empty), return a one-hot vector'''
  vec = np.zeros(len(dp.ALLTYPES)-1) # ignore the empty type at the end
  vec[dp.ALLTYPES.index(type1)] = 1
  if type2 != '':
    vec[dp.ALLTYPES.index(type2)] = 1
  return vec
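For anyone reading without the project's DataProcessing module, here is a self-contained sketch of the same idea. The `ALLTYPES` list and the name `make_onehot_demo` are hypothetical stand-ins (the real list is `dp.ALLTYPES`); the only assumption carried over is that the empty type sits at the end of the list.

```python
import numpy as np

# Hypothetical stand-in for dp.ALLTYPES; the real list lives in DataProcessing.py.
# As in the original, the last entry is assumed to be the empty type.
ALLTYPES = ['red', 'green', 'blue', 'yellow', '']

def make_onehot_demo(type1, type2, alltypes=ALLTYPES):
    """Return a 0/1 label vector with a 1 at the index of each given type."""
    vec = np.zeros(len(alltypes) - 1)  # no slot for the empty type
    vec[alltypes.index(type1)] = 1
    if type2 != '':
        vec[alltypes.index(type2)] = 1
    return vec

single = make_onehot_demo('green', '')       # one label set
dual = make_onehot_demo('red', 'yellow')     # two labels set
print(single)
print(dual)
```

Because up to two entries can be 1, each row is a multi-label indicator vector rather than a strict one-hot vector.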

Then I created a list of these labels:

prim_types = csvdata[used_rows2, cc.TYPE1]
second_types = csvdata[used_rows2, cc.TYPE2]

onehot_labels = [make_onehot(t1, t2) for t1, t2 in zip(prim_types, second_types)]

But when I try to create a dataset:

sdd = dp.Full_Dataset(used_images2/255., onehot_labels, serials=identifiers, val_size=0.2, do_scaling=False)

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-10-16d9c0a0ba7c> in <module>()
      1 ## here is a version of the dataset which we scale by 255
      2 identifiers = csvdata[used_rows2][:,[cc.NAME, cc.TYPE1, cc.TYPE2, cc.SERIAL]]
----> 3 sdd = dp.Full_Dataset(used_images2/255., onehot_labels, serials=identifiers, val_size=0.2, do_scaling=False)
      4 
      5 print('train_size: ', sdd.train_size, ', val_size: ', sdd.val_size, ', test_size: ', sdd.test_size, sep='')

1 frames
/content/drive/My Drive/ProjectOAK/ProjectOAK-main/DataProcessing.py in _argsplit_attr(self, attr_name, attr_values)
    256         for key in self.argsplit.keys(): #(for key in ['train', 'val', 'test']):
    257             if self.verbose: print('| setting: {:15s}'.format(key+'_'+attr_name), end='')
--> 258             setattr(self, key+'_'+attr_name, attr_values[self.argsplit[key]])
    259         if self.verbose: print('')
    260 

TypeError: list indices must be integers or slices, not list

It seems that the code that splits the data into training, validation, and test sets is hard-coded to accept only integer labels. Help me add one-hot vector compatibility pls

Sevans711 commented 3 years ago

Working on this now. Woah, how did I not learn about .index until now? What a nice function for lists ... >.>

Sevans711 commented 3 years ago

Resolved. You needed to pass data and labels as numpy arrays. I've changed the code so this is no longer necessary; inputs are now converted to numpy arrays where needed.

i.e. this would have solved the problem: dp.Full_Dataset(used_images2/255., np.array(onehot_labels), serials=identifiers, val_size=0.2, do_scaling=False)
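The failure can be reproduced without the project code. This is a minimal sketch, assuming (as the traceback suggests) that the split stores its indices as a list; `split_indices` is a hypothetical stand-in for `self.argsplit[key]`.

```python
import numpy as np

# Three label vectors held in a plain Python list, like onehot_labels.
labels_list = [np.array([1., 0., 0.]),
               np.array([0., 1., 0.]),
               np.array([0., 1., 1.])]
split_indices = [0, 2]  # hypothetical stand-in for self.argsplit[key]

# A Python list cannot be indexed with a list of positions.
try:
    labels_list[split_indices]
    list_error = None
except TypeError as err:
    list_error = str(err)

# A numpy array supports this (integer array indexing).
labels_array = np.array(labels_list)   # shape (3, 3)
subset = labels_array[split_indices]   # rows 0 and 2, shape (2, 3)

print(list_error)
print(subset.shape)
```

So the split logic was not tied to integer labels; it only needed the labels to be a numpy array rather than a list.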

I used to avoid converting things to numpy arrays inside functions, because by default np.array(x) copies the data even if x is already a numpy array. But I looked up the documentation, and it turns out there's a copy=False flag that skips the copy when x is already an array, e.g.:

x = np.array([1,2,3])
np.array(x, copy=False) is x
>> True
y = [1,2,3]
np.array(y, copy=False) is y
>> False

So now I convert the inputs to numpy arrays inside the class initialization. This is going to be really helpful in a lot of places for me. Implicitly hoping the user only passes numpy arrays (because I don't want to waste memory and time copying the data) has bothered me for a long time. Now I know the fix :)
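A note for readers on newer NumPy: as far as I know, from NumPy 2.0 onward copy=False means "never copy" and raises a ValueError when a copy cannot be avoided, which is the case for a plain list. np.asarray gives the copy-only-if-needed behavior across versions. Here is a minimal sketch of that approach; `TinyDataset` is a hypothetical stand-in and not the actual `Full_Dataset` code.

```python
import numpy as np

class TinyDataset:
    """Hypothetical stand-in for Full_Dataset; only shows the input conversion."""

    def __init__(self, data, labels):
        # np.asarray returns the same object when given an ndarray (no copy)
        # and builds a new array when given a list or other sequence.
        self.data = np.asarray(data)
        self.labels = np.asarray(labels)

x = np.array([1, 2, 3])          # already an array: should not be copied
y = [[1, 0], [0, 1], [1, 1]]     # list of label vectors: must be converted
ds = TinyDataset(x, y)

print(ds.data is x)
print(type(ds.labels), ds.labels.shape)
```

With this pattern the caller can pass either lists or arrays, and arrays are not duplicated in memory.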