raamana / pyradigm

Research data management in biomedical and machine learning applications
http://raamana.github.io/pyradigm/
MIT License

add_samplet: feature_names allows dimension mismatch, order isn't paired -- will overwrite #45

Open WillForan opened 3 years ago

WillForan commented 3 years ago

I had a few bugs (using wrong variable name), and realized I never got yelled at for providing bad feature names.

A few observations:

1) The number of feature names doesn't have to match the number of features.

there can be too many (x, y, z and an additional "DNE" name)

ds = RegrDataset()
ds.description = "extra feature names"
ds.add_samplet('id1', target=100, features=[1,2,3], feature_names=['x','y','z'])
ds.add_samplet('id2', target=200, features=[4,5,6], feature_names=['x','y','z','DNE'])
(x, _, _) = ds.data_and_targets()
print(ds.feature_names)
print(x)

['x' 'y' 'z' 'DNE']
[[1. 2. 3.]
 [4. 5. 6.]]

or too few (only x, but have x, y, and z)

ds = RegrDataset()
ds.description = "too few feature names"
ds.add_samplet('id1', target=100, features=[1,2,3], feature_names=['x'])
ds.add_samplet('id2', target=200, features=[6,5,4], feature_names=['x'])
[x, _, _] = ds.data_and_targets()
print(ds.feature_names)
print(x)

['x']
[[1. 2. 3.]
 [6. 5. 4.]]
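
A minimal sketch of the kind of length check that would catch both cases above. `check_feature_names` is a hypothetical helper written for illustration; it is not part of pyradigm, and this is only one way such a check could look:

```python
def check_feature_names(features, feature_names):
    """Hypothetical helper (not pyradigm API): require one name per feature value."""
    if len(feature_names) != len(features):
        raise ValueError(
            f"got {len(feature_names)} feature names "
            f"for {len(features)} features"
        )
    return list(feature_names)


# One name per value passes.
check_feature_names([1, 2, 3], ['x', 'y', 'z'])

# The too-many and too-few cases are both rejected.
for bad_names in (['x', 'y', 'z', 'DNE'], ['x']):
    try:
        check_feature_names([4, 5, 6], bad_names)
    except ValueError as err:
        print(err)
```

Raising an error rather than silently truncating or padding is an assumption about the desired behavior; a warning would be another option.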

2) specifying feature names for one samplet changes names everywhere?

ds = RegrDataset()
ds.description = "feature names changed by a later samplet"
ds.add_samplet('id1', target=100, features=[1,2,3], feature_names=['x','y','z'])
ds.add_samplet('id2', target=200, features=[4,5,6], feature_names=['y','y','z'])
[x, _, _] = ds.data_and_targets()
print(ds.feature_names)
print(x)

['y' 'y' 'z']
[[1. 2. 3.]
 [4. 5. 6.]]

This is potentially surprising when features are given to add_samplet in a different order -- even if features and feature_names are paired correctly: the values stay in the order given, but the names from the last call replace the earlier ones (@raamana -- a thing you warned me to check. good eye!)

ds = RegrDataset()
ds.description = "feature names in a different order"
ds.add_samplet('id1', target=100, features=[1,2,3], feature_names=['x','y','z'])
ds.add_samplet('id2', target=200, features=[6,5,4], feature_names=['z','y','x'])
[x, _, _] = ds.data_and_targets()
print(ds.feature_names)
print(x)

['z' 'y' 'x']
[[1. 2. 3.]
 [6. 5. 4.]]
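
One possible way to handle the reordered case is to pair each value with its name and then reorder the values to follow the names already stored in the dataset. The sketch below assumes that behavior is wanted; `align_to_existing` is a hypothetical helper, not pyradigm's API:

```python
def align_to_existing(features, feature_names, existing_names):
    """Hypothetical helper: reorder features to follow existing_names.

    Rejects length mismatches, duplicate names, and names that differ
    from the ones already stored.
    """
    if len(features) != len(feature_names):
        raise ValueError("need exactly one name per feature value")
    if len(set(feature_names)) != len(feature_names):
        raise ValueError(f"duplicate feature names: {list(feature_names)}")
    if set(feature_names) != set(existing_names):
        raise ValueError(
            f"names {list(feature_names)} do not match {list(existing_names)}"
        )
    by_name = dict(zip(feature_names, features))
    return [by_name[name] for name in existing_names]


existing = ['x', 'y', 'z']  # names stored from the first samplet

# Correctly paired but reordered input is put back into x, y, z order.
print(align_to_existing([6, 5, 4], ['z', 'y', 'x'], existing))  # [4, 5, 6]

# The ['y', 'y', 'z'] case from observation 2 is rejected instead of overwriting.
try:
    align_to_existing([4, 5, 6], ['y', 'y', 'z'], existing)
except ValueError as err:
    print(err)
```

With a check like this, the second samplet in the last example would be stored as [4. 5. 6.] under ['x' 'y' 'z'], so names and columns stay paired across samplets.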

raamana commented 3 years ago

Thanks a lot, Will, for putting pyradigm to the test and reporting these bugs!

Let me look into them and see why they happened. But these bugs hopefully haven't prevented you from running comparisons? I am on Zoom and we can discuss this more if you want -- and to prepare for the "progress report", so to say.