hugobuddel / orange3

Orange fork to add data pulling

Unique instance identifiers #10

Open hugobuddel opened 10 years ago

hugobuddel commented 10 years ago

It would be nice if Instances could be identified by a unique identifier, in addition to the row index used for RowInstances. See also https://github.com/biolab/orange3/issues/54 .

Living datasets often have a unique identifier for each instance in the data, but they frequently have no logical order and therefore no intrinsic row index. Furthermore, the instances of LazyTables created by e.g. the SelectData widget are currently not linked to the instances in the original table. Unique identifiers would make it possible to establish that link.
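As a rough illustration of the linking idea, here is a minimal sketch in plain Python. It does not use the LazyTable or SelectData APIs; the dictionary layout and the `star-00x` identifiers are made up for the example.

```python
# Original table keyed by identifier (hypothetical identifiers and values).
original = {
    "star-001": {"magnitude": 12.4, "colour": 0.61},
    "star-002": {"magnitude": 15.9, "colour": 1.12},
    "star-003": {"magnitude": 13.1, "colour": 0.48},
}

# A selection keeps the identifiers, so it does not depend on any row order.
selected_ids = [key for key, row in original.items() if row["magnitude"] < 14]

# Each selected instance can be traced back to its source row by identifier.
linked = {key: original[key] for key in selected_ids}

print(selected_ids)  # ['star-001', 'star-003']
```

Because the selection carries identifiers rather than row positions, it stays valid even if the original table is reordered or grows.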

These unique identifiers should be treated as strings rather than numbers, because the instances might have no natural ordering. Astronomical instance identifiers, for example, often aren't numeric.

DavidWilliams81 commented 10 years ago

Two things spring to mind:

  1. A value based on the Instance's location in memory (like a pointer). Can we retrieve this in Python? I've seen addresses printed out when examining an object.
  2. A hash of the contents of the row.

Obviously an important difference is how the two approaches would handle two different instances with the same data: the hash would consider them equivalent, while the addresses would be different. I believe Python doesn't have standard pass-by-value/reference semantics; as far as I know it passes references to objects by value (sometimes called 'call by sharing'). This is worth keeping in mind.
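To make the difference concrete, here is a minimal sketch in plain Python, with no Orange classes involved; `row_a` and `row_b` are made-up rows. The built-in `id()` returns an object's identity (in CPython this happens to be its memory address), while `hash()` of a tuple depends only on its contents.

```python
# Two distinct row objects holding identical data (hypothetical values).
row_a = [5.1, 3.5, "setosa"]
row_b = [5.1, 3.5, "setosa"]

# Identity-based: id() is unique among objects that are alive at the same time.
# In CPython it is the memory address, but that is an implementation detail.
same_identity = id(row_a) == id(row_b)

# Content-based: hashing the contents treats equal rows as equivalent.
same_content_hash = hash(tuple(row_a)) == hash(tuple(row_b))

print(same_identity)      # False
print(same_content_hash)  # True
```

Neither value is stable across interpreter runs: an object's `id()` changes from run to run, and string hashes are randomised per process unless `PYTHONHASHSEED` is fixed.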

I'll head over to you in half an hour and we can discuss this.

hugobuddel commented 10 years ago

I was going for a more conceptual identifier: a unique 'name' for the instance, if you will. One that can, for example, also be used between applications or even between humans, e.g. as written in papers. So what I'm after cannot be a pointer to memory. Also, this identifier should identify the object irrespective of the features currently attached to it. For example, it would allow Orange to request more features for a particular instance. A hash of the row therefore cannot be used either, because not all possible features are known up front. Perhaps the word 'instance' was badly chosen here? See also the sections about accessing a specific identifier here: https://github.com/hugobuddel/orange3/wiki/Accessing-Data
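A minimal sketch of what such a name-based identifier could look like, assuming a hypothetical `NamedInstance` class (not an existing Orange class) and a made-up identifier string. Equality and hashing depend only on the identifier, so the object stays the same object as features are attached.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class NamedInstance:
    """Hypothetical stand-in for an instance; not part of Orange."""
    identifier: str                               # conceptual name of the object
    features: dict = field(default_factory=dict)  # whatever is known so far

    def __eq__(self, other):
        # Identity is decided by the name alone, never by the features.
        return isinstance(other, NamedInstance) and self.identifier == other.identifier

    def __hash__(self):
        return hash(self.identifier)


sparse = NamedInstance("SRC-000123", {"magnitude": 14.2})
richer = NamedInstance("SRC-000123", {"magnitude": 14.2, "redshift": 0.031})

print(sparse == richer)       # True
print(len({sparse, richer}))  # 1
```

The same identifier string could be exchanged with another application or quoted in a paper, which a memory address or a row hash cannot offer.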

hugobuddel commented 10 years ago

Relatedly, perhaps it would be possible to add a 'random' index to each instance/object as well. That is, assign a deterministic pseudo-random identifier to each instance that can be used for slicing the data, ordering it randomly, etc. To be fully deterministic the random index should also be unique, or unique for all practical purposes; such a random index could therefore also fulfill the role of the identifier described in this issue.
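One possible way to get such an index, sketched under the assumption that every instance already has a string name: hash the name with a cryptographic hash and read part of the digest as an integer. The `random_index` helper and the `obj-*` names are hypothetical; only `hashlib` from the standard library is used.

```python
import hashlib


def random_index(identifier: str) -> int:
    """Map an identifier to a deterministic 64-bit pseudo-random integer."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # Keep the first 8 bytes; use more bytes if collisions are a concern.
    return int.from_bytes(digest[:8], "big")


identifiers = ["obj-a", "obj-b", "obj-c", "obj-d"]  # made-up names

# Random-looking but reproducible ordering.
shuffled = sorted(identifiers, key=random_index)

# Reproducible subsample: keeps roughly half of the instances on average.
sample = [name for name in identifiers if random_index(name) % 2 == 0]

print(shuffled)
print(sample)
```

Unlike Python's built-in `hash()` for strings, the SHA-256 digest does not change between processes, so the ordering and the subsample are the same on every run. With 64 bits the index is unique only for practical purposes; very large catalogues would call for a longer slice of the digest.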