seung-lab / cloud-volume

Read and write Neuroglancer datasets programmatically.
https://twitter.com/thundercloudvol
BSD 3-Clause "New" or "Revised" License
131 stars 47 forks source link

Relax uint64 casting of root ids #586

Open ceesem opened 1 year ago

ceesem commented 1 year ago

Cloud-volume currently (technically correctly) casts root ids as uint64, but numpy causes all sorts of issues when operating on a mixture of uint64 and much more common int64 values, producing false equalities and mangling numbers by silent conversion via float types (see https://github.com/numpy/numpy/issues/12525). However, it is our understanding that graphene will not realistically have root ids that occupy the needed additional space offered by uint64. It appears that the casting as uint64 is hard-coded e.g. https://github.com/seung-lab/cloud-volume/blob/a19c67372a07b0d267f8626ebe2847f64134d7bd/cloudvolume/frontends/graphene.py#L730, not based on a configuration variable that we could set to assure the client that int64 is just fine.

While this is fundamentally a numpy issue, it's also a fairly common and particularly confusing point of failure in using the data acquired by cloud-volume. What do you think about relaxing the uint64 casting in the case that ids are not in the uint64-only space, or else making it a configuration that could be used by cloud-volume to assert that this segmentation source is okay with ids being cast down to int64.

jasper-tms commented 1 year ago

Thanks for writing up the issue @ceesem.

Just copying a thing or two from our slack thread in case it's helpful:

CAVEclient returns dataframes where root_id values are stored as np.int64, whereas cloudvolume.get_roots returns root_ids as np.uint64 (note the u). This caused me to run head first into this absolutely disgusting behavior where different root_ids evaluate as being equal:

np.int64(648518346489760473) == np.uint64(648518346489760460)
True

Now that I know about this I can put checks in my code to make sure this doesn’t happen, but leaving this sort of check to the user is lame because a lot of users aren’t going to know to check for this. Perhaps a cleaner solution is to try to get all connectomics packages to agree to either use np.int64 or np.uint64 for root IDs. Has this been discussed before, and/or does anyone have proposals for how to prevent other users from running into this if we can’t standardize the dtype for root IDs?

william-silversmith commented 1 year ago

So here's my evaluation so far. I see what you guys are saying, int64 is less likely to run into user errors than uint64. Although, if another process produces uint64s, returning int64 will run into the same issue.