Relax uint64 casting of root ids

ceesem commented 1 year ago

Cloud-volume currently (technically correctly) casts root ids as uint64, but numpy causes all sorts of issues when operating on a mixture of uint64 and much more common int64 values, producing false equalities and mangling numbers by silent conversion via float types (see https://github.com/numpy/numpy/issues/12525). However, it is our understanding that graphene will not realistically have root ids that occupy the needed additional space offered by uint64. It appears that the casting as uint64 is hard-coded e.g. https://github.com/seung-lab/cloud-volume/blob/a19c67372a07b0d267f8626ebe2847f64134d7bd/cloudvolume/frontends/graphene.py#L730, not based on a configuration variable that we could set to assure the client that int64 is just fine.

While this is fundamentally a numpy issue, it's also a fairly common and particularly confusing point of failure in using the data acquired by cloud-volume. What do you think about relaxing the uint64 casting in the case that ids are not in the uint64-only space, or else making it a configuration that could be used by cloud-volume to assert that this segmentation source is okay with ids being cast down to int64.

jasper-tms commented 1 year ago

Thanks for writing up the issue @ceesem.

Just copying a thing or two from our slack thread in case it's helpful:

CAVEclient returns dataframes where root_id values are stored as np.int64, whereas cloudvolume.get_roots returns root_ids as np.uint64 (note the u). This caused me to run head first into this absolutely disgusting behavior where different root_ids evaluate as being equal:
np.int64(648518346489760473) == np.uint64(648518346489760460)
True
Now that I know about this I can put checks in my code to make sure this doesn’t happen, but leaving this sort of check to the user is lame because a lot of users aren’t going to know to check for this. Perhaps a cleaner solution is to try to get all connectomics packages to agree to either use np.int64 or np.uint64 for root IDs. Has this been discussed before, and/or does anyone have proposals for how to prevent other users from running into this if we can’t standardize the dtype for root IDs?

william-silversmith commented 1 year ago

So here's my evaluation so far. I see what you guys are saying, int64 is less likely to run into user errors than uint64. Although, if another process produces uint64s, returning int64 will run into the same issue.

The underlying segmentation is uint64. Several segmentation compression (compressed_segmentation, compresso, crackle) types have partial or no support at this time for signed integers. I would not advise changing the info dtype to signed as there may be cascading effects. Most of my segmentation handling libraries (think cc3d, kimimaro, zmesh, fastremap, etc) are designed first for unsigned ints. Signed ints are sometimes handled. I can't offhand remember which ones do it right.
Isn't this only safe for root IDs? If stop_layer is specified, couldn't it be a legit uint64? If the server provides information on where this is safe to do, that could work, but do you want the same function returning int64 sometimes and uint64 other times? I am basing my evaluation on this article: https://github.com/seung-lab/cloud-volume/wiki/Graphene
Are you guys sure that you will never ever want to use that top bit?
Adding something to the info file would be workable per server, but the logic will have to be maintained indefinitely for backwards compatibility.
CAVE could have an int64 interface by using arr.view(np.int64), though you'd need to adjust each call site. It would then interact badly with CloudVolume when CloudVolume is used independently.
Other CloudVolume calls (such as getting watershed) will continue to return uint64 (unless the dtype changes), which could provide bad interactions.

seung-lab / cloud-volume

Relax uint64 casting of root ids #586