lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License
899 stars 105 forks source link

explicitly pass module name to namedtuple declarations #190

Closed aaronknister closed 1 month ago

aaronknister commented 2 years ago

this was suggested by @kmurphy4 on github.com/lmcinnes/umap/issues/477 as a work around to pyspark hijacking collection.namedtuple

jamestwebber commented 2 years ago

I don't love to add this just because someone else has a bug, but I guess it's pretty harmless (and prevents other libraries from introducing the same issue, as unlikely as that may be).

Have you confirmed that this fixes the issue with pyspark?

lmcinnes commented 2 years ago

If it can fix a pyspark issue that's definitely a good thing since it is a relatively small change. I would want to ensure that this all plays nice with numba though, since numba support for named-tuples is performance critical in some places, and may not be happy with all of this (hence the desire to check). All the tests have passed, so that's deifnitely a good sign that numba is failing on the extra keyword argument. Just to be sure, however, I would really appreciate a quick performance benchmark comparison (which the tests don't check) -- so index build time comparisons, and maybe if you have time a comparison of with and without this change on, say, the fashion-mnist ann-benchmarks, or something equivalent that decently exercises the code and records performance measures.

kmurphy4 commented 1 month ago

FWIW the underlying pyspark issue was fixed in v3.3.0: https://spark.apache.org/releases/spark-release-3-3-0.html#other-notable-changes-2.