rapidsai / cuspatial

CUDA-accelerated GIS and spatiotemporal algorithms
https://docs.rapids.ai/api/cuspatial/stable/
Apache License 2.0
616 stars 154 forks source link

Proposed New Layout of the Python Library #644

Closed isVoid closed 2 years ago

isVoid commented 2 years ago

Motivation

Currently cuspatial's file structure is mostly flat, with two modules whose name have overlapping meanings: core and geometry. We should refactor the structure of the library so that the physical layout of the folder structure corresponds to the extent of the library, as described in the leading section of docs. This gives developers a clearer idea what's within the problem boundary of cuspatial, and where to dedicate their contributions to.

Library Design

In general, cuspatial separates compute API and geodataframe APIs. The compute APIs are broadly organized by the data type it operates on (Spatial or Trajectory) and the computation type (generic, indexing, join, filter). We should avoid using general names for the module like generic etc. Developers should strive to find the corresponding subcategory of the compute API, or create a new ones should it does not exist.

Functional v.s. OOP design

cuspatial endorses both functional and OOP design of APIs. GeoSeries and GeoDataframe is the core object the user will interact their geometric/geodesic data with. These APIs will strive to achieve function parity with geopandas. Beyond that, functional APIs supports a superset of functions that cuspatial supports, which may require to interact with custom objects within its own problem domain, such as CubicSpline and quad tree indexing table.

Internal v.s. external

Column level implementation and geoarrow specifications of geoseries is internal API.

Breaking v.s. Non-breaking

Most refactor that goes on here should be non-breaking change as this only affects internal library design. If desired, we may enforce the same user view of cuspatial APIs as the library design in the future.

The below are the proposed new layout:

harrism commented 2 years ago

I don't like "generic". What are the functions that fall within generic? Perhaps we can come up with some categories.

Why doesn't each function get it's own .py file like we do in C++? If that's just "Pythonic", let me know.

isVoid commented 2 years ago

Because "pythonic" :). But that would be an over-simplification. Python files are actually modules. In c++ analogy, it's a combination of header files, namespaces and preprocessing macros (and probably more). A method func defined in func.py will need to be accessed via func.func(). Some preprocessing may be done when python interpreter load the module and inject functions to objects at runtime, like: https://github.com/rapidsai/cudf/blob/c19c8c9406262f16addf2080096338e8fe4a2ba9/python/cudf/cudf/core/dataframe.py#L6871-L6893. These are the reasons why we want to group methods of similar types in the same module.

cc python guru @shwina

harrism commented 2 years ago

So how does this affect how one imports, e.g. derive_trajectories()?

Does it become import cuspatial.core.trajectory.generic.derive_trajectories?

Why not eliminate "core" and "generic" from this so we can just have import cuspatial.trajectory.derive_trajectories?

isVoid commented 2 years ago

A folder is a python package. Methods in the modules can be "exported" by the package by defining them in __init.py__. To write import cuspatial.trajectory.derive_trajectories, you would just need to from .generic import derive_trajectories in trajectory.__init__.py.

isVoid commented 2 years ago

I don't like "generic".

Me neither, for concrete example, there are these methods in spatial.generic.py

def directed_hausdorff_distance(xs, ys, space_offsets):
def haversine_distance(p1_lon, p1_lat, p2_lon, p2_lat):
def lonlat_to_cartesian(origin_lon, origin_lat, input_lon, input_lat):
def polygon_bounding_boxes(poly_offsets, ring_offsets, xs, ys):
def polyline_bounding_boxes(poly_offsets, xs, ys, expansion_radius):
def pairwise_linestring_distance(offsets1, xs1, ys1, offsets2, xs2, ys2):

I think they generally fall into distance, projection and bounding_box categories. But these seems like a sub category when compared to indexing and join. I can definitely be persuaded other ways.

harrism commented 2 years ago

I think distance, projection, bounding, indexing and joining are all on the same footing and should be at the same level.

"joining" is a bit weird, but it grammatically matches. Alternative is infinitive verb form measure, project, bound, index, join.

isVoid commented 2 years ago

In our meeting we decided that interpolate module should be refactored as generic interpolate method for both spatial and time series. And that trajectory package should be removed.

harrism commented 2 years ago

I think we decided that interpolate should be independent of trajectory. I did not hear the part where trajectory should be removed -- there's still useful functionality for trajectories (and RFEs for more).

isVoid commented 2 years ago

I did not hear the part where trajectory should be removed -- there's still useful functionality for trajectories (and RFEs for more).

Only the package - trajectory/ is removed. The functions are "flattened" into trajectory.py module.

harrism commented 2 years ago

I am now having second thoughts on my "infinitive" suggestion, mostly because "measure" is vague where "distance" was quite clear. You can measure a lot of quantities besides distance. I think the original names were better -- do you feel strongly about one or the other? Very sorry for flipping.

isVoid commented 2 years ago

do you feel strongly about one or the other

Not really. I agree that distance is clearer as it's frequently used in literature. I also think it's ok to have exception for joining to be kept in verb form join because people are familiar with it's usage.

harrism commented 2 years ago

Then I would just go with distance, projection, bounding, indexing, join. (the original)