kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.51k stars 658 forks

spatial dataset training functions #2141

Open Jo316 opened 3 weeks ago

Jo316 commented 3 weeks ago

What you would like to be added?

I would like to request the addition of functions to the Training Operator for training models with spatial (geographical) datasets. These functions should enable seamless integration and processing of geographical data, leveraging state-of-the-art algorithms to enhance model accuracy and applicability in spatial contexts.

One potential reference is the R package CAST, which provides robust functions for training models with geographical data using random forest. The package offers a comprehensive approach to handling spatial data, including considerations for the Area of Applicability.

Functions ranked by importance/need (https://hannameyer.github.io/CAST/reference/index.html):

Why is this needed?

The integration of spatial dataset training functions will significantly enhance the Training Operator's capabilities, particularly for users working with geographical data. It will allow for more accurate and relevant model training in fields such as environmental science, urban planning, and geospatial analysis.

By incorporating these functions, the Training Operator will support a wider range of use cases and applications, making it a more versatile and powerful tool for data scientists and researchers.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

andreyvelich commented 2 weeks ago

Thank you for creating this @Jo316!

Could you please explain what specific functionality you are looking for from Training Operator to support training models with spatial datasets? Do you need distributed capabilities, and do you want to use the Training Operator controller to orchestrate the appropriate resources on Kubernetes?

As long as you can build a container image from the training script that uses your geographical datasets, you can run it with Training Operator.
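To make the container-based approach concrete, here is a minimal sketch that builds a PyTorchJob manifest for such an image. It assumes the script is a PyTorch training script already packaged into an image; the image name `registry.example.com/spatial-train:latest`, the script path `/workspace/train.py`, and the helper `build_pytorchjob` are hypothetical placeholders, not part of Training Operator.

```python
import json


def build_pytorchjob(name, image, command, workers=2, namespace="default"):
    """Return a PyTorchJob manifest (kubeflow.org/v1) as a plain dict.

    Hypothetical helper for illustration; not part of any Kubeflow SDK.
    """

    def replica(replicas):
        # Every replica runs the same image and command.
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            # PyTorchJob expects the training container
                            # to be named "pytorch".
                            "name": "pytorch",
                            "image": image,
                            "command": command,
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


manifest = build_pytorchjob(
    name="spatial-train",
    image="registry.example.com/spatial-train:latest",  # hypothetical image
    command=["python", "/workspace/train.py"],  # hypothetical script path
)

# kubectl accepts JSON as well as YAML manifests.
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` should create the job on a cluster where Training Operator is installed. Any spatial-specific logic, such as spatial cross-validation, would live inside the training script itself.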