⚠️ WARNING: sknnr is in active development! ⚠️
sknnr is a package for running k-nearest neighbor (kNN) imputation[^imputation] methods using estimators that are fully compatible with scikit-learn. Notably, common methods such as most similar neighbor (MSN; Moeur & Stage, 1995), gradient nearest neighbor (GNN; Ohmann & Gregory, 2002), and random forest nearest neighbors[^rfnn] (RFNN; Crookston & Finley, 2008) are included in this package.
Features include:

- estimators that are fully compatible with the scikit-learn API
- support for pandas dataframes

The name sknnr is an acronym of its three main components:
1. **s** is for scikit-learn. All estimators in this package derive from the `sklearn.base.BaseEstimator` class and comply with the requirements associated with developing custom estimators.
2. **knn** is for k-nearest neighbors.
3. **r** is for regression.

To get started:

1. Install sknnr.
2. Import any sknnr estimator, like `MSNRegressor`, as a drop-in replacement for a scikit-learn regressor.

```python
from sknnr import MSNRegressor

est = MSNRegressor()
```
3. Load a custom dataset like [SWO Ecoplot](https://sknnr.readthedocs.io/en/latest/api/datasets/swo_ecoplot) (or bring your own).
```python
from sknnr.datasets import load_swo_ecoplot
X, y = load_swo_ecoplot(return_X_y=True, as_frame=True)
```

4. Train and score the estimator as usual.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

est = est.fit(X_train, y_train)
est.score(X_test, y_test)
```
5. Check out the additional features like [independent scoring](https://sknnr.readthedocs.io/en/latest/usage/#independent-scores-and-predictions), [dataframe indexing](https://sknnr.readthedocs.io/en/latest/usage/#retrieving-dataframe-indexes), and [dimensionality reduction](https://sknnr.readthedocs.io/en/latest/usage/#dimensionality-reduction).
```python
# Evaluate the model using the second-nearest neighbor in the test set
print(est.fit(X, y).independent_score_)
# Get the dataframe index of the nearest neighbor to each plot
print(est.kneighbors(return_dataframe_index=True, return_distance=False))
# Apply dimensionality reduction using CCorA ordination
MSNRegressor(n_components=3).fit(X_train, y_train)
```
sknnr was heavily inspired by (and endeavors to implement the functionality of) the yaImpute package for R (Crookston & Finley, 2008). As Crookston and Finley (2008) note in their abstract:
> Although nearest neighbor imputation is used in a host of disciplines, the methods implemented in the yaImpute package are tailored to imputation-based forest attribute estimation and mapping ... [there is] a growing interest in nearest neighbor imputation methods for spatially explicit forest inventory, and a need within this research community for software that facilitates comparison among different nearest neighbor search algorithms and subsequent imputation techniques.
Indeed, many regional (e.g. LEMMA) and national (e.g. BIGMAP, TreeMap) projects use nearest-neighbor methods to estimate and map forest attributes across time and space.
To that end, sknnr ports and expands the functionality present in yaImpute into a Python package that facilitates intercomparison between k-nearest neighbor methods (and other built-in estimators from scikit-learn) using an API that is familiar to scikit-learn users.
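Because every estimator follows the scikit-learn API, an intercomparison reduces to a loop over interchangeable estimators. The sketch below uses only scikit-learn's built-in `KNeighborsRegressor` and `load_linnerud` dataset so it runs without sknnr installed; any sknnr estimator, such as `MSNRegressor`, could be swapped into the same loop unchanged.

```python
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# A small multi-output regression dataset: 3 exercise features
# predicting 3 physiological targets for 20 subjects.
X, y = load_linnerud(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Comparing estimators (or neighbor counts) is a simple loop because
# every estimator exposes the same fit/score interface.
for k in (1, 3, 5):
    est = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(k, round(est.score(X_test, y_test), 3))
```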
Thanks to Andrew Hudak (USDA Forest Service Rocky Mountain Research Station) for the inclusion of the Moscow Mountain / St. Joes dataset (Hudak 2010), and the USDA Forest Service Region 6 Ecology Team for the inclusion of the SWO Ecoplot dataset (Atzet et al., 1996). Development of this package was funded by:
[^imputation]: In a mapping context, kNN imputation refers to predicting feature values for a target from its k-nearest neighbors, and should not be confused with the usual scikit-learn usage as a pre-filling strategy for missing input data, e.g. `KNNImputer`.
[^rfnn]: In development!
[^validation]: All estimators and parameters with equivalent functionality in yaImpute are tested against the R package to 3 decimal places.
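To make the imputation footnote's distinction concrete, here is a minimal sketch using only scikit-learn: `KNNImputer` pre-fills missing values *within the input features*, whereas a kNN regressor (the sense used throughout this package) predicts *target* attributes for a sample from its nearest neighbors.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsRegressor

# scikit-learn's usual sense: pre-fill missing *inputs* from nearest neighbors.
X_missing = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0]])
X_filled = KNNImputer(n_neighbors=2).fit_transform(X_missing)
print(X_filled)  # the NaN becomes the mean of its 2 nearest neighbors (2.0)

# The sense used here: predict *target* values from nearest neighbors.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 10.0, 20.0, 30.0])
est = KNeighborsRegressor(n_neighbors=2).fit(X, y)
print(est.predict([[1.6]]))  # mean of the targets for X=1.0 and X=2.0 -> 15.0
```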