opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

Fix KNN module to avoid invalid characters to be included as a part of file name. #1934

Closed 0ctopus13prime closed 3 months ago

0ctopus13prime commented 3 months ago

Description

Issue : https://github.com/opensearch-project/k-NN/issues/1859.

Issue

OpenSearch allows a field name to contain a space, but it does not allow a space in a physical file name. KNNCodecUtil::buildEngineFileName uses the field name directly as part of the vector file name, without any escaping. As a result, a field name containing a character that is invalid in a file name produces a file name that fails validation in BlobStoreIndexShardSnapshot. For example, the field name 'my vector' yields _0_2011_my vector.hnswc, and BlobStoreIndexShardSnapshot throws an exception complaining that the file name is not valid.

Solution

In KNNCodecUtil, we need to add a method that escapes invalid characters in the field name when creating the file name. In this proposal, I suggest replacing each invalid character with '.'. For example, _0_2011_my vector.hnswc → _0_2011_my.vector.hnswc.

A field name itself does not contain ".", because a dot separates more than one field name. For example, a.b.c indicates three fields: a, b, and c.

AS-IS:

public static String buildEngineFileSuffix(String fieldName, String extension) {
    return String.format("_%s%s", fieldName, extension);
}

TO-BE:

public static String buildEngineFileSuffix(String fieldName, String extension) {
    return String.format("_%s%s", escapeSpaceInFieldName(fieldName), extension);
}

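// Replaces every character that is invalid in a file name with '.'.
// Despite the method name, this covers all of Strings.INVALID_FILENAME_CHARS, not only spaces.
private static String escapeSpaceInFieldName(String fieldName) {
    char[] characters = fieldName.toCharArray();
    for (int i = 0; i < characters.length; ++i) {
        if (Strings.INVALID_FILENAME_CHARS.contains(characters[i])) {
            characters[i] = '.';  // Overwrite with the escape character.
        }
    }
    return new String(characters);
}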
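
For anyone who wants to try the proposed escaping outside the plugin, here is a minimal standalone sketch. The class name FieldNameEscaper and its local INVALID_FILENAME_CHARS set are assumptions made for illustration: the set is a stand-in for OpenSearch's Strings.INVALID_FILENAME_CHARS, and its exact contents should be checked against the OpenSearch source.

```java
import java.util.Set;

// Hypothetical standalone class; not part of the k-NN plugin.
final class FieldNameEscaper {
    // Assumption: a local stand-in for OpenSearch's Strings.INVALID_FILENAME_CHARS.
    static final Set<Character> INVALID_FILENAME_CHARS =
        Set.of('\\', '/', '*', '?', '"', '<', '>', '|', ' ', ',');

    // The escape character proposed in this PR.
    static final char ESCAPE_CHAR = '.';

    // Replaces each invalid character in the field name with the escape character.
    static String escape(String fieldName) {
        char[] characters = fieldName.toCharArray();
        for (int i = 0; i < characters.length; i++) {
            if (INVALID_FILENAME_CHARS.contains(characters[i])) {
                characters[i] = ESCAPE_CHAR;
            }
        }
        return new String(characters);
    }

    // Mirrors the TO-BE method above: underscore, escaped field name, extension.
    static String buildEngineFileSuffix(String fieldName, String extension) {
        return String.format("_%s%s", escape(fieldName), extension);
    }

    public static void main(String[] args) {
        // Prints "_my.vector.hnswc"
        System.out.println(buildEngineFileSuffix("my vector", ".hnswc"));
        // Field names without invalid characters pass through unchanged: "_my_vector.hnswc"
        System.out.println(buildEngineFileSuffix("my_vector", ".hnswc"));
    }
}
```

One property of this scheme worth noting: under the assumed character set, distinct field names such as "my vector" and "my,vector" both map to "my.vector", so the escaping is not reversible.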

Related Issues

Issue : https://github.com/opensearch-project/k-NN/issues/1859.

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

navneet1v commented 3 months ago

@0ctopus13prime please fix the conflicts and the DCO check.

0ctopus13prime commented 3 months ago

I will discard this PR and raise a new one that prohibits all disallowed characters in any vector field name.

0ctopus13prime commented 3 months ago

Raised a new PR that entirely blocks characters that are invalid in a physical file name:

https://github.com/opensearch-project/k-NN/pull/1936
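
As a rough illustration of the "block instead of escape" idea, here is a minimal sketch of a validator that rejects a vector field name up front. It is not the code from the PR linked above. The class VectorFieldNameValidator, its error message, and the local character set (a stand-in for OpenSearch's Strings.INVALID_FILENAME_CHARS) are all assumptions.

```java
import java.util.Set;

// Hypothetical validator; names and message format are illustrative only.
final class VectorFieldNameValidator {
    // Assumption: a local stand-in for OpenSearch's Strings.INVALID_FILENAME_CHARS.
    static final Set<Character> INVALID_FILENAME_CHARS =
        Set.of('\\', '/', '*', '?', '"', '<', '>', '|', ' ', ',');

    // Throws if the field name contains a character that is invalid in a file name.
    static void validate(String fieldName) {
        for (int i = 0; i < fieldName.length(); i++) {
            char c = fieldName.charAt(i);
            if (INVALID_FILENAME_CHARS.contains(c)) {
                throw new IllegalArgumentException(String.format(
                    "Vector field name [%s] contains invalid character [%c] at index %d",
                    fieldName, c, i));
            }
        }
    }

    public static void main(String[] args) {
        validate("my_vector"); // Accepted: no invalid characters.
        try {
            validate("my vector"); // Rejected: contains a space.
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Rejecting the name when the field is defined means the file name never needs to be rewritten, so the field name and the on-disk name stay identical.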