ability to specify indexing options with a file

jonburdo commented 2 years ago

We should add an option (--spec-file or something) which takes a filepath (any local or remote urlpath) for a single file containing a list of urlpaths to index, and optionally some other information (in yaml or toml maybe), such as:

- a prefix, so the urlpaths can be relative to it instead of absolute
- the indexing format

This way a user could index a specific set of files instead of an entire directory. It would also make it easy to index files hosted at HTTP endpoints:

ldb index --spec-file https://remote.ldb.ai/path/to/spec-file.yaml
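
A minimal sketch of how such a spec file might be resolved, in Python. Everything about the layout is an assumption for illustration: the keys `prefix`, `format`, and `paths`, the `IndexSpec` and `parse_spec` names, and the use of JSON instead of yaml or toml (chosen only to keep the example stdlib-only). None of this is existing ldb behavior.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexSpec:
    """Hypothetical in-memory form of a spec file."""
    paths: List[str]
    fmt: Optional[str] = None


def is_absolute(urlpath: str) -> bool:
    # Treat anything with a scheme (https://, s3://, ...) or a leading
    # slash as absolute; everything else is relative to the prefix.
    return "://" in urlpath or urlpath.startswith("/")


def parse_spec(text: str) -> IndexSpec:
    raw = json.loads(text)
    # A bare list is accepted as the simplest spec: just the urlpaths.
    if isinstance(raw, list):
        return IndexSpec(paths=[str(p) for p in raw])
    prefix = raw.get("prefix", "")
    paths = []
    for p in raw["paths"]:
        if prefix and not is_absolute(p):
            p = prefix.rstrip("/") + "/" + p
        paths.append(p)
    return IndexSpec(paths=paths, fmt=raw.get("format"))
```

With a prefix of `https://remote.ldb.ai/data-lakes/dogs-and-cats/`, an entry such as `cat.1000.png` would resolve to the full URL, while entries that are already absolute pass through unchanged.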

volkfox commented 2 years ago

We have an alternative in the form of Label Studio format.

I suggest closing this.

jonburdo commented 2 years ago

This suggestion is a little different. In Label Studio format, the annotation content is inside the file itself. This is a proposal to specify all files as paths, including annotations, so it could be used with any other indexing format.

In Label Studio format you might have data.json with:

[
  {
    "annotations": [
      {
        "class": "cat"
      }
    ],
    "data": {
      "image": "https://remote.ldb.ai/data-lakes/dogs-and-cats/cat.1000.png"
    },
    "predictions": []
  },
  {
    "annotations": [
      {
        "class": "dog"
      }
    ],
    "data": {
      "image": "https://remote.ldb.ai/data-lakes/dogs-and-cats/dog.1000.png"
    },
    "predictions": []
  }
]

In this suggestion, you might have data.json:

[
  "https://remote.ldb.ai/data-lakes/dogs-and-cats/cat.1000.png",
  "https://remote.ldb.ai/data-lakes/dogs-and-cats/cat.1000.json",
  "https://remote.ldb.ai/data-lakes/dogs-and-cats/dog.1000.png",
  "https://remote.ldb.ai/data-lakes/dogs-and-cats/dog.1000.json"
]

and index with ldb index --format pairs --spec-file data.json.
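
To illustrate how a flat list like this could still yield data/annotation pairs, here is a rough Python sketch that groups urlpaths sharing everything except the file extension. The `pair_files` helper and the rule that a `.json` extension marks the annotation are assumptions for this example, not a description of how ldb's indexing formats are implemented.

```python
import posixpath
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit


def pair_files(
    urlpaths: Iterable[str], annotation_ext: str = ".json"
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (pairs, unpaired), where each pair is (data_url, annotation_url)."""
    groups: Dict[Tuple[str, str, str], Dict[str, str]] = {}
    for url in urlpaths:
        parts = urlsplit(url)
        # Split only the path component so a query string cannot confuse
        # the extension check.
        stem, ext = posixpath.splitext(parts.path)
        role = "annotation" if ext.lower() == annotation_ext else "data"
        # If two files share a stem and a role, the later one wins.
        groups.setdefault((parts.scheme, parts.netloc, stem), {})[role] = url

    pairs: List[Tuple[str, str]] = []
    unpaired: List[str] = []
    for group in groups.values():
        if "data" in group and "annotation" in group:
            pairs.append((group["data"], group["annotation"]))
        else:
            unpaired.extend(group.values())
    return pairs, unpaired
```

Applied to the four URLs above, this would group `cat.1000.png` with `cat.1000.json` and `dog.1000.png` with `dog.1000.json`; any file without a partner ends up in the second list.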

The file here provides the file list that ldb otherwise produces by recursively traversing a directory. This is useful for a couple of scenarios:

- dealing with HTTP, where there are no directories to traverse
- indexing a specific subset of the data in a data lake instead of recursively traversing the whole thing

Essentially, this would be a convenient way to avoid passing many individual filepaths as separate args to ldb index (similar to how we can use a requirements.txt file with pip).
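
A small sketch of what that could look like on the command-line side, using Python's argparse. The parser below is a stand-in, not ldb's real CLI: the `--spec-file` option is only the name proposed in this issue, `collect_paths` is a hypothetical helper, and only a local JSON list is read (remote urlpaths and yaml/toml are left out).

```python
import argparse
import json
from pathlib import Path
from typing import List, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="index-sketch")
    parser.add_argument("paths", nargs="*", help="urlpaths given directly")
    parser.add_argument(
        "--spec-file",
        action="append",
        default=None,
        help="file containing a JSON list of urlpaths (repeatable)",
    )
    return parser


def collect_paths(argv: Sequence[str]) -> List[str]:
    """Merge paths given as args with paths listed in any spec files."""
    args = build_parser().parse_args(argv)
    paths = list(args.paths)
    for spec in args.spec_file or []:
        listed = json.loads(Path(spec).read_text())
        if not isinstance(listed, list):
            raise ValueError(f"{spec}: expected a JSON list of urlpaths")
        paths.extend(str(p) for p in listed)
    return paths
```

Paths passed directly come first, followed by the entries of each spec file in the order the options were given, much like mixing package names and `-r requirements.txt` in a single pip invocation.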

iterative / ldb

ability to specify indexing options with a file #220