v7labs / darwin-py

Library and commandline tool for managing datasets on darwin.v7labs.com
MIT License

Missing option to set export folder when using pull #806

Closed: BartvanMarrewijk closed this issue 3 months ago

BartvanMarrewijk commented 5 months ago

Currently, pulling a release does not allow you to specify an output folder. By default, the images are downloaded into the folder structure /datasets/your_team_name/dataset_name/, with subfolders for images and releases.

However, this is really inconvenient for the deep learning pipelines of my group. We do not want the parent folders "datasets" and "your_team_name", because they do not match our dataset infrastructure.

Is there a way to solve this problem, for example by altering the local release path? Something like dataset.local_releases_path = "" is currently not allowed.

    from time import sleep

    from darwin.client import Client
    from darwin.dataset.release import Release
    from darwin.dataset.remote_dataset import RemoteDataset
    from darwin.exceptions import NotFound

    client = Client.from_api_key(MY_API_KEY)

    dataset: RemoteDataset = client.get_remote_dataset(dataset_identifier="your_team_name/dataset_name")

    release_name: str = "demo-teammeeting"
    # dataset.export(release_name)

    # Poll until the release exists, then download it.
    while True:
        try:
            print("Trying to get release")
            release: Release = dataset.get_release(release_name)

            print("Got Release, downloading it!")
            dataset.pull(release=release, only_annotations=True, use_folders=True)

            # release.download_zip(Path(f"./{release_name}.zip"))

            break
        except NotFound:
            print("Release not ready yet!")
            sleep(10)

Thanks in advance

linear[bot] commented 5 months ago

PY-666 Missing option to set export folder when using pull

JBWilkie commented 3 months ago

Hi @BartvanMarrewijk, thanks for getting in touch! While it's not possible to specify a datasets download directory at the time a pull() command is executed, it is possible to configure a different datasets download directory while authenticating.

When authenticating, as of this PR it's possible to pass the --datasets_dir option in the command. This will update the directory in your ~/.config.yaml file. You can achieve this by authenticating as follows:

darwin authenticate --api_key {api_key} --datasets_dir {path/to/desired/directory}

It's also possible to set these options with a .env file using environment variables as follows:

DARWIN_API_KEY=' '
DARWIN_TEAM=' '
DARWIN_DATASETS_DIR=' '

BartvanMarrewijk commented 3 months ago

I still have the same problem after setting darwin_datasets_dir in the config. The dataset is downloaded to the specified location, but the team name is still appended to the download folder. For example: DARWIN_TEAM='my_great_team' DARWIN_DATASETS_DIR='/c/users/data_folder/'

Now it will save the data in: '/c/users/data_folder/my_great_team'

But I actually want to save the data in: '/c/users/data_folder/'

JBWilkie commented 3 months ago

I see, thank you for clarifying @BartvanMarrewijk! Unfortunately, including the team name in the file path is necessary. For users who are part of multiple teams, the team name provides a way of mapping datasets to teams. It also prevents dataset releases from being overwritten when datasets in different teams have identical names.
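
The overwrite risk can be illustrated with a minimal sketch that uses only pathlib. Both helpers below are hypothetical and simply mirror the datasets_dir/team/dataset layout described in this thread; they are not darwin-py's actual implementation.

```python
from pathlib import Path


def local_dataset_path(datasets_dir: str, team_slug: str, dataset_slug: str) -> Path:
    """Hypothetical helper mirroring the layout described in this thread."""
    return Path(datasets_dir) / team_slug / dataset_slug


def flat_dataset_path(datasets_dir: str, dataset_slug: str) -> Path:
    """The flattened layout requested in this issue (team folder omitted)."""
    return Path(datasets_dir) / dataset_slug


# Two teams that both own a dataset called "cars".
teams = ("team-a", "team-b")
with_team = {local_dataset_path("/data", team, "cars") for team in teams}
without_team = {flat_dataset_path("/data", "cars") for _ in teams}

print(len(with_team))     # two distinct folders, one per team
print(len(without_team))  # a single shared folder, so pulls could overwrite each other
```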

BartvanMarrewijk commented 3 months ago

Mmm, a pity. I do think that it should be done by default, but it feels counterintuitive that another folder is created after you set the dataset folder. Identical names in different teams would also be a bit counterintuitive, as it assumes you are working with the same data in different teams. I cannot find any reason why people would do that, and if it does occur, by default nothing will happen, right?

JBWilkie commented 3 months ago

Hi Bart, thanks for getting back to me! I understand this might be a little counterintuitive. Unfortunately, we do have a number of users who rely on this functionality. Therefore, if you require pulled files to be in a different structure, it will be necessary to move them after the pull() operation is complete.
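
A minimal sketch of that move step, using only the Python standard library, might look like the following. The function name flatten_team_folder is hypothetical and not part of darwin-py, and the sketch assumes the datasets_dir/team_name/dataset_name layout described earlier in this thread.

```python
import shutil
from pathlib import Path
from typing import List


def flatten_team_folder(datasets_dir: Path, team_slug: str) -> List[Path]:
    """Move every dataset folder from datasets_dir/team_slug up into datasets_dir.

    Hypothetical helper, not part of darwin-py. It refuses to overwrite an
    existing folder so that nothing is silently lost.
    """
    team_dir = datasets_dir / team_slug
    moved = []
    for dataset_dir in sorted(p for p in team_dir.iterdir() if p.is_dir()):
        target = datasets_dir / dataset_dir.name
        if target.exists():
            raise FileExistsError(f"{target} already exists; not overwriting")
        shutil.move(str(dataset_dir), str(target))
        moved.append(target)
    # Remove the team folder only if it is now empty.
    if not any(team_dir.iterdir()):
        team_dir.rmdir()
    return moved
```

It could be called once pull() has finished, for example flatten_team_folder(Path("/c/users/data_folder"), "my_great_team"). A later pull() would presumably recreate the team folder, so the move would need to be repeated after each pull.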