[FR] Support CVAT Dataset Repositories

niclaswue commented 3 years ago

Proposal Summary

Support for dataset repositories in CVAT during task creation and import.

Motivation

What is the use case for this feature? Ideally, the repository information can be sent to CVAT when creating a task in FiftyOne. When importing the labeled data back into FiftyOne, the labels are automatically pushed to the specified repository using the endpoint <cvat_host>/git/repository/push/<task_id>.
Why is this use case valuable to support for FiftyOne users in general? It allows for easy backups of the labeled data, along with information about the labeler, job reviewer etc.
Why is this use case valuable to support for your project(s) or organization? We want to use the CVAT dataset repository field for synchronization and backups when a labeling task is finished. It is also important to us to save the metadata about the labeler and reviewer to ensure high dataset quality.
Why is it currently difficult to achieve this use case? (please be as specific as possible about why related FiftyOne features and components are insufficient) To the best of my knowledge, this feature is not supported at the moment. I saw that it's possible to set values for job_assignees, job_reviewers, etc. in CVATBackendConfig, but there is no option for a dataset repository. Or did I overlook it?

What areas of FiftyOne does this feature affect?

[ ] App: FiftyOne application
[x] Core: Core fiftyone Python library
[ ] Server: FiftyOne server

Willingness to contribute

The FiftyOne Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

[ ] Yes. I can contribute this feature independently.
[x] Yes. I would be willing to contribute this feature with guidance from the FiftyOne community.
[ ] No. I cannot contribute this feature at this time.

niclaswue commented 2 years ago

My workaround for now is to create and push to the repository when the task is finished. For this, I used the following endpoints:

import time
from fiftyone.utils.cvat import CVATAnnotationAPI

cvat = CVATAnnotationAPI(...)
task_id = ... # get from dataset.load_annotation_results(anno_key).get_status()
file_path = f"labels/{anno_key}.xml"
# dataset_repository is the URL of the git repository; the target file path follows it in square brackets
repo_patch = {"path": f"{dataset_repository} [{file_path}]", "lfs": False}  # format seems to be ignored
response_create = cvat.patch(f"{cvat.base_url}/git/repository/create/{task_id}", json=repo_patch)
assert response_create.status_code == 200  # fail early if the repository could not be attached
time.sleep(10)  # give CVAT some time to clone
response_push = cvat.get(f"{cvat.base_url}/git/repository/push/{task_id}")
assert response_push.status_code == 200
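
For anyone adapting this workaround, here is a minimal sketch that separates building the request data from sending it, and swaps the fixed 10-second sleep for a simple retry loop. The helper names build_repo_request and retry_until_ok are hypothetical and not part of FiftyOne or CVAT. It also assumes that the push endpoint returns a non-200 status while the clone is still in progress, which I have not verified.

```python
import time


def build_repo_request(base_url, task_id, repo_url, file_path, lfs=False):
    """Build the URLs and payload used in the workaround above.

    Hypothetical helper: it only formats strings and sends nothing.
    """
    base = base_url.rstrip("/")
    create_url = f"{base}/git/repository/create/{task_id}"
    push_url = f"{base}/git/repository/push/{task_id}"
    # Repository URL followed by the target file path in square brackets
    payload = {"path": f"{repo_url} [{file_path}]", "lfs": lfs}
    return create_url, push_url, payload


def retry_until_ok(request_fn, attempts=5, delay=2.0, sleep=time.sleep):
    """Call request_fn() until the response has status_code 200.

    Assumption: a non-200 status means "not ready yet" and is worth retrying.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    response = None
    for attempt in range(attempts):
        response = request_fn()
        if response.status_code == 200:
            return response
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next try
    raise RuntimeError(
        f"request failed after {attempts} attempts "
        f"(last status: {response.status_code})"
    )
```

With the cvat object from the snippet above, the push would then look something like retry_until_ok(lambda: cvat.get(push_url)), after the create request has succeeded.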

ehofesmann commented 2 years ago

@niclaswue I have made a proof of concept for this FR. However, it seems that a better workflow would be to avoid using the git repository altogether and instead back up annotations within FiftyOne directly.

Cloning the dataset is an efficient way to make a backup of the metadata it contains: dataset2 = dataset.clone()
You're right that job_assignees, task_assignee, and job_reviewers can be used to specify those parameters programmatically. You can then retrieve the status of the annotation run, which contains the assignee and reviewer information, and store it back in the dataset in whatever way you want.

Combining these two workflows should let you avoid needing to upload to a dataset repository.
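
As a rough sketch of the second step, the function below flattens a nested status dictionary into one record per job so that the assignee and reviewer information can be stored alongside the dataset. The function name collect_assignments is hypothetical, and the input shape (label field, then task ID, then a "jobs" mapping with "assignee" and "reviewer" keys) is an assumption that should be checked against what get_status() actually returns in your version.

```python
def collect_assignments(status):
    """Flatten a nested annotation-run status into per-job records.

    Assumed input shape (not verified against FiftyOne):
        {label_field: {task_id: {"assignee": ..., "jobs": {job_id: {...}}}}}
    """
    records = []
    for label_field, tasks in status.items():
        for task_id, task in tasks.items():
            for job_id, job in task.get("jobs", {}).items():
                records.append(
                    {
                        "label_field": label_field,
                        "task_id": task_id,
                        "job_id": job_id,
                        "task_assignee": task.get("assignee"),
                        "job_assignee": job.get("assignee"),
                        "reviewer": job.get("reviewer"),
                    }
                )
    return records
```

The resulting list could then be kept with the cloned dataset, for example under a key in dataset.info followed by dataset.save(), though any storage location would work.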

brimoor commented 2 years ago

Hey @niclaswue it sounds like you might want to hear more about FiftyOne Teams.

That's our mechanism for providing features like versioning and permissions for production ML workflows :)

niclaswue commented 2 years ago

Thank you very much. I found it very convenient to pass the repository info to CVAT and not have to deal with a git library for pushing manually. However, any additional information added in FiftyOne is of course lost when transferring to CVAT, so I might take a look at this in the future. Right now we are just at the very early testing stages of our pipeline. Thanks again :)
