v7labs / darwin-py

Library and commandline tool for managing datasets on darwin.v7labs.com
MIT License

darwin dataset pull overwrites images with same name, unexpected behavioral change between 0.8.x and 1.0.x #884

Closed filippocastelli closed 1 month ago

filippocastelli commented 2 months ago

Using darwin dataset pull without specifying --folders results in missing image files when multiple remote files share the same filename.

Please note that this is the same issue as #603, which was solved somewhere around 0.8.44 and was most likely reintroduced by #872.

This unexpected behavioural change in a core feature of the package, dataset pulling, is very disruptive for customer workflows that depend on darwin-py.

Steps to reproduce are below:

(condaenv) phil@gondolin:~/.darwin/datasets$ python -c "import darwin; print(darwin.__version__)"
1.0.1
(condaenv) phil@gondolin:~/.darwin/datasets$ darwin dataset pull v7user/datasetslug:release
Going to download 195 files to /home/phil/.darwin/datasets/v7user/datasetslug/images .
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total file count after download completed 80.
Dataset v7user/datasetslug:release downloaded at /home/phil/.darwin/datasets/v7user/datasetslug .
(condaenv) phil@gondolin:~/.darwin/datasets$ find ./ -type f \( -iname \*.json \) | grep -v "metadata" | wc -l
195
(condaenv) phil@gondolin:~/.darwin/datasets$ find ./ -type f \( -iname \*.jpg -o -iname \*.png \) | wc -l
80
(condaenv) phil@gondolin:~/.darwin/datasets$ rm -R v7user/
(condaenv) phil@gondolin:~/.darwin/datasets$ pip install darwin-py==0.8.62
Collecting darwin-py==0.8.62
[...]
Successfully installed darwin-py-0.8.62
(condaenv) phil@gondolin:~/.darwin/datasets$ darwin dataset pull v7user/datasetslug:release
Going to download 195 files to /home/phil/.darwin/datasets/v7user/datasetslug/images .
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total file count after download completed 195.
A newer version of darwin-py (1.0.1) is available!
Run the following command to install it:

    pip install darwin-py==1.0.1

Dataset v7user/datasetslug:release downloaded at /home/phil/.darwin/datasets/v7user/datasetslug .
(condaenv) phil@gondolin:~/.darwin/datasets$ find ./ -type f \( -iname \*.json \) | grep -v "metadata" | wc -l
195
(condaenv) phil@gondolin:~/.darwin/datasets$ find ./ -type f \( -iname \*.jpg -o -iname \*.png \) | wc -l
195
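The two find commands above can also be run as one check. The following is a minimal sketch in plain Python that counts annotation JSON files (skipping any whose path contains "metadata") and .jpg/.png images under a directory; the function name count_pulled_files is made up for illustration and is not part of darwin-py.

```python
from pathlib import Path

# Image extensions checked by the find command in the reproduction steps.
IMAGE_SUFFIXES = {".jpg", ".png"}


def count_pulled_files(root):
    """Return (annotation_count, image_count) for all files under root.

    Hypothetical helper, not a darwin-py function.
    """
    root = Path(root)
    annotations = 0
    images = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        relative = str(path.relative_to(root))
        # Mirror `grep -v "metadata"`: skip JSON files whose path mentions metadata.
        if suffix == ".json" and "metadata" not in relative:
            annotations += 1
        elif suffix in IMAGE_SUFFIXES:
            images += 1
    return annotations, images
```

If the two numbers differ after a pull, as with 195 annotations and 80 images on 1.0.1, some images were most likely overwritten.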
linear[bot] commented 2 months ago

DAR-2991 darwin dataset pull overwrites images with same name, unexpected behavioral change between 0.8.x and 1.0.x

JBWilkie commented 2 months ago

Hi @filippocastelli, thanks for raising this with us!

When we released the major version 1.0.0, we made some breaking changes to how darwin-py names files when pulling & loading data. These changes were made in a bid to improve the coherence of the naming conventions of in-platform and locally downloaded files. One of those changes involved naming items after their in-platform name (item name) instead of the name of the annotation file, which previously was the case.

Another change made RemoteDataset.pull() pull with folders by default, instead of into a flat structure as before. Unfortunately, there was an oversight and we did not change this behaviour for CLI-initiated pull operations. This was a mistake and we apologise for the issue. Because items are now named after their in-platform names while the CLI still pulls into a flat structure, items that share a name in different folders overwrite one another.
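To make the overwriting concrete, here is a small sketch of how a flat layout collapses items that share a name, while a folder layout keeps them apart. The helper names, the example folders, and the exact local path layout are assumptions for illustration only; this is not how darwin-py computes paths internally.

```python
from collections import Counter
from pathlib import PurePosixPath


def local_image_path(remote_folder, item_name, use_folders):
    """Map a remote item to a local path (hypothetical layout under images/)."""
    base = PurePosixPath("images")
    if use_folders:
        # Keep the remote folder so equal names in different folders stay distinct.
        return base / remote_folder.strip("/") / item_name
    # Flat layout: only the item name survives, so equal names collide.
    return base / item_name


def overwritten_count(items, use_folders):
    """Count how many downloads would land on a path that is already taken."""
    targets = Counter(local_image_path(f, n, use_folders) for f, n in items)
    return sum(count - 1 for count in targets.values())


# Made-up items: two folders that both contain an item called 0001.jpg.
items = [("/cam_a", "0001.jpg"), ("/cam_b", "0001.jpg"), ("/cam_a", "0002.jpg")]
flat_overwrites = overwritten_count(items, use_folders=False)
folder_overwrites = overwritten_count(items, use_folders=True)
```

With these made-up items, the flat layout loses one file and the folder layout loses none, which is the same effect as the 195 versus 80 file counts in the report.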

To rectify it, work has been done on the DAR-2991 branch and a PR has been opened with the following changes:

Because the changes were made to bring greater coherence between the names of in-platform items and local items, unfortunately we won't be reverting to the behaviour that employs _n suffixes to ensure uniqueness. In advance of the release, emails containing information on the changes were sent to every Darwin team.
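For readers unfamiliar with the older scheme, the sketch below shows one way _n suffixes can keep names unique in a flat directory. The exact rule used by darwin-py 0.8.x is not given in this thread, so the numbering and the function name uniquify are assumptions.

```python
from pathlib import PurePosixPath


def uniquify(names):
    """Append _1, _2, ... to repeated file names (assumed scheme, for illustration).

    This sketch does not guard against a generated name clashing with a
    name that already exists in the input.
    """
    seen = {}
    result = []
    for name in names:
        n = seen.get(name, 0)
        seen[name] = n + 1
        if n == 0:
            # First occurrence keeps its original name.
            result.append(name)
        else:
            p = PurePosixPath(name)
            result.append(f"{p.stem}_{n}{p.suffix}")
    return result
```

The trade-off is visible here: names stay unique, but a local file such as 0001_1.jpg no longer matches the name of any item on the platform.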

You'll be updated as soon as these changes are available in a darwin-py release!

filippocastelli commented 2 months ago

Thank you for the feedback; the reasons for this behavior change are justified.

JBWilkie commented 1 month ago

Hi @filippocastelli, the above changes are now available in version 1.0.3, released today. pull() from the CLI will now pull with folders by default. You can still pull a flat structure with the new --no-folders flag, and a non-blocking warning will be displayed for every file that's going to be overwritten.

filippocastelli commented 1 month ago

thank you!