iterative / example-repos-dev

Source code and generator scripts for example DVC projects
https://dvc.org/doc

Update `example-dvc-experiments` with `dvc exp init` and confusion matrix #97

Closed. iesahin closed this 2 years ago.

iesahin commented 2 years ago

Updates https://github.com/iterative/example-dvc-experiments with

Closes #96

shcheklein commented 2 years ago

@iesahin could you please add the context in the description? why, previous discussions, etc. That would help to understand this :)

iesahin commented 2 years ago

could you please add the context in the description? why, previous discussions

I shouldn't add these just before the meeting :)

shcheklein commented 2 years ago

Thanks, Emre. It's still not enough context to meaningfully review this :(

Why do we need this repo? What is the plan - replace existing? keep both? etc?

What was the motivation behind doing this?

iesahin commented 2 years ago

Thanks, Emre. It's still not enough context to meaningfully review this :(

Why do we need this repo? What is the plan - replace existing? keep both? etc?

Actually, this was a temporary PR. I think the confusion is because I didn't mark it as a draft or WIP. Sorry.

I'm testing how to generate a repository based on dvc exp init. But from my tests, the current exp init doesn't provide a much faster intro to experiments. This is because:

Basically, what (the current) dvc exp init does is something like dvc stage add with some sane defaults. (dvc exp init --interactive fills in the pipeline elements by asking the user.) In the current intro to experiments, we assume there is already a pipeline. If we remove that assumption and try to create a pipeline with dvc exp init, it will require more preparation to get to dvc exp run. Currently, we hide all of this in a details section, but if we intend to start GS: Experiments with dvc exp init, we'll first ask the user to dvc init, then dvc exp init, then dvc add data/ (which takes around 5 minutes to add 70K small files in the current dataset), and only then will they reach the point where they can dvc exp run. From our previous discussions I know you would like to see dvc exp run as the first command, or at least on the first page. This is probably not possible if we use dvc exp init to initialize the project.
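Concretely, the flow would look roughly like this (a sketch; I'm assuming the default stage that dvc exp init creates is enough to run the training script):

# everything below happens before the user sees a single experiment
dvc init                              # initialize DVC in the Git repo
dvc exp init python3 src/train.py     # create the default stage in dvc.yaml
dvc add data/                         # ~5 minutes for the 70K small files
dvc exp run                           # only now, the first experiment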

Another point: if we use dvc exp init, the experiment has to be a single stage. In the current project, we have two stages: we dvc pull a single .tar.gz file, then the extract stage splits this into 70K individual .png files, and the train stage works with these individual files. If we use a single stage, either (a) the stage works directly on the single .tar.gz file as the dataset, or (b) we download the 70K individual .png files instead of the archive.

Option (b) proved to be too slow; it takes at least 20-25 minutes to download. And I know (from our previous discussions) you don't want to work on a single file as the dataset, as in option (a).

What was the motivation behind doing this?

My motivation was testing dvc exp init with the current dataset. I think we should keep the DVCLive one as the next iteration of experiments. When dvc exp init removes these dvc init and dvc add requirements, we can return to this project once more. WDYT? @shcheklein

shcheklein commented 2 years ago

cc @dberenbaum @efiop

How about a separate section that is focused more on dvc exp init itself? "Initialize Project"?

which takes around 5 minutes to add 70K small files in the current dataset

Is it still the case? There were some improvements, as far as I know... Could you point me to the dataset please, so I can experiment a bit?

Option (b) proved to be too slow, will take at least 20-25 minutes to download

It seems that, realistically, DVC doesn't handle 70K files at the moment ... at least not for the quick start / get started project, where speed is important.

should we consider for now using something smaller/artificial? cc @dberenbaum @efiop ?

dberenbaum commented 2 years ago

What about starting by replacing the hidden Installing the example project section? Instead of cloning an existing DVC repo, the user can clone/download a stripped-down Git repo, and then we can show how to set up from there. The workflow can be like:

* download/clone repo with code + params.yaml + requirements.txt
* virtualenv setup
* dvc init
* dvc import data
* dvc exp init

It's a lot of steps (basically what's there now plus dvc exp init), but they are all pretty transparent or simple to explain, and it gives users an idea of how to set up their own projects. It doesn't simplify the page, but it should make it more self-contained.
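Roughly, it could look like this (a sketch; the repo URL is a placeholder, and the dataset-registry path is an assumption based on the current dataset):

git clone https://github.com/iterative/<stripped-down-repo>   # code + params.yaml + requirements.txt
cd <stripped-down-repo>
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
dvc init
dvc import https://github.com/iterative/dataset-registry \
        fashion-mnist/images.tar.gz -o data/images.tar.gz
dvc exp init python3 src/train.py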

No strong opinion on whether to keep this hidden or make a new section for it.

should we consider for now using something smaller/artificial? cc @dberenbaum @efiop ?

Let's check the times now, but IMO it's fine to use a subset of the data or a different data set if it still takes too long. Most users understand that tutorials use toy data to keep things moving.

iesahin commented 2 years ago

I've added some time commands to the repository generation. These run DVC master, installed into a venv.

These are on WSL, on a fairly good Windows laptop. I'll also test them on a Google Cloud VM. You can reproduce them yourselves by generating the repository with this branch: example-dvc-exp-init/generate.bash.
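For reference, the environment setup is roughly this (a sketch; the exact commands live in generate.bash and may differ):

python3 -m venv .venv
source .venv/bin/activate
pip install git+https://github.com/iterative/dvc    # DVC master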

Some results:

time dvc add data/
+ dvc add data/
100% Adding...|██████████████████████████████████████████|1/1 [10:59, 659.43s/file]

To track the changes with git, run:

    git add data.dvc .gitignore

real    11m1.916s
user    9m51.770s
sys     5m47.967s
time dvc init
...
real    0m0.682s
user    0m0.530s
sys     0m0.088s
time dvc exp init python3 src/train.py
+ dvc exp init python3 src/train.py
Created default stage in dvc.yaml. To run, use "dvc exp run".
See https://dvc.org/doc/user-guide/experiment-management/running-experiments.

real    0m1.280s
user    0m1.179s
sys     0m0.083s

The following are for dvc exp run running src/train.py. Absolute times depend on train.py, but dvc exp run --queue takes around 40 seconds, and dvc exp run --run-all --jobs 2 doesn't lead to ~50% shorter times because of dvc checkout. (Actually, the per-experiment time is around 2x with --queue.)

dvc exp run 
real    4m29.974s
user    10m53.876s
sys     0m59.886s
time dvc exp run -n cnn-32 --queue -S model.conv_units=32
+ dvc exp run -n cnn-32 --queue -S model.conv_units=32
Queued experiment '5235904' for future execution.

real    0m42.076s
user    0m32.779s
sys     0m5.813s
time dvc exp run -n cnn-64 --queue -S model.conv_units=64
+ dvc exp run -n cnn-64 --queue -S model.conv_units=64
Queued experiment '0bf5164' for future execution.

real    0m40.619s
user    0m32.523s
sys     0m4.828s

The following is for 4 experiments, set to run 2-by-2 in parallel. Note that a plain dvc exp run takes around 4 minutes, so the expected total for this case should be around 8-9 minutes.

time dvc exp run --run-all --jobs 2
...
Reproduced experiment(s): cnn-128, cnn-96, cnn-64, cnn-32
...
real    42m9.655s
user    166m22.484s
sys     15m53.164s

And finally, :)

time dvc exp show --no-pager
+ dvc exp show --no-pager
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Experiment            ┃ Created      ┃    loss ┃    acc ┃ train.epochs ┃ model.conv_units ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ workspace             │ -            │ 0.24566 │  0.908 │ 10           │ 16               │
│ baseline-experiment   │ Dec 02, 2021 │ 0.24566 │  0.908 │ 10           │ 16               │
│ ├── c9fb827 [cnn-64]  │ 03:23 PM     │ 0.23653 │ 0.9143 │ 10           │ 64               │
│ ├── f60d42c [cnn-32]  │ 03:22 PM     │ 0.23957 │  0.912 │ 10           │ 32               │
│ ├── 4141ac4 [cnn-128] │ 03:09 PM     │ 0.23462 │ 0.9174 │ 10           │ 128              │
│ └── 6a0bfa7 [cnn-96]  │ 03:05 PM     │ 0.25099 │ 0.9133 │ 10           │ 96               │
└───────────────────────┴──────────────┴─────────┴────────┴──────────────┴──────────────────┘

real    0m1.490s
user    0m0.892s
sys     0m0.189s
iesahin commented 2 years ago

The following are the time results on a Google Cloud VM:

time dvc get https://github.com/iterative/dataset-registry \
        fashion-mnist/images.tar.gz -o images.tar.gz
+ dvc get https://github.com/iterative/dataset-registry fashion-mnist/images.tar.gz -o images.tar.gz

real    0m3.536s
user    0m1.181s
sys     0m0.292s
time tar xvzf images.tar.gz
+ tar xvzf images.tar.gz

real    0m2.643s
user    0m0.916s
sys     0m2.045s
popd
+ popd

time dvc init
+ dvc init

real    0m8.066s
user    0m0.446s
sys     0m0.092s

# tag_tick
# git add .dvc
# git commit -m "Initialized DVC"
# git tag "dvc-init"
#
# dvc add data/images.tar.gz

time dvc exp init python3 src/train.py
+ dvc exp init python3 src/train.py

real    0m3.231s
user    0m1.177s
sys     0m0.141s

time dvc add data/
+ dvc add data/
100% Adding...|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [05:52, 352.93s/file]

real    5m55.554s
user    4m37.304s                                                                                                                                                                    
sys     1m1.456s

time dvc exp run
+ dvc exp run

real    7m0.122s
user    6m52.626s
sys     0m15.181s

time dvc exp run -n cnn-32 --queue -S model.conv_units=32
+ dvc exp run -n cnn-32 --queue -S model.conv_units=32

real    3m5.789s
user    0m33.405s                                                                                                                                                                    
sys     0m5.045s
time dvc exp run -n cnn-64 --queue -S model.conv_units=64
+ dvc exp run -n cnn-64 --queue -S model.conv_units=64

real    3m6.764s
user    0m33.531s                                                                                                                                                                    
sys     0m5.153s
time dvc exp run -n cnn-96 --queue -S model.conv_units=96
+ dvc exp run -n cnn-96 --queue -S model.conv_units=96

real    3m6.632s
user    0m33.511s
sys     0m5.089s
time dvc exp run -n cnn-128 --queue -S model.conv_units=128
+ dvc exp run -n cnn-128 --queue -S model.conv_units=128

real    3m5.999s
user    0m33.134s
sys     0m5.062s

time dvc exp run --run-all --jobs 2                                                                                                                                                  
+ dvc exp run --run-all --jobs 2

real    39m29.080s
user    70m26.471s                                                                                                                                                                   
sys     5m20.166s                                                                                                                                                                                                                                                                                                                                                         

time dvc exp show --no-pager                                                                                                                                                         
+ dvc exp show --no-pager

real    0m1.650s
user    0m0.960s                                                                                                                                                                     
sys     0m0.107s                                          

Please note the difference between the parallel dvc exp run and the serial one. Running an experiment with dvc exp run --queue or --temp takes about 2x longer per experiment.

Also, in this VM case, adding to the experiment queue takes around 3 minutes, vs 40 seconds on WSL. No other major processes were running during this test.

BTW, a plain python src/train.py takes around 4 minutes on this VM.

iesahin commented 2 years ago

What about starting by replacing the hidden Installing the example project section? Instead of cloning an existing dvc repo, the user can clone/download a stripped down git repo, and then we can show how to setup from there. The workflow can be like:

* download/clone repo with code + params.yaml + requirements.txt
* virtualenv setup
* dvc init
* dvc import data
* dvc exp init

It's a lot of steps (basically what's there now plus dvc exp init), but they are all pretty transparent or simple to explain, and it gives users an idea of how to setup their own projects. It doesn't simplify the page, but it should make it more self contained.

This is certainly possible, though I'm not sure it's worth it. I was expecting dvc exp init would make this setup smoother, without the additional need for dvc init, dvc add, or dvc import.

Another problem is the performance of dvc add and dvc import. To use dvc exp init, we require the experiment to have a single stage, and ideally this single stage should use the separate images in data/ as input. With 5-10 minutes for dvc add data/, or 20-30 minutes for dvc import data/, I doubt users will want to use DVC in another project even if they are patient enough to complete the hands-on tutorial.

No strong opinion on whether to keep this hidden or make a new section for it.

should we consider for now using something smaller/artificial? cc @dberenbaum @efiop ?

Let's check the times now, but IMO it's fine to use a subset of the data or a different data set if it still takes too long. Most users understand that tutorials use toy data to keep things moving.

This project is already a toy project: less than 40 MB of data in 70K small files. No serious user would have such a small project; our intended user base works with TBs of data and millions of files. As a user, I'm frustrated by the slowness of DVC, and I'm trying to come up with ways to work around it for the example projects. I believe we have more serious issues than writing a good tutorial.

Let me ask this straight, would you use DVC in a project with millions of files?

dberenbaum commented 2 years ago

This is certainly possible, though I'm not sure if it's worth it. I was expecting dvc exp init will make this setup smoother, without additional needs for dvc init, dvc add or dvc import.

Sorry, I may have given the wrong impression. Those features would be nice, but the primary purpose is to help users get started with experiments. The hope is that dvc exp init -i in particular provides a more user-friendly onboarding to experiments than a dvc stage add ... that runs on for multiple lines with arcane arguments, each of which introduces a completely foreign concept to new users.
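To illustrate the contrast (a sketch; the stage add flags here are only illustrative for this project, not necessarily the exact ones we'd document):

# what a new user faces today:
dvc stage add -n train \
    -d src/train.py -d data \
    -o models \
    -p model.conv_units,train.epochs \
    -M metrics.json \
    python3 src/train.py

# vs. the guided version, which asks for the same pieces interactively:
dvc exp init --interactive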

As for needing additional commands, auto dvc init would be nice, but dvc init is at least lightweight and self-explanatory. Auto dvc add is more important, but we still need some command for users to get the data initially, right? Does it save any steps in this particular workflow? cc @skshetry

This project is already a toy project, less than 40 MB of data in 70K small files. No serious user would have such a small project, our intended user base works with TBs level of data with millions of files. As a user, I'm frustrated from the slowness of DVC, and trying to come up with solutions to overcome this for the example projects. I believe we have more serious issues than writing a good tutorial.

Let me ask this straight, would you use DVC in a project with millions of files?

Maybe not -- I'm not really sure today. We are in the middle of changes to address these performance issues, especially for many files (not to mention we have an entirely new product being developed specifically to address this type of scenario). Please continue to comment in relevant issues in the core repo and open issues from your findings here. Maybe we can use these in dvc benchmarks. In the meantime, we still need to address docs needs.

FWIW, my experience is that I have used it for data in the 100s of GBs and found it extremely useful. I think it can feel slow and better performance would have a major impact, but I want to clarify from both personal experience and community interactions that it is useful today in real-world applications. A few points might explain this:


While we wait for performance to improve, what other options do we have to move the docs forward?

* Keep as an archive and extract as part of the utility functions in the stage.
* Use a subset of data.
* Use a different dataset (not focused on many small files).

Any other ideas?

iesahin commented 2 years ago

Dave,

When we discussed this topic a few months ago, Ivan assured me that the core team has a plan regarding these issues. I'm in no position to decide whether that plan is feasible or not (and I certainly don't intend to act as a manager, criticize anyone, or push the team in a certain direction), but the current situation is not impressive, and I feel frustrated when it comes to promoting features of a product that I cannot use pleasantly myself.

Note that my concerns are those of a user, not of someone making decisions about the project. I used to use DVC to track my personal collections, but currently I don't. When I use our own product only for professional reasons, I believe that's a red flag.

I can write tickets, but I don't think the gravity of the situation is well understood. Performance (and security) are two aspects that you cannot bolt onto a software project later; they are not features you can add at a certain point. Every technical decision about features must also consider its effect on performance (and security).

Regarding the particular changes for the example project: I think we can keep the current docs and the project until the performance issues are resolved. I can convert the project to its original format, where the images are loaded from a single file, but I believe that's not what @shcheklein would want.

shcheklein commented 2 years ago

@iesahin I think Dave and the team understand the problem very clearly and are trying to address it as fast as possible. No one is saying that performance or security are not important. I see your frustration, but this is part of building things fast in an early-stage environment. We need to adapt quicker and find workarounds faster. Let's please try to discuss some options and help the team as much as we can.


@dberenbaum

Keep as an archive and extract as part of the utility functions in the stage.

That's what we already do, but this would complicate the dvc exp init setup, right? That was the initial concern with trying to transition the project to dvc exp init.

Use a subset of data.

Probably won't work either: there are still too many files to make it quick.

Use a different dataset (not focused on many small files).

what are the options here? NLP problems with DL on a text file?

dberenbaum commented 2 years ago

@iesahin Are you following https://github.com/iterative/dvc-bench? @efiop and others are already working on improvements there, but your input can be helpful.

Keep as an archive and extract as part of the utility functions in the stage.

@shcheklein What I mean here is that we have a single stage, and train.py does the extraction on the fly inside the stage.

shcheklein commented 2 years ago

@dberenbaum got it. @iesahin is it feasible? (I remember we had some code that was reading the archive on the fly) ... maybe we could even use something like HDF5?

shcheklein commented 2 years ago

Or TensorFlow datasets/formats that package data?

iesahin commented 2 years ago

The earliest version of this project used MNIST's custom image format to obtain the images on the fly. (It was in IDX format, and we generated NumPy arrays from it.) We can revert to that if it sounds good.

Another option is to use a single tar file that contains the PNG images. Python supports tar in the standard library.
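For example, the single-stage layout could be as simple as this (a sketch; the --data flag and the paths are my assumptions about how we'd point dvc exp init at the archive):

dvc add data/images.tar.gz                                    # track only the archive
dvc exp init --data data/images.tar.gz python3 src/train.py   # single stage, no extract step
dvc exp run    # train.py would read the PNGs from the tar via the standard tarfile module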

We can also convert the project to a single-file NLP project, similar to example-get-started, but I don't think it's necessary; either of the two approaches above will probably suffice.

@shcheklein @dberenbaum

shcheklein commented 2 years ago

Sounds good, Emre. It's probably better to use the tar or TensorFlow format; the custom MNIST format is too specific, I guess.

iesahin commented 2 years ago

Sounds good, Emre. It's probably better to use the tar or TensorFlow format; the custom MNIST format is too specific, I guess.

Thinking about this for the initial version, I decided that "going with the default, as distributed from the dataset website" is more "excusable." Though it's easier to build a classifier that way, I don't like using TensorFlow datasets, since we have the corresponding functionality in the dataset-registry.

I'll go with the tar version then, if that's the deal? @shcheklein

iesahin commented 2 years ago

I've uploaded a version to https://github.com/iterative/example-dvc-exp-init. This is a staging version; I'll test and fix bugs in this repository, then move it to https://github.com/iterative/example-dvc-experiments.

iesahin commented 2 years ago

This is ready for review and merge. @shcheklein @dberenbaum