Closed: iesahin closed this 2 years ago.
@iesahin could you please add the context in the description? why, previous discussions, etc. That would help to understand this :)
> could you please add the context in the description? why, previous discussions

I shouldn't add these just before the meeting :)
Thanks, Emre. It's still not enough context to meaningfully review this :(
Why do we need this repo? What is the plan - replace existing? keep both? etc?
What was the motivation behind doing this?
> Thanks, Emre. It's still not enough context to meaningfully review this :(
> Why do we need this repo? What is the plan - replace existing? keep both? etc?
Actually, this was a temporary PR. Your confusion is because I didn't mark this as a draft or WIP, I think. Sorry.

I'm testing how to generate a repository based on `dvc exp init`. But from my tests, the current `dvc exp init` doesn't provide a much faster intro to experiments. This is because:

(a) `dvc init` is still needed before `dvc exp init`.
(b) `dvc add data/` is still needed after `dvc exp init`.

Basically, what (the current) `dvc exp init` does is something like `dvc stage add` with some sane defaults. (`dvc exp init --interactive` fills the pipeline elements by asking the user.) In the current intro to experiments, we assume there is already a pipeline. If we remove that assumption and try to create a pipeline with `dvc exp init`, it will require more preparation to get to `dvc exp run`. Currently, we're hiding everything in a details section, but if we intend to start the GS:Experiments with `dvc exp init`, we'll first ask the user to run `dvc init`, then `dvc exp init`, then `dvc add data/` (which takes around 5 minutes to add the 70K small files in the current dataset), and only then will they reach a point where they can run `dvc exp run`. From our previous discussions I know you would like to see `dvc exp run` as the first command, or at least on the first page. This is probably not possible if we use `dvc exp init` to initialize the project.
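To make the ordering concrete, the full sequence a new user would have to run before their first experiment would look roughly like this (a sketch; the data directory and train command are taken from this project, and `git init` is assumed since DVC needs a Git repo):

```bash
git init                              # DVC requires a Git repository
dvc init                              # (a) still needed before `dvc exp init`
dvc exp init python3 src/train.py     # creates a default stage in dvc.yaml
dvc add data/                         # (b) still needed; ~5 min for 70K small files
dvc exp run                           # only now can the first experiment run
```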
Another point: to use `dvc exp init`, the experiment has to be a single stage. In the current project, we have two stages. We `dvc pull` a single `.tar.gz` file, then the `extract` stage splits this into 70K individual `.png` files, and the `train` stage works with these individual files. If we use a single stage, either:

(a) we merge the `extract` stage into the `train.py` script, that is, training works on the `.tar.gz` file directly, or,
(b) we `dvc pull` 70K individual files from the remote to feed into `train.py`.

Option (b) proved to be too slow, taking at least 20-25 minutes to download, and I know (from our previous discussions) you don't want to work on a single file as the dataset, as in option (a).
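For reference, the current two-stage pipeline could be declared with `dvc stage add` roughly like this (a sketch; the stage names match the description above, but the exact paths, commands, and outputs are assumptions):

```bash
# Hypothetical reconstruction of the existing two-stage pipeline.
dvc stage add -n extract \
    -d data/images.tar.gz -o data/images \
    "tar xzf data/images.tar.gz -C data"

dvc stage add -n train \
    -d data/images -d src/train.py \
    -M metrics.json \
    "python3 src/train.py"
```

Collapsing this to a single stage means the first `dvc stage add` disappears and its work moves into `train.py`, which is exactly the trade-off discussed in options (a) and (b).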
> What was the motivation behind doing this?

My motivation was testing `dvc exp init` with the current dataset. I think we should keep the DVCLive one as the next iteration of experiments. When `dvc exp init` removes these `dvc init` and `dvc add` requirements, we can return to this project once more. WDYT? @shcheklein

cc @dberenbaum @efiop
How about a separate section that is focused more on `dvc exp init` itself? "Initialize Project"?
> which takes around 5 minutes to add 70K small files in the current dataset

Is it still the case? There were some improvements, as far as I know ... could you point me to the dataset please, to experiment a bit?

> Option (b) proved to be too slow, will take at least 20-25 minutes to download

It seems, realistically, DVC doesn't handle 70K files at the moment ... at least not for a quick start / get started project where speed is important.

Should we consider using something smaller/artificial for now? cc @dberenbaum @efiop ?
What about starting by replacing the hidden "Installing the example project" section? Instead of cloning an existing DVC repo, the user can clone/download a stripped-down Git repo, and then we can show how to set up from there. The workflow can be like:

* download/clone repo with code + params.yaml + requirements.txt
* virtualenv setup
* dvc init
* dvc import data
* dvc exp init

It's a lot of steps (basically what's there now plus `dvc exp init`), but they are all pretty transparent or simple to explain, and it gives users an idea of how to set up their own projects. It doesn't simplify the page, but it should make it more self-contained.
No strong opinion on whether to keep this hidden or make a new section for it.
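That proposed setup could be sketched as follows (a sketch; the repo URL placeholder and data destination are assumptions, while the dataset-registry path matches the one used elsewhere in this thread):

```bash
git clone https://github.com/<org>/<stripped-down-repo>.git example && cd example
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
dvc init
dvc import https://github.com/iterative/dataset-registry \
    fashion-mnist/images.tar.gz -o data/images.tar.gz
dvc exp init python3 src/train.py
```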
> should we consider for now using something smaller/artificial? cc @dberenbaum @efiop ?

Let's check the times now, but IMO it's fine to use a subset of the data or a different dataset if it still takes too long. Most users understand that tutorials use toy data to keep things moving.
I've added some `time` commands to the repository generation. These run on DVC `master` installed into a venv. The times below are from WSL on a fairly good Windows laptop. I'll also test these on a Google Cloud VM. You can test these yourselves by generating the repository with this branch: `example-dvc-exp-init/generate.bash`.
Some results:
```
time dvc add data/
+ dvc add data/
100% Adding...|████████████████████████|1/1 [10:59, 659.43s/file]
To track the changes with git, run: git add data.dvc .gitignore

real    11m1.916s
user    9m51.770s
sys     5m47.967s
```

```
time dvc init
...

real    0m0.682s
user    0m0.530s
sys     0m0.088s
```

```
time dvc exp init python3 src/train.py
+ dvc exp init python3 src/train.py
Created default stage in dvc.yaml. To run, use "dvc exp run".
See https://dvc.org/doc/user-guide/experiment-management/running-experiments.

real    0m1.280s
user    0m1.179s
sys     0m0.083s
```
The following are for `dvc exp run` running `src/train.py`. Absolute times depend on `train.py`, but `dvc exp run --queue` takes around 40 seconds, and `dvc exp run --run-all --jobs 2` doesn't lead to ~50% shorter times because of `dvc checkout`. (Actually, per-experiment time is around 2x with `--queue`.)
```
dvc exp run

real    4m29.974s
user    10m53.876s
sys     0m59.886s
```

```
time dvc exp run -n cnn-32 --queue -S model.conv_units=32
+ dvc exp run -n cnn-32 --queue -S model.conv_units=32
Queued experiment '5235904' for future execution.

real    0m42.076s
user    0m32.779s
sys     0m5.813s
```

```
time dvc exp run -n cnn-64 --queue -S model.conv_units=64
+ dvc exp run -n cnn-64 --queue -S model.conv_units=64
Queued experiment '0bf5164' for future execution.

real    0m40.619s
user    0m32.523s
sys     0m4.828s
```
The following is for 4 experiments, set to run 2-by-2 in parallel. Note that a plain `dvc exp run` takes around 4 minutes, so the expected total should be around 8-9 minutes for this case (4 experiments × ~4 minutes ÷ 2 jobs).
```
time dvc exp run --run-all --jobs 2
...
Reproduced experiment(s): cnn-128, cnn-96, cnn-64, cnn-32
...

real    42m9.655s
user    166m22.484s
sys     15m53.164s
```
And finally, :)
```
time dvc exp show --no-pager
+ dvc exp show --no-pager
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Experiment            ┃ Created      ┃ loss    ┃ acc    ┃ train.epochs ┃ model.conv_units ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ workspace             │ -            │ 0.24566 │ 0.908  │ 10           │ 16               │
│ baseline-experiment   │ Dec 02, 2021 │ 0.24566 │ 0.908  │ 10           │ 16               │
│ ├── c9fb827 [cnn-64]  │ 03:23 PM     │ 0.23653 │ 0.9143 │ 10           │ 64               │
│ ├── f60d42c [cnn-32]  │ 03:22 PM     │ 0.23957 │ 0.912  │ 10           │ 32               │
│ ├── 4141ac4 [cnn-128] │ 03:09 PM     │ 0.23462 │ 0.9174 │ 10           │ 128              │
│ └── 6a0bfa7 [cnn-96]  │ 03:05 PM     │ 0.25099 │ 0.9133 │ 10           │ 96               │
└───────────────────────┴──────────────┴─────────┴────────┴──────────────┴──────────────────┘

real    0m1.490s
user    0m0.892s
sys     0m0.189s
```
The following are the `time` results on a Google Cloud VM:
```
time dvc get https://github.com/iterative/dataset-registry \
     fashion-mnist/images.tar.gz -o images.tar.gz
+ dvc get https://github.com/iterative/dataset-registry fashion-mnist/images.tar.gz -o images.tar.gz

real    0m3.536s
user    0m1.181s
sys     0m0.292s

time tar xvzf images.tar.gz
+ tar xvzf images.tar.gz

real    0m2.643s
user    0m0.916s
sys     0m2.045s

popd
+ popd

time dvc init
+ dvc init

real    0m8.066s
user    0m0.446s
sys     0m0.092s

# tag_tick
# git add .dvc
# git commit -m "Initialized DVC"
# git tag "dvc-init"
#
# dvc add data/images.tar.gz

time dvc exp init python3 src/train.py
+ dvc exp init python3 src/train.py

real    0m3.231s
user    0m1.177s
sys     0m0.141s

time dvc add data/
+ dvc add data/
100% Adding...|████████████████████████|1/1 [05:52, 352.93s/file]

real    5m55.554s
user    4m37.304s
sys     1m1.456s

time dvc exp run
+ dvc exp run

real    7m0.122s
user    6m52.626s
sys     0m15.181s

time dvc exp run -n cnn-32 --queue -S model.conv_units=32
+ dvc exp run -n cnn-32 --queue -S model.conv_units=32

real    3m5.789s
user    0m33.405s
sys     0m5.045s

time dvc exp run -n cnn-64 --queue -S model.conv_units=64
+ dvc exp run -n cnn-64 --queue -S model.conv_units=64

real    3m6.764s
user    0m33.531s
sys     0m5.153s

time dvc exp run -n cnn-96 --queue -S model.conv_units=96
+ dvc exp run -n cnn-96 --queue -S model.conv_units=96

real    3m6.632s
user    0m33.511s
sys     0m5.089s

time dvc exp run -n cnn-128 --queue -S model.conv_units=128
+ dvc exp run -n cnn-128 --queue -S model.conv_units=128

real    3m5.999s
user    0m33.134s
sys     0m5.062s

time dvc exp run --run-all --jobs 2
+ dvc exp run --run-all --jobs 2

real    39m29.080s
user    70m26.471s
sys     5m20.166s

time dvc exp show --no-pager
+ dvc exp show --no-pager

real    0m1.650s
user    0m0.960s
sys     0m0.107s
```
Please note the difference between the parallel `dvc exp run` and the serial one. Running an experiment with `dvc exp run --queue` or `--temp` takes about 2x more time per experiment.

Also, in this VM case, adding to the experiment queue takes around 3 minutes, vs. 40 seconds on WSL. No other major processes were running during this test.

BTW, a plain `python src/train.py` takes around 4 minutes on this VM.
> What about starting by replacing the hidden "Installing the example project" section? Instead of cloning an existing dvc repo, the user can clone/download a stripped down git repo, and then we can show how to setup from there. The workflow can be like:
>
> * download/clone repo with code + params.yaml + requirements.txt
> * virtualenv setup
> * dvc init
> * dvc import data
> * dvc exp init
>
> It's a lot of steps (basically what's there now plus `dvc exp init`), but they are all pretty transparent or simple to explain, and it gives users an idea of how to setup their own projects. It doesn't simplify the page, but it should make it more self contained.
This is certainly possible, though I'm not sure it's worth it. I was expecting `dvc exp init` to make this setup smoother, without the additional need for `dvc init`, `dvc add`, or `dvc import`.

Another problem is the performance of `dvc add` and `dvc import`. To use `dvc exp init`, we require the experiment to have a single stage, and ideally this single stage must use the separate images in `data/` as input. With 5-10 minutes for `dvc add data/`, or 20-30 minutes for `dvc import data/`, I doubt users will want to use DVC in another project even if they are patient enough to complete the hands-on tutorial.
> No strong opinion on whether to keep this hidden or make a new section for it.

> should we consider for now using something smaller/artificial? cc @dberenbaum @efiop ?

> Let's check the times now, but IMO it's fine to use a subset of the data or a different data set if it still takes too long. Most users understand that tutorials use toy data to keep things moving.
This project is already a toy project, with less than 40 MB of data in 70K small files. No serious user would have such a small project; our intended user base works with TB-scale data and millions of files. As a user, I'm frustrated by the slowness of DVC, and I'm trying to come up with solutions to work around it for the example projects. I believe we have more serious issues than writing a good tutorial.

Let me ask this straight: would you use DVC in a project with millions of files?
> This is certainly possible, though I'm not sure if it's worth it. I was expecting `dvc exp init` will make this setup smoother, without additional needs for `dvc init`, `dvc add` or `dvc import`.
Sorry, I may have given the wrong impression. Those features would be nice, but the primary purpose is to help users get started with experiments. The hope is that `dvc exp init -i` in particular provides a more user-friendly onboarding to experiments than a `dvc stage add ...` that runs on for multiple lines with arcane arguments, each of which introduces a completely foreign concept to new users.

As far as needing additional commands: an automatic `dvc init` would be nice, but the command is at least lightweight and self-explanatory. An automatic `dvc add` is more important, but we still need some command for users to get the data initially, right? Does it save any steps in this particular workflow? cc @skshetry
> This project is already a toy project, less than 40 MB of data in 70K small files. No serious user would have such a small project, our intended user base works with TBs level of data with millions of files. As a user, I'm frustrated from the slowness of DVC, and trying to come up with solutions to overcome this for the example projects. I believe we have more serious issues than writing a good tutorial.
>
> Let me ask this straight, would you use DVC in a project with millions of files?
Maybe not -- I'm not really sure today. We are in the middle of changes to address these performance issues, especially for many files (not to mention we have an entirely new product being developed specifically to address this type of scenario). Please continue to comment in relevant issues in the core repo and open issues from your findings here. Maybe we can use these in dvc benchmarks. In the meantime, we still need to address docs needs.
FWIW, my experience is that I have used it for data in the 100s of GBs and found it extremely useful. I think it can feel slow and better performance would have a major impact, but I want to clarify from both personal experience and community interactions that it is useful today in real-world applications. A few points might explain this:
... `dvc init` time TBH.

While we wait for performance to improve, what other options do we have to move the docs forward?

- Keep as an archive and extract as part of the utility functions in the stage.
- Use a subset of data.
- Use a different dataset (not focused on many small files).
Any other ideas?
Dave,

When we discussed this topic a few months ago, Ivan assured me that the core team has a plan regarding these issues. I'm in no position to decide whether that plan is feasible or not (and I certainly never intend to be a manager, to criticize anyone, or to push the team in a certain direction), but the current situation is not impressive, and I feel frustrated when it comes to telling users about features of a product that I cannot use pleasantly.

Note that my concerns are the concerns of a user, not of someone making decisions about the project. I used to use DVC to track my personal collections, but currently I don't. When I'm using our own product only for professional reasons, I believe that's a red flag.

I can write tickets, but I don't think the gravity of the situation is well understood. Performance (and security) are two aspects that you cannot add to a software project later; they are not like features you can add at a certain point. Every technical decision regarding features must also be made considering its effect on performance (and security).

Regarding the particular changes for the example project: I think we can keep the current docs and the project until the performance issues are resolved. I can convert the project to its original format, where the images are loaded from a single file, but I believe that's not what @shcheklein would want.
@iesahin I think Dave and the team understand the problem very clearly and are trying to address it as fast as possible. No one is saying that performance or security are not important. I see your frustration, but this is part of building things fast in an early-stage environment. We need to adapt quicker and find workarounds faster. Let's try to discuss some options please and help the team as much as we can.

@dberenbaum

> Keep as an archive and extract as part of the utility functions in the stage.

That's what we already do, but this would complicate `dvc exp init`, right? That was the initial concern with trying to transition the project to `dvc exp init`.

> Use a subset of data.

Probably won't work either - still too many files to do it quickly.

> Use a different dataset (not focused on many small files).

What are the options here? NLP problems with DL on a text file?
@iesahin Are you following https://github.com/iterative/dvc-bench? @efiop and others are already working on improvements there, but your input can be helpful.
> Keep as an archive and extract as part of the utility functions in the stage.

@shcheklein What I mean here is that we have a single stage, and inside `train.py` it does the extraction on the fly.
@dberenbaum got it. @iesahin is it feasible? (I remember we had some code that was reading the archive on the fly) ... maybe we could even use something like HDF5? Or TensorFlow datasets/formats that package data?
The earliest version of this project used MNIST's custom image format to obtain the images on the fly. (It was in IDX format, and we generated NumPy arrays from it.) We can revert to that if it sounds good.

Another option is to use a single tar file that contains the PNG images. Python supports tar in the standard library.

We could also convert the project to a single-file NLP project, similar to example-get-started, but I don't think that's necessary; either of the above two approaches will probably suffice.
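Reading the PNG members straight from the tar on the fly could look roughly like this (a sketch using only the standard library; the archive path and member layout are assumptions, and a real loader would decode the bytes into arrays):

```python
import tarfile


def iter_png_members(tar_path):
    """Yield (member_name, raw_png_bytes) for every PNG inside the archive."""
    # Streaming through the archive avoids extracting 70K files to disk.
    with tarfile.open(tar_path, "r:*") as tar:  # "r:*" autodetects compression
        for member in tar:
            if member.isfile() and member.name.endswith(".png"):
                fileobj = tar.extractfile(member)
                if fileobj is not None:
                    yield member.name, fileobj.read()
```

A training script could then decode each `raw_png_bytes` buffer in memory instead of depending on 70K individual files on disk.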
@shcheklein @dberenbaum
Sounds good, Emre. It's probably better to use tar or a TensorFlow format; the custom MNIST format is too specific, I guess.
> Sounds good, Emre. It's probably better to use tar or a TensorFlow format; the custom MNIST format is too specific, I guess.
Thinking about this for the initial version, I decided that "going with the default, as distributed from the dataset website" is more "excusable." Though it's easier to make a classifier that way, I don't like using TensorFlow datasets, as we have the corresponding functionality in the dataset-registry.

So I'll use the tar version, if that's the deal? @shcheklein
I've uploaded a version to https://github.com/iterative/example-dvc-exp-init. This is a staging version; I'll test and fix bugs in this repository, then move it to https://github.com/iterative/example-dvc-experiments.

This is ready for review and merge. @shcheklein @dberenbaum
Updates https://github.com/iterative/example-dvc-experiments with:

- `dvc exp init` instead of `dvc stage add`
- `plots/confusion.csv` → `plots/confusion.png` (`confusion.png` is more clear that way)

Closes #96