drorata opened 4 years ago
Ok, so the first obstacle I see is that we are trying to make dvc language agnostic, so the specification of how to define this function dependency and how to calculate its checksum would have to be quite abstract. Having said that, it still looks like a reasonable feature request.
@iterative/engineering what do you think?
We'd need specific per-language support (based on file extension) to extract the source code of functions/classes and md5-hash it. Python would be an obvious first language to support, but in general we'd need one plugin/parser per language.
An alternative to this is to go all-Python. We've seen examples where companies use the DVC core API to manipulate stage files from Python. It usually means that those users do not use `dvc run`, etc. They use some custom decorators that internally handle stage files, and so on.
Not sure whether going "all Python" would be a different implementation than supporting other languages. We'd probably still have to check file extensions/types and print out errors if it's a binary file, etc. (see the sketch below).
Note: this may be a duplicate of #1577.
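For a rough idea of that per-language plugin dispatch, a sketch (all names hypothetical; the lambdas are placeholders for real parsers):

```python
import os

# Hypothetical registry of per-language extractors, keyed by extension.
PARSERS = {
    ".py": lambda src, name: f"<source of python def {name}>",
    ".js": lambda src, name: f"<source of js function {name}>",
}

def extract(path, name):
    ext = os.path.splitext(path)[1]
    if ext not in PARSERS:
        raise ValueError(f"no parser plugin for {ext!r} files")
    with open(path, "rb") as f:
        data = f.read()
    if b"\x00" in data:  # crude binary-file guard, as mentioned above
        raise ValueError(f"{path} looks like a binary file")
    return PARSERS[ext](data.decode("utf-8"), name)
```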
@casperdcl I think Ivan means that an alternative for @drorata is to integrate with DVC using the Python package (pip installation) instead of using `dvc run` and the CLI in general. We don't have any docs about it yet though (see iterative/dvc.org/issues/998), and the docstrings are sometimes very complete, sometimes nonexistent (see #3176), so that's probably a corner solution for rather advanced users.
Well, we could support a user-defined external filter tool (similar to `git`):

```yaml
deps:
- md5: 409a438ee6ac70fc1f911631412325ea
  path: utils.py
  filter: python extract_function.py --name foo
```
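For illustration, a minimal sketch of what such a filter script could look like for Python, reading the file on stdin (hypothetical - the thread never pins down an implementation for `extract_function.py`):

```python
# extract_function.py - print the source of one top-level function, so
# that only that function's code contributes to the dependency's md5.
import argparse
import ast
import sys

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", required=True, help="function to extract")
    args = parser.parse_args()

    source = sys.stdin.read()  # file content arrives on stdin, pipe-style
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == args.name:
            print(ast.get_source_segment(source, node))  # Python 3.8+
            return
    sys.exit(f"function {args.name!r} not found")

if __name__ == "__main__":
    main()
```

Used as `cat utils.py | python extract_function.py --name foo`.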
I agree with @casperdcl here. I don't think we should be maintaining a whole bunch of tools for this. Maybe we can write just one, for Python, and distribute it on PyPI to "set the tone" for others to provide other tools with similar arguments (and possibly command name prefixes like `dvc-extractfn-python --name foo`).
Users of other languages will want to distribute this as a package in their own language. And as long as it's in the PATH and has the same signature, we're fine with it.
Edit: the same argument spec and command prefix aren't mandatory, but they sure would be nice for user experience.
If anyone in the world wants one of these for JavaScript I can write it; it should be 50 lines or less using a pre-existing parser.
Are you sure you really want such detailed dependency management @drorata? Will you remember to update the dependency when somebody comes along and, for instance, splits the function in two?

```python
def foo(bar):
    return [ba + 1 for ba in bar]
```

to

```python
def foo(bar):
    return [fo(ba) for ba in bar]

def fo(ba):
    return ba + 1
```
Do you have an example of "this solution conflicts with good practices of coding" to make it a little less abstract?
@elgehelge good point, but I think it's fine for `dvc` to give users double-edged swords with caveat-emptor warnings. If we go down the "support user-defined `filter:`" route then users could define a really clever filter which even analyses the whole code base and pulls in a function's dependencies. Probably not too hard considering:

> it should be 50 lines or less using a pre-existing parser

I see this as part of DVC's "Makefiles-but-better" bit, where there's sub-file granularity.
The question is, does this feature make bad practice so easy for end-users to slip into that DVC should stay away from it? I don't think so.
If nothing else, the end-user `filter` could be a simple `dos2unix | sed '/^$/d'` to minimise re-running things after whitespace changes...
btw I assume `filter` is a POSIX-style pipe, i.e. `cat path | filter | md5sum` is the `md5` saved in the DVC-file (sketched below).
CC @efiop
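To make the assumed pipe semantics concrete (names illustrative; `md5sum` is just shorthand here):

```python
# `cat path | filter | md5sum` expressed in Python; the md5 of the
# filter's stdout would be what gets saved in the DVC-file.
import hashlib
import subprocess

def filtered_md5(path, filter_cmd):
    with open(path, "rb") as f:
        out = subprocess.run(filter_cmd, shell=True, stdin=f,
                             capture_output=True, check=True).stdout
    return hashlib.md5(out).hexdigest()

# e.g. filtered_md5("utils.py", "python extract_function.py --name foo")
```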
@elgehelge Good point indeed, but still, I believe you can face many cases where this conflict between good coding practices and dependencies can happen. For example, assume that stages 1 and 3 need to fetch data from S3 and stages 2 and 3 need to fetch data from MongoDB. Now, you have two utility functions that take care of fetching the data from the two different sources. From the codebase's structure point of view, I would argue that it makes sense to keep these two utility functions inside the same `utils.py`.
> Well we could support a user-defined external filter tool (similar to `git`):

It is a very good idea to unify this at the command level. That way, DVC can support dependencies for any language or tables in databases (#1577).
There is also an opportunity to go further in the unification direction and make it general for other types of "conditional" dependencies, such as hyper-parameters from a params file (#3393) and Python dependencies as a specially important use case (as @shcheklein mentioned).
The unification can look something like this (a speculative checker sketch follows the example):

```yaml
deps:
- md5: 97215532a21d13daa5bf932f93a7d518 # <-- regular file
  path: users.csv
- type: cmd # <-- command (can be code deps or DB check)
  checksum: 409a438ee6ac70fc1f911631412325ea
  cmd: node db_table_last_update.js --table users
- type: python # <-- native python deps
  checksum: 1df9d34e2031bb7e8
  path: utils.py
  function: utils.check_db
- type: params
  params: "depth=17,L1=0.25,eta=0.2,rotate=[10,20,35]" # <-- hyper-parameter
  path: params.conf
```
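A speculative sketch of how these dependency types might be checked (field names mirror the YAML above; the dispatch logic itself is not part of the proposal):

```python
import hashlib
import subprocess

def dep_checksum(dep):
    dep_type = dep.get("type", "file")
    if dep_type == "cmd":       # run the user command, hash its output
        out = subprocess.run(dep["cmd"], shell=True,
                             capture_output=True, check=True).stdout
        return hashlib.md5(out).hexdigest()
    if dep_type == "params":    # simplified: a real check would read the
        return hashlib.md5(dep["params"].encode()).hexdigest()  # values from `path`
    with open(dep["path"], "rb") as f:  # default: regular file dependency
        return hashlib.md5(f.read()).hexdigest()

def dep_changed(dep):
    return dep_checksum(dep) != (dep.get("md5") or dep.get("checksum"))
```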
> The question is, does this feature make bad practice so easy for end-users to slip into that DVC should stay away from it? I don't think so.

I don't see any issues so far (I just don't like the term `filter` :) ).

> btw I assume `filter` is a POSIX-style pipe, i.e. `cat path | filter | md5sum` is the `md5` saved in the DVC-file.

Right. The `cmd` type should support pipes.
> Will you remember to update the dependency when somebody comes along and for instance splits the function into two?

Very good point, @elgehelge! I don't think it makes sense to go too deep into function code and all its dependencies (it's okay to trigger the stage if the code was refactored without any semantic changes, as in your example). Also, for simplicity, we can agree to ignore changes in function dependencies (so, no checks for `fo` in your example, or for `math.exp()` if you use that). In your example, a proper check would need to depend on both Python functions: `foo()` and `fo()`.
I would recommend against a `type: python`.

> ```yaml
> - type: python # <-- native python deps
>   checksum: 1df9d34e2031bb7e8
>   path: utils.py
>   function: utils.check_db
> ```

Should be:

```yaml
- type: cmd
  checksum: 1df9d34e2031bb7e8
  cmd: dvc extract utils.py::utils.check_db
```
> I would recommend against a `type: python`.

Type `python` is "syntactic sugar". So, I agree, it is better not to start with it and to implement it down the road if it's needed.
In addition... `type: params` is also "syntactic sugar". However, it is so fundamental to DVC that it has to be supported in dvc-files. I'd even say it is more important than the command type.
Some external automation tools (like hyperparameter automation/visualization, CD4ML, DVC itself - `dvc repro -p learning_rate=0.04`) will use that type from the dvc-file and the params files (see #3393).
So we're down to 2 options:

```yaml
type: cmd # or potentially other things
cmd: custom_filter
```

or

```yaml
filter: custom_filter
```

I still prefer the latter (and as a minor point also prefer the word "filter", since that's what `git` calls it and what users will easily understand). However, I understand the need for a `type` with the implicit default `type: file` if unspecified.
So really it seems like there's only one option:

```yaml
type: file # optional, assumed `file` by default
cmd: custom_filter # optional
```

with no explicit `type: cmd`, as it's inferred from the existence of `cmd: custom_filter`.
`cmd:` or `filter:`... We need to specify how the checks are made and what the requirements are for the user's commands/`cmd`. DVC needs some checksum (or timestamp or another signal) from `cmd` to identify if the `cmd` needs to be reproduced.
The requirement for the user code/`cmd` can be like this (a sketch follows the list):
DVC reproduces the stage with a special/conditional command dependency (`type: cmd`) only if:
1. The command returns a zero return code - success.
2. The command output string does not match the string (checksum) from the last run (with some reasonable buffer limit like 4KB).
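A sketch of those two steps (names illustrative only):

```python
import subprocess

def conditional_dep_changed(cmd, last_output, limit=4096):
    result = subprocess.run(cmd, shell=True, capture_output=True)
    if result.returncode != 0:        # step 1: the command must succeed
        raise RuntimeError(f"conditional dependency failed: {cmd}")
    # step 2: compare the (buffer-limited) output with the last run's
    return result.stdout[:limit] != last_output
```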
Also, I don't quite understand the analogy with Git filters. Is it about `git-filter-branch`? How does this apply to the conditional commands? @casperdcl, I'd appreciate it if you could explain.
> identify if the `cmd` needs to be reproduced

I was intending it to always be run (unconditionally), but good point that this may be inefficient. Maybe we need a hash both before and after `cmd`, and only re-run `cmd` if the hash before has changed:

```yaml
path: utils.py
md5: 1q2w3e4r
cmd: custom_filter # if `md5` changed, will be run to produce `cmd_md5`
cmd_md5: 7u8i9o0p
```

or:

```yaml
path: utils.py
path_md5: 1q2w3e4r
cmd: custom_filter # if `path_md5` changed, will be run to produce `md5`
md5: 7u8i9o0p
```
> 1. The command returns zero return code - success.
> 2. The command output string does not match the string (checksum) from the last run. (with some reasonable buffer limit like 4KB)

I disagree with both options. I'd expect a modification of 2: `cat path | cmd | md5sum` is the checksum, ergo there is no buffer limit.

> analogy with Git filters

I refer to e.g. `smudge` and `clean` filters: https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes#filters_a
> > identify if the `cmd` needs to be reproduced
>
> I was intending it to always be run (unconditionally) but good point that this may be inefficient.

There might be some misunderstanding... The whole point of the specific dependencies (this issue as well as the hyper-param issue #3393) is not to run a stage when the user's code (`cmd`) can identify that the result will be the same despite the dependency file change.

> Maybe we need a hash both before and after `cmd` and only re-run `cmd` if the hash before has changed.

Kind of... You compare the stored value with what `cmd` returns, then you update the old value to the new one. So there is no need to store both together.

> I disagree with both options. I'd expect a modification of 2: `cat path | cmd | md5sum` is the checksum, ergo there is no buffer limit.

These are not options. These are 2 steps that you need to run to check if `repro` of a stage is needed. I didn't get your modification (except the limit, which is an implementation detail):

1. What happens if `cmd` failed (see step 1)?
2. Are you forcing users to run exactly the same commands with md5sum etc. (`md5sum` does not exist in Mac OS btw)? What if a user'd like to run `db_table_last_update.js`, which returns a table's last modification timestamp?
which returns a table last modification timestamp?@dmpetrov definite misunderstanding here. In the issue here I'm referring to filter
as the a command which extracts a subset of the file defined in path
.
This is completely and utterly different from the usual (current) definition of cmd
which is why I'd prefer a different name. To be explicit:
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py # <-- user cmd
deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4 # <-- from `cat path | md5sum`
path: train_model.py
- md5: 809776eaeef854443d39e4f76bd64c84 # <-- from `cat path | filter | md5sum`
path: utils.py
filter: python extract_function.py --name check_db # <-- user filter
md5_orig: 500f914dd06cf1f588b8e281c8fb3dd8 # <-- optional, from `cat path | md5sum`,
# used to determine if we should bother re-running `filter`
outs:
- md5: 47b3781954c760e1bd19ac9e56e141b1
path: model/mymodel.pkl
Where "user filter" is as equivalent to cat path | filter | md5sum
to generate the md5 associated with the path. In this case: cat utils.py | python extract_function.py --name check_db | md5sum
. The normal (no filter
) behaviour (md5
based on path
) would be equivalent to filter: cat
(or is it actually currently filter: head -c 4M
for speed purposes at the cost of correctness? Off-topic question).
I also prefer to do without md5_orig
(i.e. instead we should run filter
unconditionally) as it's a minimal overhead/inefficiency which can be dealt with in a different issue. Running filter
unconditionally will re-calculate path
's md5
to be exactly the same, and thus cmd
and other dependees won't re-run.
> Are you forcing users to run exactly the same commands with md5sum etc (`md5sum` does not exist in Mac OS btw).

No, the user never runs `md5sum`; I only use it as shorthand for whatever `dvc` uses internally (basically `python -c 'import sys, hashlib; print(hashlib.md5(sys.stdin.read().encode("U8")).hexdigest())'`).

> What if a user'd like to run db_table_last_update.js which returns a table last modification timestamp

Yes, exactly what `filter` supports (the timestamp will be md5summed by `dvc`, though, so not particularly human-readable).
About the confusion, I think you were probably talking about something completely different, along the lines of simplifying:

```yaml
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py
deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4
  path: train_model.py
- md5: 809776eaeef854443d39e4f76bd64c84
  path: /tmp/__check_db
outs:
- md5: 47b3781954c760e1bd19ac9e56e141b1
  path: model/mymodel.pkl
```

...

```yaml
md5: 34984b7d96f970d23bb38cda7d7b3f1a
cmd: cat utils.py | python extract_function.py --name check_db > /tmp/__check_db
deps:
- md5: 500f914dd06cf1f588b8e281c8fb3dd8
  path: utils.py
outs:
- md5: 809776eaeef854443d39e4f76bd64c84
  path: /tmp/__check_db
```
@casperdcl thank you for the clarification. It looks like we have less misunderstanding than I thought. And now I understand where your filters come from - filtering a part out of the source code file.

> "user filter" is equivalent to `cat path | filter | md5sum` to generate the md5

Anyway, we need an output from the user-defined function (a good opportunity to introduce the term UDF to DVC 😀). The misunderstanding was whether we should calculate the md5 from the output or not. Any of these approaches can work just fine. I'm good with md5, and we can easily extend this later if needed. One more addition: in any of these approaches DVC has to check the command's return code, as I mentioned before - no need to run the md5 check if `cmd` failed.

> I also prefer to do without `md5_orig` (i.e. instead we should run `filter` unconditionally)

Totally agree. The check/`cmd` has to be lightweight and run on each repro/deps check.
So, it looks like the only disagreement here is the name. I got your analogy with filtering function code out of a code file, but the code check is not the only option here.
Even for source code deps checks, it is not easy to imagine that someone will be filtering actual code lines. I'm sure that even for simple checks we will be dealing with bytecode or different forms of abstract syntax trees (ASTs). I guess this is why I didn't get the filtering analogy, and I'm not sure other users will understand it either.
I suggest using `cmd` but not `filter`. Other ideas are also welcome (see UDF at the beginning).
There is one more subject to discuss - do we need the checksums from user-defined functions in the dvc-file, or can we store them in the DVC DB (`.dvc/state`)?
For those who don't know about the DVC DB: the `.dvc/state` file is a SQLite database with some internal information. Today DVC stores FS timestamps for large data files in the DB. This is an optimization - if the timestamp didn't change for a file, then DVC does not recalculate the md5 for that file. But this md5 calculation has to run at least once in each clone of a repo for each data file.
There is an analogy between the current FS timestamp optimization and this one. And we can generalize this kind of optimization, which should improve the codebase.
If you try to imagine how the dvc-file would look without the DVC DB, you will see something like:

```yaml
deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4
  path: users.csv
  type: fs_timestamp_check
  timestamp: 123456789012
```

That means that DVC needs to check the file timestamp, and if it does not match, then it should calculate a new checksum and save it with the timestamp in the dvc-file.
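A sketch of that check (the real state is a SQLite DB; a dict stands in for it here):

```python
import hashlib
import os

def checksum(path, state):  # state maps path -> (timestamp, md5)
    mtime = os.path.getmtime(path)
    cached = state.get(path)
    if cached and cached[0] == mtime:  # timestamp unchanged: reuse md5
        return cached[1]
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    state[path] = (mtime, md5)         # store new timestamp + checksum
    return md5
```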
The idea of moving the checksum to the DVC DB seems elegant to me. I don't see any concerns for now. @iterative/engineering could you please take a look at this approach? Is there something that I missed?
When you say

> checksums from user defined functions

do you mean the hash of the UDF/`filter` itself (rather than its output)? Similar to the `md5` for `cmd` right now?
In which case, do we need to store checksums for filters at all? I thought we'd agreed (for now) that they should be run unconditionally, so I don't see what said checksums could be used for.
> Do you mean the hash of the UDF/`filter` itself (rather than its output)? Similar to the `md5` for `cmd` right now?

Yes.

> In which case do we need to store checksums for filters at all?

🤔 When you run the UDF (unconditionally) and it returns a checksum, how do you decide whether the dependency changed and the stage needs to be re-run (if you have only the actual dependency checksum but not the UDF checksum)?
Let's take one step back. I don't think we can use the UDF checksum as the file checksum (`md5`), because some DVC algorithms (`status`, `gc`) are based on the assumption that the file checksum is an actual one and can be recalculated purely from the file content. The UDF checksum is just a "proxy" for optimization purposes (this is why I'm thinking it needs to be moved away from the dvc-file). In theory, we can rethink this principle, but that would be a big design change and a really good reason is needed.
Hmm, I see your concern, but we definitely need to save the filter's output checksum (i.e. the subset-of-file checksum) in the DVC-file for reproducibility across systems. How about appearing twice? `gc` probably won't even need patching with this:

```yaml
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py # <-- user cmd
deps:
- md5: 809776eaeef854443d39e4f76bd64c84 # == `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db # <-- user filter
- md5: fd2b169580906cd21350d47b041461b4 # `cat path | md5sum`
  path: utils.py
```
@casperdcl before we jump to cross-system reproducibility, let's make sure we are on the same page... It looks like you missed the UDF checksum in the file (`md5` is not enough, as I said), didn't you?
EDITED: I expected it to have something like `udf_hash`:

```yaml
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py # <-- user cmd
deps:
- md5: 809776eaeef854443d39e4f76bd64c84 # == `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db # <-- user filter
  udf_hash: d5964e5fb17c88bbed0d8
- md5: fd2b169580906cd21350d47b041461b4 # `cat path | md5sum`
  path: utils.py
```

EDITED 2: the last checksum `fd2b169580906cd21350d47b041461b4` is supposed to be removed because of `udf_hash`.
> when you run UDF (unconditionally) and it returns a checksum how you decide if the dependency changed and the stage needs to be re-run

The stage's checksum is the UDF's output's checksum. We run the UDF unconditionally, and anything that depends on it will "see" a different checksum (and thus a "changed stage") only if the checksum has changed.
Am I missing something super basic about DVC's use of checksums? I still don't see a need for `md5_udf` (a.k.a. `udf_hash`).
I got it. I misinterpreted the phrase "How about appearing twice?". You suggest keeping both of the checksums but in different places of the dvc-file. Yeah, it is the same as `udf_hash` but with some duplication:

```yaml
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py # <-- user cmd
deps:
- md5: 809776eaeef854443d39e4f76bd64c84 # == `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db # <-- user filter
- md5: fd2b169580906cd21350d47b041461b4 # `cat path | md5sum`
  path: utils.py
```

or

```yaml
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py # <-- user cmd
deps:
- md5: fd2b169580906cd21350d47b041461b4 # `cat path | md5sum`
  path: utils.py
  udf: python extract_function.py --name check_db # <-- user filter
  udf_hash: 809776eaeef854443d39e4f76bd64c84 # == `cat $path | $filter | md5sum`
```

Ok, it does not matter at this point.

> we definitely need to save the filter's output checksum (i.e. subset of file checksum) in the DVC-file for reproducibility across systems.

Let's return to this cross-system checksum question.
First, storing checksums in a local DVC DB does not affect reproducibility across systems. However, it affects performance a bit - in a new, cloned repo you have to rerun every stage "protected" by a UDF, because these udf-checksums are not transferable from machine to machine. In exchange, we improve dvc-file readability by getting rid of these UDF checksums.
Second, once the build/run-cache is implemented (https://github.com/iterative/dvc/issues/1871#issuecomment-592184042, originally here: #1234), we can transfer the udf-checksums through the cache (and storages) without over-polluting dvc-files.
The biggest concern that I have today regarding the udf-checksums in the DVC DB is how well it fits the current DVC architecture.
> storing checksums in a local DVC DB does not affect reproducibility across systems. However, it affects performance

Yes, that's what I meant; I should have been more precise.
I also think my suggestion (splitting into `path/md5` and `path/md5/filter`, where `path` is duplicated) misses the point, as there would be a dependency on the whole file, which is explicitly what we/@drorata do not want.

> some of DVC algorithms (`status`, `gc`) are based on the assumption that file checksum is an actual one

That's a bit of an issue. I presume going down the route of adding a `filter`/`md5_filter` (AKA `udf`/`udf_hash`) to the current `path` as you suggest would not break `status`/`gc`, while also allowing us to implement special handling of `run`/`repro` to check for and use filters?
This takes us all the way back to my first suggestion (albeit now we fully understand it? :slightly_smiling_face:) https://github.com/iterative/dvc/issues/3439#issuecomment-598973822

```yaml
- md5: 809776eaeef854443d39e4f76bd64c84 # `cat $path | $filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db # user filter
  md5_orig: 500f914dd06cf1f588b8e281c8fb3dd8 # <-- optional, from `md5sum $path`,
                                             #     used to determine if we should bother re-running `$filter`
```

which needs to be tweaked to:

```yaml
- md5_filter: 809776eaeef854443d39e4f76bd64c84 # `cat $path | $filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db # user filter
  md5: 500f914dd06cf1f588b8e281c8fb3dd8 # `md5sum $path`
```

or even:

```yaml
- filter_hash: 'def check_db():\n  pass' # `cat $path | filter`
  filter_type: raw # default `md5`. Using `raw` would skip the md5 sum
                   # (e.g. leave a user-readable timestamp)
  path: utils.py
  filter: python extract_function.py --name check_db # user filter
  md5: 500f914dd06cf1f588b8e281c8fb3dd8 # `md5sum $path`
```
> This takes us all the way back to my first suggestion (albeit now we fully understand it? 🙂) #3439 (comment)
>
> ```yaml
> - md5: 809776eaeef854443d39e4f76bd64c84 # `cat $path | $filter | md5sum`
>   path: utils.py
>   filter: python extract_function.py --name check_db # user filter
>   md5_orig: 500f914dd06cf1f588b8e281c8fb3dd8 # <-- optional, from `md5sum $path`,
>                                              #     used to determine if we should bother re-running `$filter`
> ```

Right, but it needs to be inverted. The key has to be the file checksum, not the UDF's:

```yaml
- md5: 500f914dd06cf1f588b8e281c8fb3dd8 # `md5sum $path`
  path: utils.py
  udf: python extract_function.py --name check_db # user filter
  udf_md5: 809776eaeef854443d39e4f76bd64c84 # `cat $path | $filter | md5sum`
```
Yes that's what I said :)
EDIT: Ah I only swapped the key but not the comment. So not quite what I said, but what I meant to say.
So... fine for me to be assigned unless someone else wants to tackle this.
@casperdcl one of the important questions that's left is where we store the checksum: the dvc-file, the DVC DB, or the run/build-cache (which is not implemented yet). Initially, we can just save it in the dvc-file. But I'd love to hear @iterative/engineering's opinion.
I'd prefer the DVC-file for now.
The analogy between `timestamp`-`path_checksum` and `path_checksum`-`filter_checksum` (ordering important) means that `filter_checksum` shouldn't be put in the DVC DB as far as I can see.
Not sure if this has been discussed, but when you say run/build cache, do you mean having different classes of cache (e.g. not synced with a remote, manually syncable, etc.)?
I don't see any issues with the DVC DB except the warm-up run (the first mandatory run after cloning a repo). If we can get rid of checksums in dvc-files completely - even better.
The run/build cache is about caching run/file checksums globally and storing them in the remote. Today, DVC can make the repro/not-repro decision based only on the last run (not on runs a few commits back in Git history). It is also a way of moving (some of) the checksums away from dvc-files, making them more human-readable. The most recent discussion related to the run/build-cache: https://github.com/iterative/dvc/issues/1871#issuecomment-592184042
I think some users would care about the warm-up. Run/build cache sounds great. Different issue though.
My 2cs:

- `udf_md5` or `filter_md5` - it feels that we can definitely start without them; a very simple one-field `filter: cmd` is enough. Later we can add caching if we see use cases where filters run too long, etc. Probably, I would prefer to have them locally (DB or some .dvc file) vs saving to Git.

Looks cool overall.
A few questions to consider/answer:

- … `import dvc.filters`?
- … (`dvc run`).
- `dvc add`-ed files - effectively any dependency? Probably we won't be supporting external deps/outs first, right?

Had a private chat with @dmpetrov. Summary: we both agree that neither `udf` nor `filter` is perfect.
The biggest issue with `filter`, no matter how much I like it, is that it's too low level. There is a risk the name won't resonate with users looking for a solution for certain use cases. It's probably hard to come up with a full list of those, but think about stuff that is a bit outside of just reading a file and transforming it into something. Let's imagine we want to count the number of records in a DB, or compute a custom hash - those could still technically be called filters to my mind, but are there other human-friendly alternatives?
suggestions (in order of preference):
I think this is also a ranking by word familiarity to non-programmers so really best all round.
Chiming in because I get notifications from this post - "filter" has a very specific meaning to many data scientists that is particular to relational algebra. It looks like you are referring to a broader set of functions being encompassed by `filter`, so I would agree with @shcheklein's last comment.
"modifier"?
The API has changed a bit now in v1 `dvc.yaml`...

```yaml
deps:
- md5: 409a438ee6ac70fc1f911631412325ea
  path: utils.py
```

is now simply

```yaml
deps:
- utils.py
```

This complicates things a bit in terms of how to save the "filter" command, seeing as we now only have a deps list rather than a dict.
Can't really think of anything better than having some sort of "virtual" deps, i.e. a dep which is a command to execute, reading its stdout. Either:

```yaml
deps:
- 'cmd:python extract_function.py --name check_db' # `cmd:` prefix signifies UDF/filter
```

or:

```yaml
deps: []
deps_cmd:
- python extract_function.py --name check_db
```

Or we can go with supporting a mix of string & dict types (maybe easiest?):

```yaml
deps:
- some_other_file.py
- path: utils.py
  udf: python extract_function.py --name check_db
```
Good point re dvc 1.0, @casperdcl

> Or we can go with supporting a mix of string & dict types (maybe easiest?):
>
> ```yaml
> deps:
> - some_other_file.py
> - path: utils.py
>   udf: python extract_function.py --name check_db
> ```

This looks like the best option so far.
People will start to use it to make external things like DB state a dependency.
After discussion with @efiop:

```yaml
path: utils.py
udf: python extract_function.py --name check_db
```

should mean either `python extract_function.py --name check_db utils.py | md5sum -` or just `python extract_function.py --name check_db | md5sum -`. It should not mean `cat utils.py | ...`.
Also, a better spec seems to be:

```yaml
path: utils.py
udf:
  cmd: python extract_function.py --name check_db
  #md5: s0m3h45h # in dvc.lock but not dvc.yaml
```

or even (a similar approach to params):

```yaml
# in dvc.yaml
utils.py:
  cmd: python extract_function.py --name check_db
```

```yaml
# in dvc.lock
path: utils.py
cmd: python extract_function.py --name check_db
md5: s0m3h45h
```
I'm interested in this issue, so forgive me for asking whether it is still under consideration.
For me, the use case would be to filter for the `source` fields of code cells in Jupyter notebooks - to avoid re-running if there are only changes in prose, execution counts, or the like.
Going to +1 this as well. I typically like to organize projects around larger modules with a CLI, and defining dependencies inside them would be helpful. Syntax like this seems to follow the `outs` spec:

```yaml
my_stage:
  deps:
  - mymodule.py:
    - filter: python -c "import inspect, mymodule; print(inspect.getsource(mymodule.myfunc))"
  - the-plan.csv: # I only want to pay attention to a single row of this CSV
    - filter: python -c "import pandas as pd; print(pd.read_csv('the-plan.csv').iloc[3])"
```

So, `stdout` from the `filter` command is then taken to compute the MD5. However, in this case, the path doesn't actually matter. I guess it's there for readability?
Assume we have 4 stages:

- `0` and `2` depend on the function `foo`
- `1` and `3` depend on the function `bar`

In order to reduce code duplication we want to put the utilities `foo` and `bar` in one place; let them be in `utils.py`. So, `utils.py` becomes a dependency of all four stages.

Now... after some time we realize we have to update something in `bar`. So we edit `utils.py`. This, in turn, invalidates all stages. Obviously, a simple solution is to have `utils_0_2.py`, which will be a dependency of `0` and `2`, and similarly `utils_1_3.py`. But this solution conflicts with good practices of coding.

A possible solution could be to adopt some notation like the one used in `pytest` and specify the function inside `utils.py` - for example, something like the sketch below.

This is a continuation of this discussion.
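A hypothetical sketch of such notation, borrowing pytest-style `file::function` node IDs (the same notation later suggested via `dvc extract utils.py::utils.check_db` above):

```yaml
deps:
- utils.py::foo   # hypothetical: depend only on function `foo` inside utils.py
```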