Support function specific dependencies

drorata commented 4 years ago

Assume we have 4 stages:

Stages 0 and 2 depend on the function foo
Stages 1 and 3 depend on the function bar

In order to reduce code duplication we want to put the utilities foo and bar in one place; let them be in utils.py. So, utils.py becomes a dependency of all four stages.

Now... after some time we realize we have to update something in bar. So we edit utils.py. This, in turn, invalidate all stages. Obviously, a simple solution is to have utils_0_2.py which will be a dependency of 0 and 2 and similarly utils_1_3.py. But this solution conflicts with good practices of coding.

A possible solution could be to adopt some notation like the one used in pytest and specify the function inside utils.py. For example, something like:

deps:
- md5: 409a438ee6ac70fc1f911631412325ea
  path: utils.py::foo

This is a continuation of this discussion.

pared commented 4 years ago

Ok, so first obstacle I see is that we are trying to make dvc language agnostic, so the specification of how to define this function dependency and how to calculate its checksum would have to be quite abstract. Having said so, it still looks like reasonable feature request. @iterative/engineering what do you think?

casperdcl commented 4 years ago

We'd need specific per-language support (based on file extension) to extract source code of functions/classes and md5 hash them. Python would be a first obvious language to support but in general we'd need one plugin/parser per lang.

shcheklein commented 4 years ago

Alternative to this is to go all Python. We've seen examples when companies use DVC core API to manipulate stage files from Python. It usually means that users do not use dvc run, etc. They use some custom decorators that internally handle stage files, etc, etc.

casperdcl commented 4 years ago

Not sure whether going "all Python" would be a different implementation than supporting other langs. We'd probably still have to check file extensions/types and print out errors if it's a binary file etc...

jorgeorpinel commented 4 years ago

Note: this may be a duplicate of #1577.

@casperdcl I think Ivan means an alternative for @drorata is to integrate with the DVC using the Python package (pip installation) instead of using dvc run and CLI in general. We don't have any docs about it yet though (see iterative/dvc.org/issues/998), and docstrings are sometimes very complete, sometimes inexistent (see #3176), so that's probably a corner solution for rather advanced users.

casperdcl commented 4 years ago

Well we could support a user-defined external filter tool (similar to git):

deps:
- md5: 409a438ee6ac70fc1f911631412325ea
  path: utils.py
  filter: python extract_function.py --name foo

fabiosantoscode commented 4 years ago

I agree with @casperdcl here. I don't think we should be maintaining a whole bunch of tools for this. Maybe we can write just one, for python, and distribute it on pypi to "set the tone" for others to provide other tools with similar arguments (and possibly command name prefixes like dvc-extractfn-python --name foo).

Users of other languages will want to distribute this as a package in their own language. And as long as it's in the PATH and has the same signature, we're fine with it.

Edit: same argument spec and same command prefix aren't mandatory but sure would be nice for user experience.

fabiosantoscode commented 4 years ago

If anyone in the world wants one of these for JavaScript I can write it, it should be 50 lines or less using a pre-existing parser.

elgehelge commented 4 years ago

Are you sure you really want such detailed dependency management @drorata ? Will you remember to update the dependency when somebody comes along and for instance splits the function into two?

def foo(bar):
    return [ba + 1 for ba in bar]

to

def foo(bar):
    return [fo(ba) for ba in bar]

def fo(ba):
    return ba + 1

Do you have an example of "this solution conflicts with good practices of coding" to make it a little less abstract?

casperdcl commented 4 years ago

@elgehelge good point, but I think it's fine for dvc to give users double-edged swords with caveat emptor warnings. If we go down the "support user-defined filter:" route then users could define a really clever filter which even analyses the whole code base and pulls in a function's dependencies. Probably not too hard considering:

it should be 50 lines or less using a pre-existing parser

I see this as part of DVC's "Makefiles-but-better" bit, where there's sub-file granularity.

The question is, does this feature make bad practice so easy for end-users to slip into that DVC should stay away from it? I don't think so.

If nothing else the end-user filter could be a simple dos2unix | sed -i '/^$/d' to minimise re-running things after whitespace changes...

btw I assume filter is a POSIX-style pipe, i.e. cat path | filter | md5sum is the md5 saved in the DVC-file.

CC @efiop

drorata commented 4 years ago

@elgehelge Good point indeed but still, I believe you can face many cases where this conflict between good coding practices and dependencies can happen. For example, assume that stages 1 and 3 need to fetch data from S3 and stages 2 and 3 need to fetch data from MongoDB. Now, you have two utility functions that take care of fetching the data from the two different sources. From the codebase's structure point of view, I would argue that it make sense to keep these two utility functions inside the same utils.py.

dmpetrov commented 4 years ago

Well we could support a user-defined external filter tool (similar to git):

It is a very good idea to unify it to the command level. So, DVC can support dependencies for any language or tables in databases (#1577).

There is also an opportunity to go further to the unification direction and make it general for other types of "conditional" dependencies such as hyper-parameters from params file (#3393) and Python dependencies as a specially important use case (as @shcheklein mentioned).

The unification can look something like this:

deps:
- md5: 97215532a21d13daa5bf932f93a7d518        # <-- regular file
  path: users.csv
- type: cmd                              # <-- command (can be code deps or DB check)
  checksum: 409a438ee6ac70fc1f911631412325ea
  cmd: node db_table_last_update.js --table users
- type: python                         # <-- native python deps
  checksum: 1df9d34e2031bb7e8
  path: utils.py
  function: utils.check_db
- type: params
  params: "depth=17,L1=0.25,eta=0.2,rotate=[10,20,35]"   # <-- hyper-parameter
  path: params.conf

The question is, does this feature make bad practice so easy for end-users to slip into that DVC should stay away from it? I don't think so.

I don't see any issues so far (I just don't like the term filter :) ).

btw I assume filter is a POSIX-style pipe, i.e. cat path | filter | md5sum is the md5 saved in the DVC-file.

Right. cmd type should support pipes.

Will you remember to update the dependency when somebody comes along and for instance splits the function into two?

Very good point, @elgehelge! I don't think it makes sense to go too deep into function code and all the dependencies (it's okay to trigger stage if code was refactored without any semantic changes like your example). Also, for simplicity, we can agree to ignore function dependencies changes (so, no checks for fo in your or version of math.exp() if you use that). In your example, for a proper check you need to depend on both python functions: foo() and fo().

casperdcl commented 4 years ago

I would recommended against a type: python.

- type: python                         # <-- native python deps
  checksum: 1df9d34e2031bb7e8
  path: utils.py
  function: utils.check_db

Should be:

- type: cmd
  checksum: 1df9d34e2031bb7e8
  cmd: dvc extract utils.py::utils.check_db

dmpetrov commented 4 years ago

I would recommended against a type: python.

Type python is a "Syntactic sugar". So, I agree, it is better not to start with it and implement down the road if it's needed.

dmpetrov commented 4 years ago

In addition.... type: params is also a "syntactic sugar". However, it is too fundamental for DVC that it has to be supported in dvc-files. I'd even say it is more important than the command-type.

Some external automation tools (like hyperparameters automation/visualization, CD4ML, DVC itself - dvc repro -p learning_rate=0.04) will use that type from dvc-file and the param files (see #3393).

casperdcl commented 4 years ago

So we're down to 2 options:

type: cmd  # or potentially other things
cmd: custom_filter

or

filter: custom_filter

I still prefer the latter (and as a minor point also prefer the word "filter" since that's what git calls it and what users will easily understand). However I understand the need for a type with the implicit default type: file if unspecified.

So really it seems like there's only one option:

type: file  # optional, assumed `file` by default
cmd: custom_filter  # optional

we don't need an explicit type: cmd as it's inferred from the existence of cmd: custom_filter
we only need to debate whether we call this cmd: or filter:...

dmpetrov commented 4 years ago

We need to specify the way how to make checks and what are the requirements for the user's commands/cmd. DVC needs some checksum (or timestamp or another signal) from cmd to identify if the cmd needs to be reproduced.

The requirement for the user code/cmd can be like:

DVC reproduces the stage with a special/conditional command dependency ('type: cmd') only if:
1. The command returns zero return code - success.
2. The command output string does not match the string (checksum) from the last run. (with some reasonable buffer limit like 4KB)

Also, I quite don't understand the analogy with Git filters. Is it about git-filter-branch? How does this apply for the conditional commands? @casperdcl, I'd appreciate if you could explain.

casperdcl commented 4 years ago

identify if the cmd needs to be reproduced

I was intending it to always be run (unconditionally) but good point that this may be inefficient. Maybe we need a hash both before and after cmd and only re-run cmd if the hash before has changed.

path: utils.py
md5: 1q2w3e4r
cmd: custom_filter  # if `md5` changed, will be run to produce `cmd_md5`
cmd_md5: 7u8i9o0p

or:

path: utils.py
path_md5: 1q2w3e4r
cmd: custom_filter  # if `path_md5` changed, will be run to produce `md5`
md5: 7u8i9o0p

The command returns zero return code - success.

The command output string does not match the string (checksum) from the last run. (with some reasonable buffer limit like 4KB)

I disagree with both options. I'd expect a modification of 2: cat path | cmd | md5sum is the checksum, ergo there is no buffer limit.

analogy with Git filters

I refer to e.g. smudge and clean filters https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes#filters_a

dmpetrov commented 4 years ago

identify if the cmd needs to be reproduced

I was intending it to always be run (unconditionally) but good point that this may be inefficient.

It might be some misunderstanding... The whole point of the specific dependencies (this issue as well as hyper-param issue #3393) - not to run stage when the user's code (cmd) can identify that the result will be the same despite the dependency file change.

Maybe we need a hash both before and after cmd and only re-run cmd if the hash before has changed.

Kinds of...

before run: what you already have in dvc-file (might be empty).
after run: what cmd returns. then you update the old value to the new one. So, there are no need in storing both together.

I disagree with both options. I'd expect a modification of 2: cat path | cmd | md5sum is the checksum, ergo there is no buffer limit.

These are not options. These are 2 steps that you need to run to check if repro of a stage is needed. I didn't get your modification (except the limit which is an implementation detail):

Will you accept the result if cmd failed (see step 1)?
Are you forcing users to run exactly the same commands with md5sum etc (md5sum does not exist in Mac OS btw). What if a user'd like to run db_table_last_update.js which returns a table last modification timestamp?

casperdcl commented 4 years ago

@dmpetrov definite misunderstanding here. In the issue here I'm referring to filter as the a command which extracts a subset of the file defined in path.

This is completely and utterly different from the usual (current) definition of cmd which is why I'd prefer a different name. To be explicit:

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py  # <-- user cmd
deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4  # <-- from `cat path | md5sum`
  path: train_model.py
- md5: 809776eaeef854443d39e4f76bd64c84  # <-- from `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # <-- user filter
  md5_orig: 500f914dd06cf1f588b8e281c8fb3dd8  # <-- optional, from `cat path | md5sum`,
  # used to determine if we should bother re-running `filter`
outs:
- md5: 47b3781954c760e1bd19ac9e56e141b1
  path: model/mymodel.pkl

Where "user filter" is as equivalent to cat path | filter | md5sum to generate the md5 associated with the path. In this case: cat utils.py | python extract_function.py --name check_db | md5sum. The normal (no filter) behaviour (md5 based on path) would be equivalent to filter: cat (or is it actually currently filter: head -c 4M for speed purposes at the cost of correctness? Off-topic question).

I also prefer to do without md5_orig (i.e. instead we should run filter unconditionally) as it's a minimal overhead/inefficiency which can be dealt with in a different issue. Running filter unconditionally will re-calculate path's md5 to be exactly the same, and thus cmd and other dependees won't re-run.

Are you forcing users to run exactly the same commands with md5sum etc (md5sum does not exist in Mac OS btw).

No, the user never runs md5sum, I only use it as shorthand for whatever dvc uses internally (basically python -c 'import sys, hashlib; print(hashlib.md5(sys.stdin.read().encode("U8")).hexdigest())').

What if a user'd like to run db_table_last_update.js which returns a table last modification timestamp

Yes, exactly what filter supports (the timestamp will be md5summed by dvc, though, so not particularly human-readable).

About the confusion, I think you were probably talking about something completely different along the lines of simplifying:

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py
deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4
  path: train_model.py
- md5: 809776eaeef854443d39e4f76bd64c84
  path: /tmp/__check_db
outs:
- md5: 47b3781954c760e1bd19ac9e56e141b1
  path: model/mymodel.pkl

...
md5: 34984b7d96f970d23bb38cda7d7b3f1a
cmd: cat utils.py | python extract_function.py --name check_db  > /tmp/__check_db
deps:
- md5: 500f914dd06cf1f588b8e281c8fb3dd8
  path: utils.py
outs:
- md5: 809776eaeef854443d39e4f76bd64c84
  path: /tmp/__check_db

dmpetrov commented 4 years ago

@casperdcl thank you for the clarification. It looks like we have less misunderstanding than I thought. And now I understand where your filters come from - filter a part of the source code file.

"user filter" is as equivalent to cat path | filter | md5sum to generate the md5

Anyway we need an output from user defined function (good opportunity to introduce term UDF to DVC 😀 ). The misunderstanding was - should we calculate md5 from the output or not. If we do md5:

[pros] it unifies the checksum representations and we don’t need to invent limits (size limit like 4K or format limit like - only printable symbols)
[cons] user can come up with a better, human-readable representation of the checksum: last update timestamp, number of records in DB table/csv-file, file timestamp (this is what DVC does today in internal DB).

Any of this approaches can work just fine. I’m good with md5. And we can easily extend this later if needed. One more addition, in any of these approaches DVC has to check the command return code as I mention before - no need to run md5 check if cmd failed.

I also prefer to do without md5_orig (i.e. instead we should run filter unconditionally)

Totally agree. The check/cmd has to be lightweight and run for each repro/deps check.

So, it looks like the only disagreement here is the name. I got your analogy with filtering function code from a code file but the code check is not the only option here.

Even for source code deps checks, it is not easy to imagine that someone will be filtering an actual code lines. I’m sure that even for a simple checks we will be dealing with bytecode or different forms of abstract-syntax-trees (ASTs). I guess, this is why I didn’t get the filtering analogy and I’m not sure other users will understand the analogy.

I suggest usingcmd but not filter. Other ideas are also welcome (see UDF up the beginning).

dmpetrov commented 4 years ago

There is one more subject to discuss - do we need the checksums from user defined functions in dvc-file or we can store them in the dvc DB (.dvc/state)?

[pros] No additional checksums in dvc-files which improves human readability.
[cons] The changes will be localized in a user machine (in a repo clone). As a result, any user defined function has to run in a new repo clone at least once (the first time) even if no changes in the “global state” (source code, DB-table).

Implementation details

If someone does not know about DVC DB. .dvc/state file is SQLite database with some internal information. Today DVC stores FS timesteps for large data file in the DB. This is an optimization - if timestemp didn’t change for a file then DVC does not calculate md5 for this file. But this md5 calculation has to run at least one in each clone of a repo for each data file.

There is an analogy between the current FS timestemp optimizations and this one. And we can generalize this kind of optimizations which should improve the codebase.

If you try to imagine how dvc-file would look like without DVC DB you will see something like:

deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4
  path: users.csv
  type: fs_timestemp_check
  timestemp: 123456789012

That means that DVC needs to check file timestemp and if it does not match then it should calculate a new checksum and save it with the timestemp in dvc-file.

The idea of moving the checksum to DVC DB seems elegant to me. I don’t see any concerns for now. @iterative/engineering could you please take a look at this approach. Is there something that I missed?

casperdcl commented 4 years ago

When you say

checksums from user defined functions

Do you mean the hash of the UDF/filter itself (rather than its output)? Similar to the md5 for cmd right now?

In which case do we need to store checksums for filters at all? I thought we'd agreed (for now) that they should be run unconditionally so I don't see what said checksums could be used for.

dmpetrov commented 4 years ago

Do you mean the hash of the UDF/filter itself (rather than its output)? Similar to the md5 for cmd right now?

Yes.

In which case do we need to store checksums for filters at all?

🤔 when you run UDF (unconditionally) and it returns a checksum how you decide if the dependency changed and the stage needs to be re-run (if you have only an actual dependency checksum but not UDF checksum)?

Let's take one step back. I don't think we can use UDF checksum as the file checksum (md5) because some of DVC algorithms (status, gc) are based on the assumption that file checksum is an actual one and can be recalculated purely based on the file content. UDF checksum is just a "proxy" for optimization purposes (this is why I'm thinking that it needs to be moved away from dvc-file). In theory, we can rethink this principle but it can be a big design change and a really good reason is needed.

casperdcl commented 4 years ago

Hmm I see your concern, but we definitely need to save the filter's output checksum (i.e. subset of file checksum) in the DVC-file for reproducibility across systems. How about appearing twice? gc probably won't even need patching with this:

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py  # <-- user cmd
deps:
- md5: 809776eaeef854443d39e4f76bd64c84  # == `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # <-- user filter
- md5: fd2b169580906cd21350d47b041461b4  # `cat path | md5sum`
  path: utils.py

dmpetrov commented 4 years ago

@casperdcl before we jump to cross-system reproducibility let's make sure we are on the same page... It looks like you missed the UDF checksum from the file (md5 is not enough as I said). Didn't you?

EDITED: I expected it to have something like udf_hash:

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py  # <-- user cmd
deps:
- md5: 809776eaeef854443d39e4f76bd64c84  # == `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # <-- user filter
  udf_hash: d5964e5fb17c88bbed0d8
- md5: fd2b169580906cd21350d47b041461b4  # `cat path | md5sum`
  path: utils.py

EDITED 2: the last checksum fd2b169580906cd21350d47b041461b4 is supposed to be removed because of udf_hash

casperdcl commented 4 years ago

when you run UDF (unconditionally) and it returns a checksum how you decide if the dependency changed and the stage needs to be re-run

The stage's checksum is the UDF's output's checksum. We run the UDF unconditionally and anything that depends on it will "see" a different checksum (and thus a "changed stage") only if the checksum has changed.

Am I missing something super basic about DVC's use of checksums? Still don't see a need for md5_udf (a.k.a. udf_hash)

dmpetrov commented 4 years ago

I got it. I misinterpreted the phrase "How about appearing twice?". You suggest to keep both of the checksums but in different places of the dvc-file. Yeah, it is the same as udf_hash but with some duplications:

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py  # <-- user cmd
deps:
- md5: 809776eaeef854443d39e4f76bd64c84  # == `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # <-- user filter
- md5: fd2b169580906cd21350d47b041461b4  # `cat path | md5sum`
  path: utils.py

or

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py  # <-- user cmd
deps:
- md5: fd2b169580906cd21350d47b041461b4  # `cat path | md5sum`
  path: utils.py
  udf: python extract_function.py --name check_db  # <-- user filter
  udf_hash: 809776eaeef854443d39e4f76bd64c84  # == `cat path | filter | md5sum`

Ok, it does not matter at this point.

dmpetrov commented 4 years ago

we definitely need to save the filter's output checksum (i.e. subset of file checksum) in the DVC-file for reproducibility across systems.

Let's return back to this cross-systems checksum question.

First, storing checksums in a local DVC DB does not affect reproducibility across systems. However, it affects performance a bit - in a new, cloned repo you have to rerun every stage "protected" by UDF because these udf-checksums are not transferable from machine to machine. In exchange, we improve dvc-files readability by getting rid of these UDF checksums.

Second, once build/run-cache is implemented (https://github.com/iterative/dvc/issues/1871#issuecomment-592184042, originally is here #1234) we can transfer the udf-checksums through the cache (and storages) without over polluting dvc-files.

The biggest concern that I have today regarding the udf-checksums in DVC DB - how well it fits the current DVC architecture.

casperdcl commented 4 years ago

storing checksums in a local DVC DB does not affect reproducibility across systems. However, it affects performance

Yes, that's what I meant; should have been more precise.

I also think my suggestion (splitting into path/md5 and path/md5/filter where path is duplicated) misses the point as there would be a dependency on the whole file which is explicitly what we/@drorata do not want.

some of DVC algorithms (status, gc) are based on the assumption that file checksum is an actual one

That's a bit of an issue. I presume going down the route of adding a filter/md5_filter (AKA udf/udf_hash) to the current path as you suggest would not break status/gc while also allowing us to implement special handling of run/repro to check for and use filters?

This takes us all the way back to my first suggestion (albeit now we fully understand it? :slightly_smiling_face:) https://github.com/iterative/dvc/issues/3439#issuecomment-598973822

- md5: 809776eaeef854443d39e4f76bd64c84  # `cat $path | $filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # user filter
  md5_orig: 500f914dd06cf1f588b8e281c8fb3dd8  # <-- optional, from `md5sum $path`,
  # used to determine if we should bother re-running `$filter`

which needs to be tweaked to:

- md5_filter: 809776eaeef854443d39e4f76bd64c84  # `cat $path | $filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # user filter
  md5: 500f914dd06cf1f588b8e281c8fb3dd8  # `md5sum $path`

or even:

- filter_hash: 'def check_db():\n    pass'  # `cat $path | filter`
  filter_type: raw  # default `md5`. Using `raw` would skip the md5 sum,
  # (e.g. leave user-readable time stamp)
  path: utils.py
  filter: python extract_function.py --name check_db  # user filter
  md5: 500f914dd06cf1f588b8e281c8fb3dd8  # `md5sum $path`

dmpetrov commented 4 years ago

This takes us all the way back to my first suggestion (albeit now we fully understand it? 🙂) #3439 (comment)

- md5: 809776eaeef854443d39e4f76bd64c84  # `cat $path | $filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # user filter
  md5_orig: 500f914dd06cf1f588b8e281c8fb3dd8  # <-- optional, from `md5sum $path`,
  # used to determine if we should bother re-running `$filter`

Right, but it needs to be inverted. The key has to be the file checksum, not UDF.

- md5: 500f914dd06cf1f588b8e281c8fb3dd8  # `md5sum $path
  path: utils.py
  udf: python extract_function.py --name check_db  # user filter
  udf_md5: 809776eaeef854443d39e4f76bd64c84  # `cat $path | $filter | md5sum`

casperdcl commented 4 years ago

Yes that's what I said :)

EDIT: Ah I only swapped the key but not the comment. So not quite what I said, but what I meant to say.

So... fine for me to be assigned unless someone else wants to tackle this.

dmpetrov commented 4 years ago

@casperdcl one of the important questions that left is where we store the checksum: dvc-file, DVC DB or run/build-cache (which is not implemented yet). Initially, we can just save in dvc-file. But I'd love to hear @iterative/engineering opinion.

casperdcl commented 4 years ago

I'd prefer DVC-file for now.

The analogy between timestamp-path_checksum and path_checksum-filter_checksum (ordering important) means that filter_checksum shouldn't be put in DVC DB as far as I can see.

Not sure if this has been discussed but when you say run/build cache, do you mean having different classes of cache (e.g. not synced with a remote, manually syncable, etc)?

dmpetrov commented 4 years ago

I don't see any issues with DVC DB except the warm-up run (the first mandatory run after repo cloning). If we can get rid of checksums in dvc-files completely - even better.

run/build cache is about caching run/files checksum globally and storing them in remote. Today, DVC can make repro/not-repro decision based only on the last run only (not runs a few commits back in Git history). Also, this is the way of moving (some of) checksums away from dvc-files and make it more human-readable. The most recent discussion related to run/build-cache: https://github.com/iterative/dvc/issues/1871#issuecomment-592184042

casperdcl commented 4 years ago

I think some users would care about the warm-up. Run/build cache sounds great. Different issue though.

shcheklein commented 4 years ago

My 2cs:

it also reminds me more of a filter then udf (mostly because it's kinda a pipe, not a function definition that could be applied in other place)
also, not clear why would we need to have udf_md5 or filter_md5 - it feels that we can definitely start without them - very simple one field filter: cmd is enough. Later we can add caching if we see use cases when filters run too long, etc. Probably, I would prefer to have them locally (DB or some ./dvc file) vs saving to Git.

Looks cool overall.

shcheklein commented 4 years ago

A few questions to consider, answer:

does it make sense to start providing some very common functions as import dvc.filters?
need to think about forward compatibility a bit - what will be happening to the new project, with new DVC-files that will be running on an old DVC-version - I hope some schema exception.
are we going to provide some CLI support for this? (dvc run).
do we need this for dvc add-ed files - effectively any dependency? Probably we won't be supporting external deps/outs first, right?

shcheklein commented 4 years ago

Had a private chat with @dmpetrov . Summary we both agree that neither udf, nor filter are perfect.

The biggest issue with filter, no matter how much I like it, that it's too low level. There is a risk the name won't resonate with users looking for a solution for certain use cases. Probably, hard to come гз with a full list of those, but think about stuff that is a bit outside of just reading the file and transforming it into something. Let's imagine we want to count number of records in a DB, or custom hash - those are still technically could be called filters to my mind, but are there other human friendly alternatives?

casperdcl commented 4 years ago

suggestions (in order of preference):

filter
filter
filter
map(per)
trans(form(er))
extract(or)
morph(er)
hasher
process(or)
cmd

I think this is also a ranking by word familiarity to non-programmers so really best all round.

elleobrien commented 4 years ago

chiming in bc i get notifications from this post- "filter" has a very specific meaning to many data scientists that is particular to relational algebra. it looks like you are referring to a broader set of functions being encompassed in filter, so I would agree with @shcheklein's last comment

elgehelge commented 4 years ago

"modifier"?

casperdcl commented 4 years ago

The API has changed a bit now in v1 dvc.yaml...

deps:
- md5: 409a438ee6ac70fc1f911631412325ea
  path: utils.py

is now simply

deps:
- utils.py

This complicates things a bit in terms of how to save the "filter" command seeing as we now only have a deps list rather than dict.

Can't really think of anything better than having some sort of "virtual" deps, i.e. a dep which is a command to execute and read stdout. Either:

deps:
- 'cmd:python extract_function.py --name check_db' # `cmd:` prefix signifies UDF/filter

or:

deps: []
deps_cmd:
- python extract_function.py --name check_db

Or we can go with supporting a mix of string & dict types (maybe easiest?):

deps:
- some_other_file.py
- path: utils.py
  udf: python extract_function.py --name check_db

dmpetrov commented 4 years ago

Good point re dvc 1.0, @casperdcl

Or we can go with supporting a mix of string & dict types (maybe easiest?):
deps:
- some_other_file.py
- path: utils.py
  udf: python extract_function.py --name check_db

This looks like the best option so far.

Suor commented 4 years ago

People will start to use it to make some external things like db state a dependency.

casperdcl commented 4 years ago

after discussion with @efiop:

path: utils.py
udf: python extract_function.py --name check_db

should mean either python extract_function.py --name check_db utils.py | md5sum - or just python extract_function.py --name check_db | md5sum -. It should not mean cat utils.py | ....

Also, a better spec seems to be:

path: utils.py
udf:
  cmd: python extract_function.py --name check_db
  #md5: s0m3h45h  # in dvc.lock but not dvc.yaml

or even (similar approach to params):

# in dvc.yaml
utils.py:
  cmd: python extract_function.py --name check_db

# in dvc.lock
path: utils.py
cmd: python extract_function.py --name check_db
md5: s0m3h45h

ijlyttle commented 1 year ago

I'm interested in this issue, so forgive me for asking if it is still under consideration.

For me, the use case would be to filter for the source fields of code cells in Jupyter notebooks - to avoid re-running if there are only changes in prose, execution-counts, or the like.

petebachant commented 2 months ago

Going to +1 this as well. I typically like to organize projects around larger modules with a CLI, and defining dependencies inside would be helpful. Syntax like this seems to follow similar to the outs spec:

my_stage:
  deps:
    - mymodule.py:
        - filter: python -c "import inspect, mymodule; print(inspect.getsource(mymodule.myfunc))"
    - the-plan.csv:  # I only want to pay attention to a single row of this CSV
        - filter: python -c "import pandas as pd; print(pd.read_csv('the-plan.csv').iloc[3])"

So, stdout from the filter command is then taken to compute the MD5. However, in this case, the path doesn't actually matter. I guess it's there for readability?

iterative / dvc

Support function specific dependencies #3439

Implementation details