kevin-hanselman / dud

A lightweight CLI tool for versioning data alongside source code and building data pipelines.
https://kevin-hanselman.github.io/dud/
BSD 3-Clause "New" or "Revised" License

Incorrect file type: directory #164

Open indigoviolet opened 1 year ago

indigoviolet commented 1 year ago

Describe the bug: dud status prints a confusing message, "incorrect file type: directory".

System information

Output of dud version:

dud version
0.4.3

Output of uname -srmo:

uname -srmo
Darwin 22.5.0 arm64 Darwin

Steps to Reproduce:


❯ dud init   
Dud project initialized.
See .dud/config.yaml and .dud/rclone.conf to customize the project.

~/dev/dud-test 
❯ mkdir foo && echo "foo" > foo/bar

~/dev/dud-test 
❯ dud stage gen -i foo -o bar -- cat >| test.yaml

~/dev/dud-test 
❯ dud stage add test.yaml                        
Added test.yaml to the index.

~/dev/dud-test 
❯ dud status             
test.yaml  stage definition not checksummed
  foo      incorrect file type: directory (not cached)
  bar      missing and not committed

❯ \cat test.yaml
command: cat
working-dir: .
inputs:
  foo:
    is-dir: true
outputs:
  bar: {}

Expected behavior: No such message should appear. It is also inconsistent: I don't see it in other cases, so I am super confused. Removing is-dir: true doesn't make the message go away either.

kevin-hanselman commented 1 year ago

Good find! This situation brings to light several things that need consideration:

Dud treats every input to a stage as not cached (skip-cache: true) unless that input is owned by another stage. Essentially, a stage owns its outputs and only references its inputs. So in Dud, currently, the recommended way to handle this situation is to add another stage that owns the directory:

$ dud stage gen -o foo | tee foo.yaml
working-dir: .
outputs:
  foo:
    is-dir: true

$ dud stage gen -i foo -o bar -- cat | tee bar.yaml
command: cat
working-dir: .
inputs:
  foo:
    is-dir: true
outputs:
  bar: {}

$ dud stage add *.yaml
Added bar.yaml to the index.
Added foo.yaml to the index.

$ dud status
foo.yaml  stage definition not checksummed
  foo     1x directory, 1x not committed

bar.yaml  stage definition not checksummed
  bar     missing and not committed

I can see how this could be unintuitive. Frankly, the implicit skip-cache: true has always bugged me. I will reevaluate if it's even needed.
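
For readers trying to picture the ownership rule described above, here is a minimal Python sketch. It is not Dud's actual code: the Stage class and the owners and uncached_inputs functions are hypothetical names made up for illustration, and the model ignores everything except input and output paths. It only shows why foo counts as an unowned, uncached input in the original report but not in the two-stage layout.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Hypothetical, simplified stand-in for a Dud stage file."""
    name: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)


def owners(stages):
    """Map each output path to the name of the stage that owns it."""
    owned = {}
    for stage in stages:
        for path in stage.outputs:
            owned[path] = stage.name
    return owned


def uncached_inputs(stages):
    """Return, per stage, the inputs that no stage in the index owns.

    Under the rule described in this thread, these are the inputs that
    are implicitly treated as skip-cache: true.
    """
    owned = owners(stages)
    return {
        stage.name: sorted(p for p in stage.inputs if p not in owned)
        for stage in stages
    }


# Layout from the original report: no stage owns foo.
report = [Stage("test.yaml", inputs={"foo"}, outputs={"bar"})]
print(uncached_inputs(report))

# Recommended layout: foo.yaml owns foo, bar.yaml only references it.
recommended = [
    Stage("foo.yaml", outputs={"foo"}),
    Stage("bar.yaml", inputs={"foo"}, outputs={"bar"}),
]
print(uncached_inputs(recommended))
```

In the first layout the sketch reports foo as an unowned input of test.yaml; in the second, neither stage has an unowned input, which matches the cleaner dud status output shown above.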

What's more, if you commit a stage with a parent-less directory input, you get skip-cache: false behavior 🤦‍♂️:

# removed '-o bar', which doesn't exist
$ dud stage gen -i foo -- cat > test.yaml

$ dud stage add test.yaml

$ dud commit
committing stage test.yaml
  foo                   4 B / 4 B  100%  ?/s  0s total

# This shouldn't have been committed in Dud's current model.
$ tree -a foo
foo
`-- bar -> ../.dud/cache/49/dc870df1de7fd60794cebce449f5ccdae575affaa67a24b62acb03e039db92

So I have some thinking to do about Dud's stage ownership model. There might be an opportunity here to simplify things by removing implicit behavior. In the meantime, I'd recommend following the pattern in my first example code block.

indigoviolet commented 1 year ago

Thank you for the quick response! Your first example worked for me. Besides the things you pointed out, please consider changing the error message (incorrect file type: directory) and adding documentation to explain this.