kevin-hanselman / dud

A lightweight CLI tool for versioning data alongside source code and building data pipelines.
https://kevin-hanselman.github.io/dud/
BSD 3-Clause "New" or "Revised" License

Wrong symlinks when fetch / checkout is done from symlinked folder #158

Closed thorstenwagner closed 5 months ago

thorstenwagner commented 1 year ago

Assume the following:

/a_dataset.yaml # this is my root directory
/A/a.txt

Now a.txt was updated.

If I run the following commands from root /, everything works fine:

rm /A/a.txt
dud fetch
dud checkout

If I run the following commands from /A/, the symlink to a.txt is broken:

rm a.txt
dud fetch
dud checkout

Best, Thorsten

kevin-hanselman commented 1 year ago

Hi @thorstenwagner. I need more information before I can reproduce this.

/a_dataset.yaml # this is my root directory

Do you mean this is your system's root directory (the absolute path /), or your project's root directory (where the .dud directory lives)?

Now a.txt was updated.

Was it ever committed? How was it updated? Was it committed again after the update?

dud fetch

Is dud fetch necessary to reproduce this issue? If so, what's your remote config? Where's dud push in this scenario?

When I assume the simplest scenario, I can't reproduce this:

dud init
mkdir A
echo foo > A/a.txt
dud stage gen -o A/a.txt | tee a_dataset.yaml
dud stage add a_dataset.yaml
dud commit --copy
echo bar >> A/a.txt
dud commit
tree
rm A/a.txt
dud checkout
tree
cd A
rm a.txt
dud checkout
tree

Output:

Dud project initialized.
See .dud/config.yaml and .dud/rclone.conf to customize the project.
working-dir: .
outputs:
  A/a.txt: {}
Added a_dataset.yaml to the index.
committing stage a_dataset.yaml
  A/a.txt               4 B / 4 B  100%  ?/s  1ms total

committing stage a_dataset.yaml
  A/a.txt               8 B / 8 B  100%  ?/s  1ms total

.
├── A
│   └── a.txt -> ../.dud/cache/ab/b4ca7eb554f159c4970bf8c7c723b724ff9e88cfeb5ee5eec6894f67bcd86b
├── a_dataset.yaml
└── run.sh

2 directories, 3 files
checking out stage a_dataset.yaml
  A/a.txt               1 / 1  100%  ?/s  0s total

.
├── A
│   └── a.txt -> ../.dud/cache/ab/b4ca7eb554f159c4970bf8c7c723b724ff9e88cfeb5ee5eec6894f67bcd86b
├── a_dataset.yaml
└── run.sh

2 directories, 3 files
checking out stage a_dataset.yaml
  A/a.txt               1 / 1  100%  ?/s  0s total

.
└── a.txt -> ../.dud/cache/ab/b4ca7eb554f159c4970bf8c7c723b724ff9e88cfeb5ee5eec6894f67bcd86b

1 directory, 1 file

It would be most helpful if you could provide a Bash script, like this one, which completely reproduces this issue.

thorstenwagner commented 1 year ago

/ is my project directory, not my system root :-) Let's see if I can make it reproducible somehow.

In principle, the file a.txt was updated successfully on a different computer. So all I did was try to fetch the updated data on the other computer.

thorstenwagner commented 1 year ago

To give you an example with the actual data:

I'm interested in the file gt.txt: image

tomotwin_evaluation_dataset is my project directory. Now I delete the file gt.txt and then dud fetch; dud checkout:

image

Looks good. Now I navigate to the dataset directory, delete gt.txt and run fetch+checkout: image

As you can see, the symlink is broken now.

thorstenwagner commented 1 year ago

Kudos to @mstabrin, who made a reproducible example:

mkdir -p dudtest/mydud
ln -rs dudtest/mydud mydud
cd mydud
dud init
mkdir A
echo foo > A/a.txt
dud stage gen -o A/a.txt | tee a_dataset.yaml
dud stage add a_dataset.yaml
dud commit --copy
echo bar >> A/a.txt
dud commit
tree
rm A/a.txt
dud checkout
tree
cd A
rm a.txt
dud checkout
tree

image

thorstenwagner commented 1 year ago

@mstabrin is wondering: Why is the symlink relative to root? ^^

kevin-hanselman commented 1 year ago

Thanks for the example that I can run and reproduce!

Having a project directory be a symlink is an unexpected use case. It's certainly not something I was planning to support, simply because I hadn't thought of it. I will poke at this a bit further, and if a fix is simple I will add it. But I can't guarantee support for a symlinked project directory. Can you help me understand the motivation behind this pattern?

kevin-hanselman commented 1 year ago

My hunch is that this is the issue. From the Go docs (emphasis my own):

Getwd returns a rooted path name corresponding to the current directory. If the current directory can be reached via multiple paths (due to symbolic links), Getwd may return any one of them.

Consequently, I'm not sure how easy this will be to fix. I'll keep looking at it, though.
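
A minimal shell sketch of the situation the Go documentation describes: one directory reachable under two names. The directory names mirror the reproduction script above; the scratch directory and the variable names are illustrative assumptions, and nothing here is taken from Dud's source.

```shell
#!/usr/bin/env bash
set -eu

# Scratch area (hypothetical); canonicalize it so the comparison below is exact.
tmp=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$tmp/dudtest/mydud/A"
ln -s dudtest/mydud "$tmp/mydud"   # mydud -> dudtest/mydud (relative symlink)

# Enter the directory through the symlink, as in the reproduction script.
cd "$tmp/mydud/A"

logical=$(pwd -L)    # the path as it was entered, symlink component kept
physical=$(pwd -P)   # the same directory with all symlinks resolved

echo "logical:  $logical"
echo "physical: $physical"
```

Both strings name the same directory, but they differ in every component after the scratch root. A tool that gets one of them from the working-directory lookup and the other from somewhere else can end up computing relative paths between two inconsistent views of the tree.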

thorstenwagner commented 5 months ago

Would be really nice to see that fixed. It just happened to me again. I keep my training data separate from the actual training runs and symlink the data instead. This time I cd'ed into the symlinked data folder and added some new data. Now it's broken again :-(

By the way, I haven't seen a commit for a while. Is dud still maintained? If not, I need to find an alternative, although I really love dud. It just does what I need ^^

kevin-hanselman commented 5 months ago

Hi @thorstenwagner! As I mentioned above, using symlinked root folders is not recommended in Dud due to a limitation in Go's os.Getwd.

For now I will close this issue, but please use this thread to explain your project setup in more detail. If you do, I should be able to recommend an alternative configuration that mitigates this issue.

Regarding Dud being maintained, I have definitely not abandoned Dud; I still use it myself. But "maintained" means different things to different projects and people. What does maintained mean to you? Right now I am very busy with both work and life, and I haven't been able to commit to Dud (literally 😃) as much as I'd like. This being an open-source project, I am always happy to review and merge pull requests 😃.

thorstenwagner commented 5 months ago

My hunch is that this is the issue. From the Go docs (emphasis my own):

Getwd returns a rooted path name corresponding to the current directory. If the current directory can be reached via multiple paths (due to symbolic links), Getwd may return any one of them.

Consequently, I'm not sure how easy this will be to fix. I'll keep looking at it, though.

One idea for a workaround on Linux systems: you could take the output of Getwd and then use os/exec to call realpath on it?
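
A rough shell sketch of what canonicalizing with realpath would buy. It assumes GNU coreutils realpath (for the -s and --relative-to options) and a hypothetical layout modelled on the reproduction script; "obj" is a stand-in for a cached file. How Dud actually computes its link targets is not shown in this thread, so the "mixed" case below is only one plausible way a link could break, not a diagnosis.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical layout: the project physically lives in dudtest/mydud and is
# also reachable through the symlink mydud.
tmp=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$tmp/dudtest/mydud/A" "$tmp/dudtest/mydud/.dud/cache"
echo foo > "$tmp/dudtest/mydud/.dud/cache/obj"   # stand-in for a cached file
ln -s dudtest/mydud "$tmp/mydud"

cd "$tmp/mydud/A"                                # enter through the symlink
cache="$tmp/dudtest/mydud/.dud/cache/obj"        # physical path of the cache entry

# Mixed view: logical working directory, physical cache path, no symlink
# resolution (-s). The result is purely lexical.
mixed_target=$(realpath -s --relative-to="$PWD" "$cache")
ln -s "$mixed_target" mixed.txt

# Canonical view: resolve the working directory first, then compute the
# relative path between two physical paths.
canonical_target=$(realpath --relative-to="$(realpath "$PWD")" "$cache")
ln -s "$canonical_target" canonical.txt

# -e follows the link, so it is false for a dangling symlink.
if [ -e mixed.txt ]; then mixed=ok; else mixed=broken; fi
if [ -e canonical.txt ]; then canonical=ok; else canonical=broken; fi

echo "mixed:     $mixed_target ($mixed)"
echo "canonical: $canonical_target ($canonical)"
```

The mixed link dangles because the kernel resolves a relative symlink target from the link's physical directory, not from the path that was used to reach it. Inside a Go program, filepath.EvalSymlinks from the standard library performs the same kind of resolution in-process, which would avoid shelling out to realpath.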