rstojnic / lazydata

Lazydata: Scalable data dependencies for Python projects
Apache License 2.0

All local file revisions hardlink to the latest revision #4

Closed: zbitouzakaria closed this issue 6 years ago

zbitouzakaria commented 6 years ago

I tested this out by creating and tracking a single file through multiple revisions.

Let's say we have a big_file.csv whose content looks like this:

a, b, c
1, 2, 3

We first track it using this script:

from lazydata import track

# store the file when loading  
import pandas as pd
df = pd.read_csv(track("big_file.csv"))

print("Data shape:" + str(df.shape))

Then change the file content multiple times, for example:

a, b, c
1, 2, 3
4, 5, 6

And run the script again after each revision:

(dev3.5)  ~/test_lazydata > python my_script.py 
LAZYDATA: Tracking new file `big_file.csv`
Data shape:(1, 3)
(dev3.5)  ~/test_lazydata > vim big_file.csv  # changing file
(dev3.5)  ~/test_lazydata > python my_script.py
LAZYDATA: Tracked file `big_file.csv` changed, recording a new version...
Data shape:(2, 3)
(dev3.5)  ~/test_lazydata > vim big_file.csv  # changing file
(dev3.5)  ~/test_lazydata > python my_script.py
LAZYDATA: Tracked file `big_file.csv` changed, recording a new version...
Data shape:(3, 3)
(dev3.5)  ~/test_lazydata > vim big_file.csv  # changing file
(dev3.5)  ~/test_lazydata > python my_script.py
LAZYDATA: Tracked file `big_file.csv` changed, recording a new version...
Data shape:(4, 3)

A simple ls afterwards reveals the problem:

(dev3.5)  ~/test_lazydata > ls -lah
total 20
drwxrwxr-x  2 zakaria zakaria 4096 sept.  5 16:14 .
drwxr-xr-x 56 zakaria zakaria 4096 sept.  5 16:14 ..
-rw-rw-r--  5 zakaria zakaria   44 sept.  5 16:14 big_file.csv
-rw-rw-r--  1 zakaria zakaria  482 sept.  5 16:14 lazydata.yml
-rw-rw-r--  1 zakaria zakaria  158 sept.  5 16:12 my_script.py

Notice the number of hardlinks to big_file.csv (the 5 in the second column of the ls output). There should only be one. What is happening is that all the revisions point to the same file.

You can also check ~/.lazydata/data directly for the content of the different files. It's all the same.

rstojnic commented 6 years ago

Yes, you are right. Because they are hardlinked, editing one file in place will also edit the cached file. One would need to replace one of the files, rather than edit it in place, to get a new inode. I guess this means the files do need to be copied unless the user specifically wants to use hardlinking.
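For readers who want to see the effect in isolation, here is a minimal sketch in plain Python. It does not use lazydata at all; the file name `cached_revision` and the function `demo_hardlink_aliasing` are made up for illustration. It assumes a filesystem that supports hard links.

```python
import os
import tempfile


def demo_hardlink_aliasing():
    """Show that a hard link shares content until the file is replaced."""
    with tempfile.TemporaryDirectory() as tmp:
        working = os.path.join(tmp, "big_file.csv")
        # Hypothetical stand-in for a cached revision; not lazydata's layout.
        cached = os.path.join(tmp, "cached_revision")

        with open(working, "w") as f:
            f.write("a, b, c\n1, 2, 3\n")
        os.link(working, cached)  # second name for the same inode
        links_after_link = os.stat(working).st_nlink

        # In-place edit: the inode is unchanged, so the "cache" changes too.
        with open(working, "a") as f:
            f.write("4, 5, 6\n")
        with open(cached) as f:
            cached_after_edit = f.read()

        # Replacing the file gives the working path a new inode,
        # so the cached name keeps the old content.
        replacement = os.path.join(tmp, "big_file.csv.new")
        with open(replacement, "w") as f:
            f.write("a, b, c\n7, 8, 9\n")
        os.replace(replacement, working)
        with open(cached) as f:
            cached_after_replace = f.read()
        same_inode = os.stat(working).st_ino == os.stat(cached).st_ino

        return links_after_link, cached_after_edit, cached_after_replace, same_inode
```

The first returned value is the link count that `ls -l` prints in its second column. The in-place append shows up in the linked file, while the later replacement does not.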

rstojnic commented 6 years ago

Thanks for this bug report!

I've now switched to copying instead of hardlinking by default. I will probably add hardlinking as an option, and I probably still need to write a test for this specific case.

The latest version, lazydata 1.0.16, should have this bug fixed.
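A regression check for the scenario in this issue might look roughly like the sketch below. It is not lazydata's actual code: `store_revision`, the cache layout, and the use of a SHA-256 digest as the file name are all assumptions made for illustration. The point is only that a copy-based store keeps earlier revisions intact after an in-place edit.

```python
import hashlib
import os
import shutil
import tempfile


def store_revision(path, cache_dir):
    """Hypothetical helper: copy `path` into `cache_dir`, named by its SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large data files are not read into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, digest.hexdigest())
    if not os.path.exists(target):
        shutil.copy2(path, target)  # a real copy, so a separate inode
    return target


def check_revisions_are_independent():
    """Store two revisions of one file and report what the cache holds."""
    with tempfile.TemporaryDirectory() as tmp:
        data = os.path.join(tmp, "big_file.csv")
        cache = os.path.join(tmp, "cache")

        with open(data, "w") as f:
            f.write("a, b, c\n1, 2, 3\n")
        first = store_revision(data, cache)

        # Edit in place, as an editor or script might.
        with open(data, "a") as f:
            f.write("4, 5, 6\n")
        second = store_revision(data, cache)

        with open(first) as f:
            first_content = f.read()
        with open(second) as f:
            second_content = f.read()
        return first != second, first_content, second_content, os.stat(data).st_nlink
```

With copies, the working file keeps a link count of one, and the first stored revision still holds the original two lines after the edit.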

rstojnic commented 6 years ago

... and further fixes in lazydata 1.0.17