koordinates / kart

Distributed version-control for geospatial and tabular data
https://kartproject.org

Attachment support for Kart #583

Open olsen232 opened 2 years ago

olsen232 commented 2 years ago

Kart repos are actually Git repos, which of course are version-controlled folder hierarchies of files. All of these files are hidden away inside the Git Object Database, which is in .kart/objects but which can't be read without the proper Git tools. Embedded inside these version-controlled hierarchies of files are "datasets" - folders which contain lots of files in a particular format, which represent a database table (optionally with geometry data in one of the columns). At present, Kart only exposes that dataset part of the folder hierarchy - any other files in the hierarchy are only visible if you know the right Git tools to see them. But they are there: repositories can already contain these "attachments", some repositories do, and it is possible to add more attachments using a hidden command called kart commit-files.

View existing attachments

Try running git ls-tree -t -r HEAD | grep -v dataset | grep blob to see if there are any attachments at HEAD in a Kart repository.

Summary of steps to be taken

  [x] 1. Detect changes to attachments between two commits during kart diff COMMIT1 COMMIT2

  [x] 2. Modify the internal structure in which diffs are stored so it can hold diffs to attachments, not just diffs to datasets

  [x] 3. Modify the output of kart diff so that it can show these attachment diffs in some way

  [x] 4. Create a file-based working copy, separate from the database working copy that Kart so far has, so that all checkout operations continue to check out database contents to the database but now also attachments to the filesystem (Big job. The following existing Git technologies may be useful here: index files, and sparse checkouts)

  [ ] 4b. Actually check out attached files into the file-based working copy.

  [ ] 5. Detect when attachments have been changed in the user's filesystem when running commands like kart diff and kart status, and include that information in the output

  [ ] 6. Make sure attachment changes are committed when running kart commit

  [ ] 7. Make sure that performing a checkout operation without --discard-changes doesn't discard any changes (it can simply abort itself instead if there are changes that otherwise would be lost), and similarly make sure that performing a checkout operation with --discard-changes does discard all changes, and forcibly checks out the target commit.

  [ ] 8. Extend filters so that they work for attachments and not just datasets (i.e., the user can ask to see the diff of one attachment, just as they can ask to see the diff of just one dataset right now, or commit only some of the diffs)

  [ ] 9. Some attachments can also be stored in databases. An example is that GPKG has a specific database table in which it indicates metadata XML files can be stored. Ideally, these files would be editable and committable from either the database table or the file system, which involves a) doing a three-way diff and finding the odd one out and b) alerting the user if they have made a conflicting change to the one file in two different places.
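
To make the "odd one out" idea in step 9 more concrete, here is a minimal sketch. It assumes each copy of an attachment (the committed version, the copy in the database table, and the copy on the filesystem) can be reduced to a content hash or the raw bytes. The names Resolution and classify_edit are hypothetical and are not part of Kart.

```python
from enum import Enum


class Resolution(Enum):
    """Possible outcomes of comparing the three copies (hypothetical names)."""
    UNCHANGED = "unchanged"
    DB_EDITED = "edited in database"
    FILE_EDITED = "edited on filesystem"
    SAME_EDIT = "same edit in both places"
    CONFLICT = "conflicting edits"


def classify_edit(base, db, fs):
    """Three-way comparison of one attachment.

    base: the committed version, db: the copy in the database table,
    fs: the copy on the filesystem. Values can be hashes or raw bytes.
    """
    db_changed = db != base
    fs_changed = fs != base
    if not db_changed and not fs_changed:
        return Resolution.UNCHANGED
    if db_changed and not fs_changed:
        return Resolution.DB_EDITED      # the database copy is the odd one out
    if fs_changed and not db_changed:
        return Resolution.FILE_EDITED    # the filesystem copy is the odd one out
    if db == fs:
        return Resolution.SAME_EDIT      # both changed, but identically
    return Resolution.CONFLICT           # both changed differently: alert the user
```

Only the CONFLICT case needs user attention; in every other case there is at most one new version to commit.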

olsen232 commented 2 years ago

Datasets with attachments

tests/data/points.tgz and tests/data/polygons.tgz are both gzipped tarballs of Kart repositories. If you extract them, and run the command from above: git ls-tree -t -r HEAD | grep -v dataset | grep blob you can see that they both contain two attachments - something like the following:

100644 blob 00750edc07d6415dcc07ae0351e9397b0222b7ba    .kart.repostructure.version
100644 blob 6f32319194ae3afdac968198d937c1c447be0e55    nz_waca_adjustments/metadata.xml

(The first of these attachments, .kart.repostructure.version, is actually a Kart-internal version marker and doesn't need to be exposed. But the second you can consider a normal attachment.)
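
For anyone who wants to script this, the following is a rough Python equivalent of the grep pipeline above that also hides the internal version marker. It is only a sketch: the function names are made up for illustration, and the set of internal paths is an assumption based on the one marker file mentioned here.

```python
import subprocess

# Assumption: this is the only Kart-internal file to hide; there may be others.
INTERNAL_PATHS = {".kart.repostructure.version"}


def list_attachments(ls_tree_output):
    """Return (path, blob_hash) pairs for blobs that look like attachments.

    Mirrors `grep -v dataset | grep blob`: keeps blob lines that do not
    mention "dataset", then drops known Kart-internal files.
    """
    attachments = []
    for line in ls_tree_output.splitlines():
        parts = line.split(None, 3)  # mode, type, hash, path
        if len(parts) != 4:
            continue
        _mode, obj_type, blob_hash, path = parts
        if obj_type != "blob" or "dataset" in line:
            continue
        if path in INTERNAL_PATHS:
            continue
        attachments.append((path, blob_hash))
    return attachments


def read_ls_tree(repo_dir, commit="HEAD"):
    """Run the same git command as above inside repo_dir and return its output."""
    result = subprocess.run(
        ["git", "ls-tree", "-t", "-r", commit],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling list_attachments(read_ls_tree("path/to/extracted/repo")) on one of the test repos should then return just the metadata.xml entry.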

olsen232 commented 2 years ago

More detail on step 1

A good first step would be to simply detect and log which attachments have changed between two commits. For example, kart show HEAD on the polygons repo above shows that the nz_waca_adjustments dataset was created in the previous repo, but it doesn't show that the nz_waca_adjustments/metadata.xml attachment was created. Before making any change to the diff-related data structures, the budding contributor could try to detect and log any and all attachment changes. Even so, this will require a fair amount of restructuring, since the code right now starts by finding the set of datasets that may have changed, and then runs diffs on all of them separately.
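
As a starting point for the "detect and log" part, the comparison itself is simple once each commit has been reduced to a mapping of attachment path to blob hash. The sketch below assumes such mappings are already available; diff_attachments is a hypothetical helper, not an existing Kart function, and it says nothing about how the result would fit into Kart's diff structures.

```python
def diff_attachments(old, new):
    """Compare two {path: blob_hash} mappings, one per commit.

    Returns {path: (kind, old_hash, new_hash)} where kind is one of
    "inserted", "updated" or "deleted". Unchanged paths are omitted.
    """
    changes = {}
    for path in sorted(old.keys() | new.keys()):
        before = old.get(path)
        after = new.get(path)
        if before == after:
            continue
        if before is None:
            kind = "inserted"
        elif after is None:
            kind = "deleted"
        else:
            kind = "updated"
        changes[path] = (kind, before, after)
    return changes


def log_changes(changes):
    """Print one line per changed attachment."""
    for path, (kind, before, after) in changes.items():
        print(f"{kind}: {path} ({before} -> {after})")
```

For the polygons repo, comparing an empty mapping against the mapping at HEAD would report nz_waca_adjustments/metadata.xml as inserted.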

ritikBhandari commented 2 years ago

Is this to be resolved by contributors selected for GSOC, or can anybody do it?

olsen232 commented 2 years ago

I think that since we've taken the time to write it up as a GSOC project, we should now reserve it for a selected GSOC contributor. If we don't reserve it there's a risk that by the time GSOC starts properly, there'll be no projects left to do - which the GSOC people won't be happy with. [Edit] GSOC has started and we ended up with a single GSOC candidate working on a different project.

subho004 commented 2 years ago

> I think that since we've taken the time to write it up as GSOC project, we should now reserve it for a selected GSOC contributor. If we don't reserve it there's a risk that by the time GSOC starts properly, there'll be no projects left to do - which the GSOC people won't be happy with

I'd like to contribute to this issue for GSOC. I've installed kart on my machine and run some commands. It's mostly similar to git and the normal command line. I'm Subhajit Hait, and I'm pursuing a B.Tech in CSE at NSEC. I'm attaching my LinkedIn profile http://www.linkedin.com/in/subho004 for further information if required. I'd also like to mention that this is my first time in GSOC, but I believe I've taught myself enough skills to be eligible, and if not then I'll learn them. I'm an efficient learner and explorer.

olsen232 commented 1 year ago

If you extract the polygons.tgz test repo, and run kart show you can see that a) that test repo has an attached file, and b) the kart show command successfully shows you the attached file changed during the import. Running kart show --diff-files will include the changed file's entire contents in the output.

tcwilkinson commented 8 months ago

Thanks all for your work on kart - it's an exciting project. I am looking into making it a more central part of my geospatial workflows.

How far along is this feature?

I can see that files other than the kart repositories can be pulled in from a remote with some plain git tweaking, but is it possible yet to edit files / re-stage locally within kart and push them back?

When I try most git commands within a kart repo, I just get an error:

error: index uses kart extension, which we do not understand
fatal: index file corrupt

I see mention of kart commit-files above - Are there hidden kart commands that allow me to stage/commit files (e.g. a README.md)?

olsen232 commented 8 months ago

You are right - attachment support is unfinished, and has been for some time. However, Kart is still maintained and funded and we'll get to it at some stage. These are exciting times for us, there are so many things we are working on... it can take some time for these things to come back around again.

Since some of the pieces are there, you can do the following. Attach a file to your Kart repository: kart commit-files path/to/file=@path/to/file -m "Attaching a file" where the first path/to/file is the path at which the file should be stored in your Kart repo, and the second path/to/file is the path to the file as it is on your filesystem.

Kart won't track that file, and there's no way to stage it, and Kart won't check that file out into any kind of working-copy. You'll need to run that command again manually whenever you want to attach a new version of that file, or delete that file. Speaking of which, here is how you delete a file: kart commit-files path/to/file= --delete-empty-files -m "Attaching a file"
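
Since the command has to be re-run by hand for every new version, it may be worth wrapping it in a small script. The sketch below only builds the argument lists for the two commands shown above; the helper names are hypothetical, and the only Kart syntax it relies on is what is written in this comment.

```python
import shlex
import subprocess


def attach_args(repo_path, local_path, message):
    """Arguments for attaching (or re-attaching) local_path at repo_path."""
    return ["kart", "commit-files", f"{repo_path}=@{local_path}", "-m", message]


def delete_args(repo_path, message):
    """Arguments for deleting the attachment stored at repo_path."""
    return ["kart", "commit-files", f"{repo_path}=", "--delete-empty-files", "-m", message]


def run_in_repo(args, repo_dir="."):
    """Echo the command, then run it inside the given Kart repository."""
    print("$", shlex.join(args))
    subprocess.run(args, cwd=repo_dir, check=True)
```

For example, run_in_repo(attach_args("README.md", "README.md", "Update README")) would commit the current README.md from the filesystem into the repo under the same name.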

Kart commands that show changes - kart diff and kart show - will show changes to files. However, by default they just show the file hashes, e.g.:

--- attachment.pdf
+++ attachment.pdf
- (file c6b9c1b)
+ (file e69de29)

You can get Kart to include the full file differences inline with an extra flag, --diff-files, or you can just use git to look up the files stored at those hashes (there's no Kart command to do so, but the Git command works fine):

git show e69de29 to show the attached file in your terminal

git show e69de29 > path/to/file (or whatever is appropriate for your shell) to write the file to the given path.
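
If you need to do this for more than a file or two, something like the following could pull the hashes out of the default diff output and write a blob to disk. It is a sketch that assumes the output looks exactly like the attachment.pdf example above; real output may contain other kinds of lines (dataset changes, for instance), which this ignores. The function names are hypothetical.

```python
import re
import subprocess

# Matches lines such as "- (file c6b9c1b)" and "+ (file e69de29)".
FILE_LINE = re.compile(r"^([-+]) \(file ([0-9a-f]+)\)$")


def file_hashes(diff_text):
    """Return {path: {"old": hash_or_None, "new": hash_or_None}}."""
    result = {}
    current = None
    for raw in diff_text.splitlines():
        line = raw.strip()
        if line.startswith("--- ") or line.startswith("+++ "):
            current = line[4:].strip()
            result.setdefault(current, {"old": None, "new": None})
            continue
        match = FILE_LINE.match(line)
        if match and current is not None:
            side = "old" if match.group(1) == "-" else "new"
            result[current][side] = match.group(2)
    return result


def write_blob(blob_hash, dest_path, repo_dir="."):
    """Equivalent of `git show <hash> > dest_path`, run inside repo_dir."""
    with open(dest_path, "wb") as out:
        subprocess.run(["git", "show", blob_hash], cwd=repo_dir, stdout=out, check=True)
```

With the example above, file_hashes gives the old and new hash for attachment.pdf, and either one can be passed to write_blob to recover that version of the file.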