After push to GIN, remote retains folders that were deleted from dataset

jsheunis commented 9 months ago

Origin: Office Hour chatroom message

Description

User reported:

I have an acquisition computer, an analysis computer and a gin repository. The experiment files are in a subset (rawdata) and pushed to gin, then retrieved from the analysis computer.

Now, I have deleted/restructured the data on the acquisition computer (deleted, renamed, moved), saved the changes and pushed, but some of the old folders are still there on Gin. All the files are gone, but the folder structure remains in the Gin repository and no push will remove it.

Besides that, part of this restructuring was changing folder names from "bla folder" to "bla_folder", and I keep getting the old version in my acquisition computer - so I have "bla folder" on the acquisition computer and cannot get the correct one "bla_folder", even if "bla folder" does not exist in the repository anymore.

@adswa asked to confirm that:

the actual files were successfully pushed (i.e., they are on Gin and safely backed up)?
what remains on the acquisition computer are empty directories with outdated names?

User answer:

The actual files are pushed to Gin. The acquisition computer has the original and ideal version of this dataset. The old folders with outdated names remain in Gin, and are present in my analysis computer. I cannot get their correct versions.

I am new to datalad and so far only using it to transfer data (and have version control) this way, acquisition -> gin -> analysis.

So when I am done acquiring new data, I use save, then update --merge and finally push --to gin. Only the rawdata subdataset is present in the acquisition computer.

From the analysis computer, I update and get whichever files I need to work on.

As for the structure of the datasets, I have superset in the analysis computer, this contains the rawdata subdatasets, and other folders containing code, figures, etc. This has its own Gin repo.
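
For reference, the workflow described above can be sketched roughly as follows. This is a reconstruction from the description, not a transcript of the user's session. The sibling name gin comes from the push --to gin call; the function names, the commit message, and the path arguments are hypothetical. Whether --merge is used on the analysis side is not stated; it is assumed here because a plain datalad update only fetches.

```shell
#!/bin/sh
# Acquisition computer: run inside the rawdata dataset.
sync_acquisition() {
    # Record local changes (the message is a placeholder).
    datalad save -m "Add new acquisition data" || return 1
    # Merge anything new from the sibling before publishing.
    datalad update --merge || return 1
    # Publish to the sibling named "gin".
    datalad push --to gin
}

# Analysis computer: run inside the clone.
# "$@" are the paths to retrieve (hypothetical; the thread names none).
fetch_analysis() {
    datalad update --merge || return 1
    datalad get "$@"
}
```

On the analysis computer this would be called as, for example, fetch_analysis path/to/session, where the path is whatever needs to be worked on.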

Next steps

Wait for user feedback to above questions.

TODO (not necessarily to be performed in this order)

[x] Inform OP/Add reference to this issue at origin
[x] Clarifying Qs asked or not needed
[ ] Nature of the issue is understood
[x] Inform OP about resolution

alejandrcastro commented 9 months ago

Thanks again, I will answer the questions here.

More clarifying questions:
So inside of the rawdata subdataset on the acquisition computer you run:
datalad save
datalad update --merge 
datalad push 
correct?
Can I ask why you run the update --merge?

Are you making changes to the raw data subdataset at any other location/clone than the acquisition computer?

The only changes to this dataset are made in the acquisition computer, I was told to always update just in case to avoid conflicts and assumed that worst case scenario this update would just be redundant.

adswa commented 9 months ago

Thanks for the additional info. It's still difficult to piece together precisely what happened. I have made a few attempts to recreate the situation you describe (in a dataset hierarchy with a sibling on Gin, using mv, git mv, and rm on directories or subdatasets, followed by save, update --merge, and push), but I have not observed this issue yet. This is likely because some details needed for a reproducer are still missing. I'm looking forward to investigating this more closely in an office hour, where we can exchange relevant information in real time!

adswa commented 9 months ago

Follow up in the office hour: We got to a productive screensharing session in which everyone got quite confused by what we saw. Here are a few facts:

The acquisition computer (Windows) saves and restructures files and regularly pushes to a Gin sibling; a clone on a Mac pulls updates from Gin.

The Gin web interface has a bug: folders that were created and pushed from a Windows machine, and later renamed and pushed again, do not get removed from the web interface's index. In this minimal reproducer, "folder" was renamed to "newname"; "folder" should no longer exist in the web interface, but lingers around. (Overall: confusing, but with no impact on the clone.)
The local clone on the Mac was in a convoluted state. We couldn't figure out how it got there, but it was a mix of a very updated index, a detached HEAD, and unmerged branches; the Gin confusion likely contributed to that. Also, the repository reported a background garbage collection process that looked a bit shady. And finally, an iCloud backup process created duplicated files (HEAD 2, index 2, ...) in the .git/ directory.
Recloning the repository from Gin fixed the issue.
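
The symptoms seen in the Mac clone can be looked for with plain Git and find. The sketch below is one assumed way to do that, not what was run during the session: the function names are hypothetical, and the file name pattern is only a guess based on the HEAD 2 and index 2 examples above, not a documented iCloud naming rule.

```shell
#!/bin/sh
# List entries under .git whose names contain a space followed by a digit,
# e.g. "HEAD 2" or "index 2" (assumed pattern for sync-tool duplicates).
find_sync_duplicates() {
    find "${1:-.}/.git" -name '* [0-9]*'
}

# Show the branch/HEAD state and any branches not yet merged into HEAD.
# A detached HEAD shows up as "## HEAD (no branch)" in the first line.
inspect_clone() {
    git -C "${1:-.}" status --short --branch
    git -C "${1:-.}" branch --all --no-merged
}
```

Both functions take the path to the clone as their only argument and default to the current directory. If the clone turns out to be in a state like the one described, recloning amounts to running datalad clone with the Gin URL and a fresh target directory, both of which are left as placeholders here.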

We left with the following recommendations:

General:

Install datalad-next and enable it via config on the Windows machine (because a status is fast there)
Consider restructuring the dataset on the acquisition machine
The Gin web interface seems to be the issue; if in doubt, ignore what it shows
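
A minimal sketch of the first recommendation, assuming a pip-based installation and a user-wide (global) Git configuration; the function name is hypothetical, and a conda or system package install would look different.

```shell
#!/bin/sh
# Install the datalad-next extension and tell DataLad to load it on startup.
enable_datalad_next() {
    python -m pip install datalad-next || return 1
    # Adds "next" to the extensions DataLad loads (stored in the global Git config).
    git config --global --add datalad.extensions.load next
}
```

Afterwards, git config --global --get-all datalad.extensions.load should list next among the loaded extensions.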

Helpers we recommended:

Tell iCloud not to touch Git repos
Install tig/gitk/zsh extension (e.g., https://github.com/romkatv/powerlevel10k) to visualize branches and index more easily

psychoinformatics-de / knowledge-base

After push to GIN, remote retains folders that were deleted from dataset #120
