psychoinformatics-de / knowledge-base

Sources for the psyinf knowledge base
https://knowledge-base.psychoinformatics.de

After push to GIN, remote retains folders that were deleted from dataset #120

Open jsheunis opened 7 months ago

jsheunis commented 7 months ago

Origin: Office Hour chatroom message

Description

User reported:

I have an acquisition computer, an analysis computer and a gin repository. The experiment files are in a subset (rawdata) and pushed to gin, then retrieved from the analysis computer.

Now, I have deleted/restructured the data on the acquisition computer (deleted, renamed, moved), saved the changes and pushed, but some of the old folders are still there on Gin. All the files are gone, but the folder structure remains in the Gin repository and no push will remove it.

Besides that, part of this restructuring was changing folder names from "bla folder" to "bla_folder", and I keep getting the old version in my acquisition computer - so I have "bla folder" on the acquisition computer and cannot get the correct one "bla_folder", even if "bla folder" does not exist in the repository anymore.

@adswa asked to confirm that:

User answer:

The actual files are pushed to Gin. The acquisition computer has the original and ideal version of this dataset. The old folders with outdated names remain in Gin, and are present in my analysis computer. I cannot get their correct versions.

I am new to datalad and so far only using it to transfer data (and have version control) this way, acquisition -> gin -> analysis.

So when I am done acquiring new data, I use save, then update --merge and finally push --to gin. Only the rawdata subdataset is present in the acquisition computer.

From the analysis computer, I update and get whichever files I need to work on.

As for the structure of the datasets, I have a superset on the analysis computer; it contains the rawdata subdatasets and other folders with code, figures, etc. It has its own Gin repo.

More clarifying questions:

Next steps


TODO (not necessarily to be performed in this order)

alejandrcastro commented 7 months ago

Thanks again, I will answer the questions here.

More clarifying questions:

  • So inside of the rawdata subdataset on the acquisition computer you run:

    datalad save
    datalad update --merge 
    datalad push 

    correct?

  • Can I ask why you run the update --merge?
  • Are you making changes to the raw data subdataset at any other location/clone than the acquisition computer?

The only changes to this dataset are made on the acquisition computer. I was told to always update, just in case, to avoid conflicts, and assumed that in the worst case this update would simply be redundant.

adswa commented 7 months ago

Thanks for the additional info. It's still difficult to piece together precisely what happened. I have made a few attempts at recreating the situation you describe (in a dataset hierarchy with a sibling on Gin, using mv, git mv, and rm on directories or subdatasets, followed by save, update --merge, and push), but I have not observed this issue yet. This is likely because some details needed for a reproducer are still missing. I'm looking forward to investigating this more closely in an office hour, where we can exchange relevant information in real time!
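Since the report is about folders that remain after all of their files are gone, one detail that could be compared between the clones is which folders contain no files at all. The following is a minimal sketch using only the Python standard library. The function name `dirs_without_files` and the choice to skip `.git` are illustrative assumptions; this is not a DataLad feature and not something that was run in this thread. Symlinks (which is how annexed files usually appear on non-Windows systems) are counted as files.

```python
import os


def dirs_without_files(root, skip=(".git",)):
    """Return relative paths of folders under root that hold no files at any depth.

    Entries whose name is in `skip` are ignored entirely.
    Symlinks are treated as files, not followed.
    """
    root = os.fspath(root)
    leftover = []

    def visit(path, is_root):
        # Returns True if a file (or symlink) exists anywhere below `path`.
        with os.scandir(path) as it:
            entries = sorted(it, key=lambda e: e.name)
        found = False
        for entry in entries:
            if entry.name in skip:
                continue
            if entry.is_dir(follow_symlinks=False):
                if visit(entry.path, False):
                    found = True
            else:
                found = True
        if not found and not is_root:
            # Report with forward slashes so output is comparable across systems.
            rel = os.path.relpath(path, root)
            leftover.append(rel.replace(os.sep, "/"))
        return found

    visit(root, True)
    return sorted(leftover)
```

Calling `dirs_without_files("path/to/rawdata")` in each clone (the path here is a placeholder) and comparing the resulting lists would show whether the leftover folders exist only as empty directories on disk.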

adswa commented 7 months ago

Follow-up from the office hour: We had a productive screen-sharing session in which everyone got quite confused by what we saw. Here are a few facts:

The acquisition computer (Windows) saves and restructures files; it regularly pushes to a Gin sibling; a clone on a Mac pulls updates from Gin.

We left with the following recommendations:

General:

Helpers we recommended: