psychoinformatics-de / knowledge-base

Sources for the psyinf knowledge base
https://knowledge-base.psychoinformatics.de

Perform a "backup" of a nested dataset hierarchy on a crippled-fs harddrive #62

Open · mih opened 1 year ago

mih commented 1 year ago

Origin: DataLad office hour chat 2023-05-08

Basically I need to work remotely, so I am trying to clone my entire dataset onto a harddrive and from the harddrive onto my personal laptop. While doing so, I thought it'd be smart to do a "backup" by getting all the content onto the harddrive.

While performing a recursive get of a superdataset clone (multiple subdatasets) onto a crippled-FS external harddrive, the user aborted the command and was left with modified dataset clones.

TODO (not necessarily to be performed in this order)

Capturing relevant pieces from my reply:

Instead of a nested hierarchy of dataset clones holding a single version snapshot of your data, this would actually be a full backup (all data, all versions), and it would not suffer as much from the limitations of your hard-drive's file system (unverified speculation).

The downside is that it won't look as pretty

But this is our standard solution for collaboration (push/pull) using a location that is not ready for git-annex

If you like papers more than online handbooks: https://doi.org/10.1038/s41597-022-01163-2

Roughly summarizing the difference between what you tried and what this different approach would mean:

This means you will work exclusively in your main dataset clone.

The resulting "RIA store" on the harddrive can be added to other existing clones as a remote, and they will be able to pull data from it. You would also be able to continue pushing data (new versions) onto the drive without having to replace or delete anything, until you run out of space, at which point you can detect and clean up versions you no longer need.

RIA stores also support compressed archives -- so your harddrive might last you for quite a while

CAUTION: I am not aware of anyone having actually tried putting a RIA store on an external harddrive with a non-POSIX filesystem. I expect this to work, but there is no hard evidence for this claim.
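
A minimal sketch of this approach, assuming the drive is mounted at a hypothetical /media/backup-drive and reusing the sibling/alias names that come up later in this thread (the concrete steps the user ended up following are spelled out in the comments below):

```sh
# In the main superdataset clone: create a RIA store on the drive and
# register it as a sibling for the superdataset and all subdatasets (-r).
datalad create-sibling-ria -s ria-backup --alias ria-alias --new-store-ok \
    "ria+file:///media/backup-drive/riastore" -r

# Push the dataset hierarchy (Git history plus annexed data) into the store.
datalad push --to ria-backup -r

# Any other machine can then obtain a working clone from the store ...
datalad clone "ria+file:///media/backup-drive/riastore#~ria-alias" my-dataset

# ... and pull (sub)dataset content from it as needed.
datalad -C my-dataset get <path-of-interest>
```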

mslw commented 1 year ago

This had a follow-up in the office hour chat and in today's office hour. Out of multiple subdatasets, most were pushed to the RIA store without issue, but the push failed for two of them:

Push to 'ria-backup':
```
CommandError: 'git -c diff.ignoreSubmodules=none annex copy --batch -z --to ria-backup-storage --fast --json --json-error-messages --json-progress -c annex.dotfiles=true' failed with exitcode 1 under /media/(--redacted--) [info keys: stdout_json]
> to ria-backup-storage...
  content changed while it was being sent
  This could have failed because --fast is enabled. [733 times]
git-annex: copy: 733 failed
```

There was an issue about the same error, https://github.com/datalad/datalad/issues/5613, which was solved by upgrading git-annex (to 10.20220128). Since the current problem was reported with an older git-annex version, we need to wait and see whether an update solves it.
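
Since the linked issue was resolved by a git-annex upgrade, a quick way to check which version is currently installed before retrying the push (sibling name as above; this is a sketch, not taken from the original report):

```sh
# Show the installed git-annex version; the linked issue was fixed by
# upgrading to 10.20220128 or newer.
git annex version

# After upgrading git-annex, retry the failed transfer.
datalad push --to ria-backup -r
```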

jsheunis commented 1 year ago

IMO there are two issues here:

  1. Perform a "backup" of a nested dataset hierarchy on a crippled-fs harddrive (defined by the issue title). I think @mih's response could be the basis of a KBI on this topic, supported by some code examples and ideally a confirmation that this all works on a non-POSIX external harddrive.
  2. the "content changed while it was being sent" issue, which has a thorough writeup in the form of https://github.com/datalad/datalad/issues/5613 and which ideally also solves this user's problem (UPDATE: the user has confirmed that upgrading to the latest version of git-annex solved the problem)
jsheunis commented 1 year ago

Paraphrased steps followed by the user (a consolidated command sketch follows each list of steps):

Creating the backup

  1. Remove any tried-but-failed clones (using methods other than RIA siblings) from the external hard-drive
  2. Create the RIA sibling (name: ria-backup, alias: ria-alias) on the external hard-drive (recursively, since the superdataset has nested subdatasets): datalad create-sibling-ria -s ria-backup --alias ria-alias --new-store-ok ria+file:///<path-to-location-on-external-hard-drive> -r
  3. Push content to the RIA sibling recursively: datalad push --to ria-backup -r
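
Put together as a single snippet (step 1 is ordinary file deletion and omitted here; the angle-bracket placeholder needs to be filled in with the actual store location on the drive):

```sh
# Step 2: create the RIA store on the external hard-drive and register it as
# the sibling "ria-backup" for the superdataset and, recursively, all subdatasets.
datalad create-sibling-ria -s ria-backup --alias ria-alias --new-store-ok \
    "ria+file:///<path-to-location-on-external-hard-drive>" -r

# Step 3: push the superdataset and all subdatasets to that sibling.
datalad push --to ria-backup -r
```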

Cloning from the backup

  1. Connect the hard-drive to a machine on which to make a clone (e.g. pc, laptop)
  2. Clone from the RIA store: `datalad clone ria+file:////riastore#~ria-alias` (JSH comment: is this correctly formatted?)
  3. To install subdatasets, get them as per usual: datalad get <relative-location-to-subdataset>
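
For reference, the clone call from step 2 spelled out, with a placeholder store path and a hypothetical target directory; to the best of my knowledge the correct URL form is ria+file:///<path>#~<alias>, which should answer the formatting question above:

```sh
# Step 2: clone the superdataset from the RIA store via its alias.
datalad clone "ria+file:///<path-to-riastore>#~ria-alias" my-dataset

# Step 3: inside the clone, install subdatasets / fetch content as usual.
cd my-dataset
datalad get <relative-location-to-subdataset>
```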

Fetching updates

  1. If the clone on a pc or laptop grows with commits that need to go back to the data origin via the external hard-drive, they can be pushed to the hard-drive first: datalad push --to ria-backup
  2. Then connect the hard-drive to the original data source, and fetch/merge updates: datalad update --merge
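
As a round trip, assuming both the laptop clone and the original dataset have the store on the drive registered as the sibling ria-backup:

```sh
# On the laptop: push new commits (and annexed data) to the store on the drive.
datalad push --to ria-backup

# Later, with the drive attached to the machine holding the original dataset:
# fetch and merge the updates (-s makes the sibling to update from explicit).
datalad update --merge -s ria-backup
```
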
jsheunis commented 1 year ago

Most of the above is explained in https://handbook.datalad.org/en/latest/beyond_basics/101-147-riastores.html, but I think this compact use case can still stand on its own as a KBI.

christian-monch commented 1 year ago

More traffic on this issue today. datalad status generated the error `Unknown commit identifier: master`. Asked follow-up questions to the OP, no answer yet.

I have a Windows machine and will spend some time looking into RIA stores on NTFS.
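
A possible minimal test for this, assuming an NTFS-formatted drive mounted at a hypothetical /mnt/ntfs-drive (on Windows itself the path in the file URL would need adjusting for a drive letter):

```sh
# Create a throwaway dataset with a single annexed file.
datalad create ria-ntfs-test
cd ria-ntfs-test
echo "hello" > payload.txt
datalad save -m "add test payload"

# Create a RIA store on the NTFS drive and push to it.
datalad create-sibling-ria -s ria-backup --alias ntfs-test --new-store-ok \
    "ria+file:///mnt/ntfs-drive/riastore"
datalad push --to ria-backup

# Clone back from the store and verify that content can be retrieved.
cd ..
datalad clone "ria+file:///mnt/ntfs-drive/riastore#~ntfs-test" ria-ntfs-clone
datalad -C ria-ntfs-clone get payload.txt
```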

jsheunis commented 1 year ago

> More traffic on this issue today. datalad status generated the error `Unknown commit identifier: master`. Asked follow-up questions to the OP, no answer yet.

User reported that this is no longer an issue for them (they no longer need that solution and won't be spending time debugging it). So for the purpose of solving the user's problem, this issue is no longer needed. But for the purpose of writing a KBI, it can remain open, pending a test on a Windows system and the KBI writeup.