gigascience / gigadb-website

Source code for running GigaDB
http://gigadb.org
GNU General Public License v3.0

Investigate whether we can automatically detect changes in subdir at root of public area #639

Open rija opened 3 years ago

rija commented 3 years ago

Can we use a script attached to an inotify mechanism?

This task is part of Story #600

pli888 commented 3 years ago

The Tencent backup script is executed on a server which contains an NFS mount of the directory containing the GigaDB dataset files. The issue is that inotify does not appear to be compatible with NFS directories - see here. People suggest either running a service on the GigaDB file storage server that brokers inotify events to the backup script server, or using a message queue system.

Some more details on the use of inotify are available in the backup procedure document.

rija commented 3 years ago

Hi @pli888,

I see. To use a message queue, we would need access to the machine exposing the NFS share, which may not be possible.

Another approach is incremental backup:

Instead of copying only the dataset files that we know have changed, we always copy their fixed parent directory and let the sync tool work out which files are new (it uses MD5 to skip files already uploaded).

extract from coscmd docs:

#Command syntax
coscmd upload -r <localpath> <cospath>
#Example: Upload the "doc" folder in D drive to the root directory of COS.
coscmd upload -r D:/doc /
#Upload the folder to the "doc" directory of COS.
coscmd upload -r D:/doc doc
#Upload the folder synchronously while skipping those with the same MD5 value.
coscmd upload -rs D:/doc doc
#Upload the folder synchronously and delete files that are deleted in the "doc" folder in D drive.
coscmd upload -rs --delete D:/doc /
#Ignore uploading files whose extension is .txt or .doc in the "doc" folder in D drive.
coscmd upload -rs D:/doc / --ignore *.txt,*.doc

https://intl.cloud.tencent.com/document/product/436/10976#uploading-a-file-or-folder

extract from rclone docs:

Copy the source to the destination. Doesn't transfer unchanged files, testing by size and modification time or MD5SUM. Doesn't delete files from the destination.

Note that it is always the contents of the directory that is synced, not the directory so when source:path is a directory, it's the contents of source:path that are copied, not the directory name and contents.

https://rclone.org/commands/rclone_copy/
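
The skip-if-unchanged behaviour that both extracts describe can be tried out locally before touching a bucket. What follows is only a rough sketch of the idea, not how coscmd or rclone are implemented: a second local directory stands in for the bucket, the function names `file_md5` and `sync_changed` are made up for illustration, and `md5sum` from GNU coreutils is assumed to be installed.

```shell
#!/usr/bin/env bash
# Sketch only: mimics checksum-based skipping with two local directories.
# "dest" stands in for the bucket; no Tencent COS call is made.
# file_md5 and sync_changed are hypothetical helper names.

# Print the MD5 hex digest of a file (assumes GNU md5sum).
file_md5() {
  md5sum "$1" | cut -d ' ' -f 1
}

# Copy every file under $1 into $2 unless a copy with the same MD5 already
# exists there. Prints the relative path of each file it copies.
# $1 must be given without a trailing slash.
sync_changed() {
  local src=$1 dest=$2 path rel
  find "$src" -type f | sort | while IFS= read -r path; do
    rel=${path#"$src"/}
    if [ -f "$dest/$rel" ] && [ "$(file_md5 "$path")" = "$(file_md5 "$dest/$rel")" ]; then
      continue   # same checksum: nothing to transfer
    fi
    mkdir -p "$(dirname "$dest/$rel")"
    cp "$path" "$dest/$rel"
    printf '%s\n' "$rel"
  done
}
```

Running `sync_changed` twice on an unchanged tree should print nothing the second time. A truncated file in the destination has a different checksum, so it is copied again on the next run.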

pli888 commented 3 years ago

@rija It seems CNGB's giga_backup.sh script (see google doc) performs an incremental backup. Since the destination backup directory in Tencent Cloud never changes in giga_backup.sh, a changed file will overwrite the corresponding file in the backup, so it would not be possible to restore previous versions of that file.

Given that we are unlikely to get access to the machine exposing the NFS share, let's use incremental backup, automated via a cron job, to update GigaDB backups. I think it would be prudent for me to first create some smoke tests, using a test Tencent COS account, to check that incremental backup works as we expect.

rija commented 3 years ago

Hi @pli888, I believe if versioning is switched on in COS (probably worth looking into actual status and documenting whether it's the case or not), then you could restore the versions that have been overwritten. When File Upload Wizard is deployed, it won't be necessary as it backs up the file under a different name if it already exists.

Another scenario to think of is what happens if a backup fails mid-way and a partial file ends up in the destination bucket. Presumably, the next round of backup will overwrite the partial file since the MD5 checksum won't match.

In any case, having smoke tests covering all the use cases is a good idea. We probably should design them so that when we swap between coscmd and rclone they can still run unchanged and therefore validate that we can replace coscmd with another tool when those tests still pass.

With incremental backup in use, a cron job should indeed be enough for automation.
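
As a rough illustration of the cron-driven approach, a wrapper along these lines could be scheduled. It is a sketch under assumptions rather than the project's actual script: the paths, the schedule, the `COSCMD` and `LOCK_DIR` variables and the `run_backup` function are all placeholders invented here. Only `coscmd upload -rs` comes from the coscmd extract above. The directory lock is there so that a slow run is not overlapped by the next scheduled one.

```shell
#!/usr/bin/env bash
# Hypothetical cron wrapper; paths and the schedule are placeholders.
# Example crontab entry (daily at 02:00):
#   0 2 * * * /path/to/incremental_backup.sh >> /path/to/backup.log 2>&1

COSCMD=${COSCMD:-coscmd}                        # override to dry-run, e.g. COSCMD=echo
LOCK_DIR=${LOCK_DIR:-/tmp/gigadb-backup.lock}   # placeholder lock location

# Upload directory $1 to COS path $2, skipping files whose MD5 already matches.
# Refuses to start while a previous run still holds the lock directory.
run_backup() {
  local src=$1 cos_path=$2 status
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "backup already running, skipping this run" >&2
    return 1
  fi
  "$COSCMD" upload -rs "$src" "$cos_path"
  status=$?
  rmdir "$LOCK_DIR"
  return "$status"
}
```

A real script would end with a call such as `run_backup /path/to/public/dir /`. Setting `COSCMD=echo` prints the command that would be run instead of uploading anything.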

pli888 commented 3 years ago

Hi @rija

I believe if versioning is switched on in COS (probably worth looking into actual status and documenting whether it's the case or not), then you could restore the versions that have been overwritten.

It won't be possible to restore different versions of files since versioning is not switched on for the bucket provided to us:

$ coscmd getbucketversioning
Not configured
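
Since the versioning status determines whether overwritten files can be recovered, the check above could be recorded at the start of each backup run. A minimal sketch follows, assuming only what is shown above, namely that `coscmd getbucketversioning` prints `Not configured` for an unversioned bucket. The `warn_if_unversioned` name and the `COSCMD` override are hypothetical, and any other output is simply echoed without being interpreted.

```shell
#!/usr/bin/env bash
# Sketch: log the bucket versioning status before a backup run.
# COSCMD and warn_if_unversioned are hypothetical names.
COSCMD=${COSCMD:-coscmd}

# Returns 1 when the bucket reports "Not configured", 2 when the query itself
# fails, and 0 otherwise (other outputs are echoed as-is, not interpreted).
warn_if_unversioned() {
  local status
  status=$("$COSCMD" getbucketversioning) || {
    echo "could not query versioning status" >&2
    return 2
  }
  if [ "$status" = "Not configured" ]; then
    echo "WARNING: versioning is not configured; overwritten files cannot be restored"
    return 1
  fi
  echo "versioning status: $status"
}
```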

Another scenario to think of is what happens if a backup fails mid-way and a partial file ends up in the destination bucket. Presumably, the next round of backup will overwrite the partial file since the MD5 checksum won't match.

Based on the documentation, if a file upload process fails then the partial file will be overwritten in the next back up process unless versioning has been enabled for the bucket.

In any case, having smoke tests covering all the use cases is a good idea.

To automate smoke tests that check backup functionality, the coscmd tool can be installed in the test container and called from PHP using the system() function to upload files, list files, etc. Another option is to use the Tencent Cloud Object Storage PHP SDK, but that would not replicate the backup process environment as closely as using coscmd. The exit status of each coscmd call made through system() would then be checked with PHPUnit assertions in the smoke tests. Finally, a bucket fixture in Tencent would be created using Terraform. Would this be a good approach?
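
The proposal above is for PHP and PHPUnit. The shell sketch below is not that implementation, only an illustration of the same pattern (run the tool, check its exit status, confirm the file is listed) written so that the tool can be swapped without changing the test. `UPLOAD_CMD`, `LIST_CMD` and `smoke_test` are hypothetical names, and `coscmd list` is assumed to be the listing subcommand.

```shell
#!/usr/bin/env bash
# Sketch of a tool-agnostic smoke test. UPLOAD_CMD and LIST_CMD are
# hypothetical hooks, not options of coscmd or rclone.

UPLOAD_CMD=${UPLOAD_CMD:-"coscmd upload -rs"}
LIST_CMD=${LIST_CMD:-"coscmd list"}   # assumption: coscmd's listing subcommand

# Upload directory $1 to remote path $2, then check that file name $3
# appears in the remote listing. Prints PASS or FAIL and returns 0 or 1.
smoke_test() {
  local src=$1 remote=$2 marker=$3
  # Unquoted on purpose: each hook holds a command plus its flags.
  if ! $UPLOAD_CMD "$src" "$remote"; then
    echo "FAIL: upload exited non-zero"
    return 1
  fi
  if $LIST_CMD "$remote" | grep -qF -- "$marker"; then
    echo "PASS: $marker found in $remote"
  else
    echo "FAIL: $marker missing from $remote"
    return 1
  fi
}
```

Pointing the hooks at local stand-ins, for example `UPLOAD_CMD="cp -r"` and `LIST_CMD="ls -R"`, exercises the test logic without a bucket. Pointing them at another tool's equivalent commands is how the same test could validate a replacement for coscmd.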

rija commented 3 years ago

Hi @pli888,

That sounds good to me. With regards to the location of the smoke test code, I think it's better if the smoke test is co-located with the backup script. So wherever we've put the backup script, next to it we should see the smoke test. (And ideally that location should be on a path that .gitlab-ci.yml can see, if we want to run the smoke test in the CI pipeline.) That way, when we need to look at one, we don't have to make a mental effort to remember where the other is.

pli888 commented 3 years ago

#639 has been moved to "Under Review" since we have ascertained that we cannot use inotify to auto-detect changes in GigaDB's public directory. Instead, we are looking into using incremental backups for backing up GigaDB data onto Tencent COS.