gigascience / gigadb-website

Source code for running GigaDB
http://gigadb.org
GNU General Public License v3.0

Create documentation for using excel ingestion and post upload tools on bastion server #2018

Closed pli888 closed 1 week ago

pli888 commented 2 months ago

Pull request for issue: #1968

This is a pull request implementing the functionalities described in the How have functionalities been implemented? section below.

How to test?

Dev environment

Perform the steps in the Test curator tools in dev environment section of docs/CURATOR_TOOLS_BASTION.md.

Staging environment

Deploy your staging environment as usual:

$ cd ops/infrastructure/envs/staging
# Copy terraform files to staging environment
$ ../../../scripts/tf_init.sh --project gigascience/forks/<your>-gigadb-website --env staging --region <your-region> --ssh-key <path to your pem>

# Provision with Terraform 
$ terraform plan
$ terraform apply
$ terraform refresh

# Copy ansible files
$ ../../../scripts/ansible_init.sh --env staging

# Provision webapp ec2 server with ansible
$ env TF_KEY_NAME=private_ip OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES ansible-playbook -i ../../inventories webapp_playbook.yml -e "gigadb_env=staging"

When you deploy your bastion server, the following command restores your staging RDS to a date in the past that does not yet contain the dataset whose Excel file will be ingested for testing:

# Restore RDS to specific backupDate using databaseReset.sh with backup file sourced from pli888's EC2 FTP server
env OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES ansible-playbook -i ../../inventories bastion_playbook.yml -e "backupDate=20240101" -e "gigadb_env=staging"

The gigadb-database-backup S3 bucket already contains the backup files that the above command uses.

You will also need to create a curator user account on your bastion server, e.g.

# Create curator user peterl on bastion server
ansible-playbook -i ../../inventories users_playbook.yml -e "newuser=peterl" -e "credentials_csv_path=~/Desktop/credentials-peterl.csv" -e "gigadb_env=staging"

Run the GitLab CI/CD pipeline for this feature branch.

To test Excel file ingestion and post-upload operations, perform the steps described in the Staging/Live section in docs/CURATOR_TOOLS_BASTION.md.

These steps have been tested to work as both the centos user and a curator user.

How have functionalities been implemented?

Remove use of S3 gigadb-datasets-metadata bucket

The md5.sh script no longer copies doi.md5 and doi.filesizes into the S3 gigadb-datasets-metadata bucket. Instead, these files are copied into the /var/share/gigadb/metadata/ directory on the bastion server. On your dev environment, files-metadata-console/docker-compose.yml maps the /var/share/gigadb/metadata/ directory onto files-metadata-console/tests/_data/var/share/gigadb/metadata/.

The DatasetFilesUpdater component class in files-metadata-console has been updated to read file md5 values and file sizes from /var/share/gigadb/metadata/doi.md5 and /var/share/gigadb/metadata/doi.filesizes as well.
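For reference, a sketch of what these two metadata files contain — the hash and size values below are invented, and the file names are taken from the test fixtures used later in this thread (doi.filesizes is tab-delimited, see below):

# Illustrative contents only; real values come from md5sum and the file system
$ head -1 /var/share/gigadb/metadata/102498.md5
4d36467092652332f965a2c55ac938db  DLPFC_69_72_VNS_results.csv
$ head -1 /var/share/gigadb/metadata/102498.filesizes
348160	DLPFC_69_72_VNS_results.csv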

Move tools into files metadata console

Md5 value update functionality used to live in FilesCommand in the protected directory. This functionality has been moved into the files metadata console, which involved adding an updateMD5FileAttributes() function to the DatasetFilesUpdater component class. The UpdateController class in the files metadata console also has a new actionMd5Values() function which calls updateMD5FileAttributes(). The File class in gigadb/app/models has been updated with an updateMd5Checksum($md5_value) function.
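Assuming Yii's default controller/action route naming and the standard ./yii console entry script (both assumptions, as is the --doi flag), the new action would be invoked along these lines:

# Hypothetical invocation of actionMd5Values() via the console
$ cd gigadb/app/tools/files-metadata-console
$ docker-compose run --rm files-metadata-console ./yii update/md5-values --doi 102498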

filesMetaToDb.sh and postUpload.sh have been moved from the excel spreadsheet uploader tool to the files metadata console because they relate to updating file metadata, not to ingestion of Excel files.

Improved error handling in md5 value and file size update functionality

Error handling has been improved so that any file listed in doi.md5 or doi.filesizes but not found in the database is reported in the console. Processing of the remaining files in doi.md5 and doi.filesizes now continues if a file-not-found error is encountered, whereas before it did not.

A file size update bug has been fixed that was due to the wrong delimiter being used in doi.filesizes files, which should be tab-delimited. The md5.sh script now replaces the single space separating file size and file name with a tab character.
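A one-line sketch of the kind of substitution md5.sh now performs (the exact command in the script may differ):

# Replace the first space between file size and file name with a tab (GNU sed)
$ sed -i 's/ /\t/' 102498.filesizes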

Fully working postUpload.sh

The postUpload.sh script is fully working, from readme file creation, through copying the readme file into the user dropbox and calculating md5 values and file sizes, to updating the md5 values and file sizes in the database.
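Assuming postUpload is installed alongside the other tools in /usr/local/bin and mirrors createReadme's --doi interface (both assumptions), an end-to-end run from a user dropbox would look something like:

# Hypothetical end-to-end invocation of the post-upload workflow
$ cd /share/dropbox/user5
$ sudo /usr/local/bin/postUpload --doi 102498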

Any changes to automated tests?

Unit and functional tests in files metadata console have been updated. These tests can be executed as follows:

# Run unit tests
$ cd gigadb/app/tools/files-metadata-console
$ docker-compose run --rm files-metadata-console ./vendor/codeception/codeception/codecept run --debug tests/unit/

# Run functional tests
$ docker-compose run --rm files-metadata-console ./vendor/codeception/codeception/codecept run --debug tests/functional/

There is a new bats test for postUpload.sh:

$ bats postUpload

Any changes to documentation?

Documentation for using the command-line tools on the bastion server has been created in docs/CURATOR_TOOLS_BASTION.md. Curators can copy and paste this documentation into their Giga internet curator documentation page.

pli888 commented 1 month ago

@kencho51

chore: It should be mentioned in CURATOR_TOOLS_BASTION.md that the GitLab pipeline has to have completed and passed before executing the tools in production; otherwise some images, e.g. production-files-metadata-console:staging, are not updated and you get errors like this:

The documentation in CURATOR_TOOLS_BASTION.md was written for curators, so I would prefer not to mention that the GitLab pipeline has to be completed. From a curator's point of view, we assume the GitLab pipeline has run to completion and the bastion server contains the latest version of the files metadata console.

question(non-blocking): Do we need to update the usage at here: https://sites.google.com/d/1gaSPM1UIlCPWwgDzNbNQpVMAYf9kYZeO/p/1ecqjfPusM9yp4wovHZF70ypHvbcMcHRC/edit?pli=1&authuser=1

Yes we do. I will do this when this PR has been approved and merged.

question: Is it necessary to create the /tests/var/share/gigadb/metadata directory for the docker command? Can it be replaced by tests/_data/dropbox/$user? These two directories seem duplicated.

/tests/_data/var/share/gigadb/metadata and tests/_data/dropbox/$user are two different directories. tests/_data/dropbox/$user represents a user dropbox. /tests/_data/var/share/gigadb/metadata represents the directory on the bastion server where the .filesizes and .md5 files are stored for the files metadata console tool to use when updating dataset file md5 values and file sizes in the database.

only1chunts commented 1 month ago

Question:

Not only are the 102498.md5 and 102498.filesizes files created in the user5 dropbox but they are also in the /var/share/gigadb/metadata/ directory on the bastion server.

Does this mean that there are two different copies of the filesize and md5 values? If so, I thought we had previously agreed that there would be only one version of those files. There is a real danger that curators will not realise a second hidden copy exists somewhere other than the userbox, and therefore not know that any manual changes made to those files in the userbox will be useless. Do we really need two copies of those files? If the scripts absolutely require those files in the /var/share/gigadb/metadata/ directory then we should not have a redundant copy in the userbox. However, if the scripts could use the userbox copy instead of the ones in the /var/share/gigadb/metadata/ directory then we should adjust the scripts and remove the need for the hidden copy there.

pli888 commented 1 month ago

As discussed in the sprint status meeting, the database update of file metadata now defaults to using the doi.filesizes and doi.md5 files located in user dropbox directories.
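In other words, after this change the metadata files sit next to the dataset files they describe, e.g.:

# doi.md5 and doi.filesizes now live in the user dropbox itself
$ ls /share/dropbox/user5
102498.filesizes  102498.md5  DLPFC_69_72_VNS_results.csv  E2_VNS_Ground_Truth.csv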

pli888 commented 3 weeks ago

question: What does "Send gigadb.org link" on the diagram mean?

This is the step where I send the link to the dataset admin page to the curators after the ingestion process is complete. The new diagram replaces this step with "Go to dataset admin page on gigadb.org".

issue (non-blocking): it would be better if the doc sections were titled exactly as the legend under the step numbers in the diagram (1. datasetupload, 2. createreadme, ...)

This has been implemented in gigadb-website/docs/curators/CURATOR_TOOLS_BASTION.md and gigadb-website/docs/developers/CURATOR_TOOLS_BASTION.md.

suggestion: Have two docs, one in docs/developers/ and one in docs/curators/ with only the relevant information for each.

This has been implemented in gigadb-website/docs/curators/CURATOR_TOOLS_BASTION.md and gigadb-website/docs/developers/CURATOR_TOOLS_BASTION.md.

issue: in the staging and live deployment, there is no need to describe how people log in, as everyone does it their own way (the ssh keys are not always in the same place, some ssh from Windows, some from Mac)
suggestion: remove the parts where there is more than one way to do it.

Done in gigadb-website/docs/curators/CURATOR_TOOLS_BASTION.md.

issue: "To continue with the remainder of this workflow, create a directory at /share/dropbox/user5 using the centos user. This user5 directory will act as an example user dropbox. Add the following two files into the user5 directory:"

This is now only done in gigadb-website/docs/developers/CURATOR_TOOLS_BASTION.md for developer testing.

issue: "The readme file in the uploadDir directory needs to be copied into the user dropbox:"
I can foresee curators making mistakes and/or grumbling about that task
suggestion: can we have all the tool-created artefacts that matter to curators generated in the same directory that they run the script in?

Now, all command-line tools work from a user dropbox directory, and the workflow diagram has been updated to reflect this change. The tools create their artefacts, e.g. readme files, doi.filesizes, etc., in a container directory which is mounted to the current working directory on the host; this pattern is what allows them to work from a user dropbox directory.
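A minimal sketch of that mount pattern — the image name, mount point and entry command here are all illustrative:

# Bind the curator's current directory into the container so artefacts land
# in the directory the tool was run from
$ docker run --rm -v "$(pwd)":/output files-metadata-console createReadme --doi 102498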

nitpick: the MD5 calculation can take a very long time. I wonder whether we can use Gum to display a spinner to continuously show feedback to the curators, so they don't have to wonder whether the tool is working or not

The md5 calculation is now wrapped in a call to the gum tool, which displays a spinner while the process is running. This required installing gum in the files-metadata-console Dockerfile. For the staging and live environments, gum is already installed on the bastion server.
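The wrapping looks something like this (a sketch, not the exact line from the script):

# Show a spinner while md5sum runs; --show-output passes the checksums through
$ gum spin --spinner dot --title "Calculating md5 checksums..." --show-output -- md5sum DLPFC_69_72_VNS_results.csv E2_VNS_Ground_Truth.csv > 102498.md5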

thought: I wonder if it's best to hand over the tools to the curators gradually, to avoid curator brain overload and having to support them on all the tools at once. We could start by handing over (one by one?) the tools that constitute the PostUpload workflow. Now that there is a diagram, they shouldn't have a problem figuring out the place of each new tool in the grand scheme of things

Sure, we can discuss this with @only1chunts.

All typos have been fixed.

pli888 commented 3 weeks ago

Testing

Dev environment

To test on dev environment, follow instructions in gigadb-website/docs/developers/CURATOR_TOOLS_BASTION.md.

Staging environment

To test on your staging server:

# Spin up your staging environment
../../../scripts/tf_init.sh --project gigascience/forks/your-gigadb-website --env staging --region your-region --ssh-key ~/.ssh/your-key.pem

# Provision with Terraform 
AWS_PROFILE=your-profile terraform plan
AWS_PROFILE=your-profile terraform apply
AWS_PROFILE=your-profile terraform refresh

# Copy ansible files
../../../scripts/ansible_init.sh --env staging

# Run webapp playbook
env TF_KEY_NAME=private_ip OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES ansible-playbook -i ../../inventories webapp_playbook.yml -e "gigadb_env=staging"

# Run bastion playbook
env OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES ansible-playbook -i ../../inventories bastion_playbook.yml -e "gigadb_env=staging"

# Create curator user on bastion server
ansible-playbook -i ../../inventories users_playbook.yml -e "newuser=peterl" -e "credentials_csv_path=~/Desktop/credentials-peterl.csv" -e "gigadb_env=staging"

Copy over files for testing onto staging bastion server:

$ pwd
/path/to/gigadb-website
$ scp -i ~/.ssh/your-key.pem -r gigadb/app/tools/files-metadata-console/tests/_data/dropbox/user5 centos@your-staging-ip:/share/dropbox

Check files have been copied onto staging server:

$ ssh -i ~/.ssh/your-key.pem centos@your-staging-ip
[centos@ip-10-99-0-120 ~]$ ls /share/dropbox/user5/
DLPFC_69_72_VNS_results.csv  E2_VNS_Ground_Truth.csv

Then follow docs/curators/CURATOR_TOOLS_BASTION.md using your curator user account on bastion.

pli888 commented 1 week ago

issue: I think the step 2 section (Change directory to /share/dropbox/user directory), as it stands, is unnecessary, as it is neither a tool nor a process. I would have put the change-directory command as a preamble/prerequisite to all the subsequent sections that require the curator to be located in the user dropbox. Also, the current step 2 in the diagram doesn't feel right, as it seems to indicate that something is taken from the database and put in the user dropbox, which is not correct.

There is a banner at the beginning of each post-upload step which tells the user to ensure they are in a /share/dropbox/user directory. The diagram has also been updated accordingly.

typo: grammar of "From this the user dropbox directory" in the 3. createReadme section

This typo has been corrected.

question: is there a reason for the curators to not use the parameters --wasabi --apply --use-live-data on createReadme?

The curator documentation has been updated in step 2. createReadme to describe how using these flags enables the readme file to be uploaded into Wasabi.
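Combining those flags with the invocation pattern quoted below, the documented command looks like:

# Create the readme and upload it into Wasabi in one step
$ sudo /usr/local/bin/createReadme --doi 102498 --wasabi --apply --use-live-data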

question: when running sudo /usr/local/bin/createReadme --doi 102498 --wasabi --apply (both as centos and non-centos users), I have the following error in /var/log/gigadb/readme.log:

I had the same problem on my staging server; it is caused by rclone not being able to find the AWS credentials. To fix this, these variables have been added to the createReadme script:

export AWS_SHARED_CREDENTIALS_FILE=/home/centos/.aws/credentials
export AWS_PROFILE=wasabi-transfer

issue: I think the workflow as described in the docs and in the diagram is missing the final steps:

  1. transfer --wasabi --backup, which uploads dataset files to Wasabi and backs them up to S3 Glacier

The above has been added as step 6. transfer - copy dataset files into Wasabi in docs/curators/CURATOR_TOOLS_BASTION.md.

  2. Housekeeping of user dropboxes of published datasets (the curator doc should mention that curators need to delete user5 and user5.orig after ensuring the files have been backed up to S3 Glacier in the previous step)

This has been added as step 7. Housekeeping of user dropboxes of published datasets into docs/curators/CURATOR_TOOLS_BASTION.md.
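The housekeeping step amounts to something like the following, using the example dropbox from this thread:

# Only run after confirming the S3 Glacier backup in the previous step
$ sudo rm -rf /share/dropbox/user5 /share/dropbox/user5.orig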

  3. Publish dataset and Mint a DOI

I've not added this to the curator documentation since I'm not completely sure what these steps involve, even after reading the curator docs on Google Sites.

typo: ## 5. Run filesMetdaToDb to update file

This has been fixed in step 4 in curator documentation.

suggestion: the step 6 section should probably mention that curators can start creating mockup pages in the admin dashboard in order to preview the final dataset view page with the information that was added to the database in previous steps

This has been added into 5. Go to dataset admin page on gigadb.org in curator documentation.

issue (non-blocking): when using sudo, curators have to use the full path to the command (e.g. /usr/local/bin/calculateChecksumSizes). It would be better, I think, if they could just type the command name, for clarity and speed, and to further reduce opportunities for mistakes.

There is now a step in ops/infrastructure/roles/bastion-users/tasks/main.yml which creates an /etc/sudoers.d/89-defaults file so curators do not have to type the full path to a script to execute it. The documentation has been updated accordingly.
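Presumably 89-defaults adjusts sudo's secure_path; the exact Defaults line is an assumption, but the idea is:

# Keep /usr/local/bin on the PATH sudo uses, so `sudo createReadme ...` resolves
$ echo 'Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin' | sudo tee /etc/sudoers.d/89-defaults
$ sudo chmod 0440 /etc/sudoers.d/89-defaults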