opencb / opencga

An Open Computational Genomics Analysis platform for big data genomics analysis. OpenCGA is maintained and developed by its parent company Zetta Genomics. Please contact support@zettagenomics.com for bug reports and feature requests.
Apache License 2.0

Investigate moving data between HBase Clusters #961

Open martinpeck opened 5 years ago

martinpeck commented 5 years ago

Many clinical genomics projects have associated researchers who study all of the collected genome data. In this scenario we can identify two main use cases:

We need to investigate how to make the different big data snapshots available to researchers in an optimal way.

Option 1 - Replication. OpenCGA implements export/import functionality. Data could be exported every 6 months and imported into a second installation in read-only mode. By doing this there would be two copies of the data in different installations, with different users and goals.

Option 2 - Spark only. OpenCGA can export data in Apache Parquet format to be analysed with Spark. OpenCB Oskar (https://github.com/opencb/oskar) implements a genomic analysis Spark library and notebooks for researchers to analyse data. This option is easier to set up and less data is replicated, but researchers would still need to query the OpenCGA installation for some queries, which could cause problems between the two groups of users.
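
For illustration only, a minimal sketch of what option 2 could look like for a researcher: reading an exported Parquet file from a separate Spark cluster. The spark-shell invocation and the abfss:// path below are placeholders, not part of the current setup.

# Minimal sketch, assuming the Parquet export already sits on shared storage.
# The path is a placeholder; the actual export command appears later in this thread.
spark-shell <<'EOF'
val variants = spark.read.parquet("abfss://exports@datalake.dfs.core.windows.net/1kG_phase3/my.variants.parquet")
variants.printSchema()                          // inspect the exported schema
println(s"variant count: ${variants.count()}")  // basic sanity check
EOF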

PAOLT commented 5 years ago

Different approaches can be considered depending on the use case:

Different workloads might be generated by different user bases, depending on their reference use case:

Based on the above differentiation, it is important to understand to what extent those categories require the same tooling (HBase).

Provided that researchers need HBase for at least part of their job, can their HBase-related use cases be supported by the same HBase cluster used for clinical testing?

If it can:

If it cannot, a new cluster should be provisioned, and the following aspects should be evaluated:

This process has the benefit of:

It is highly likely that researchers will need a different toolset, e.g. Spark or equivalent analytics platforms. Azure analytics platforms (e.g. HDInsight, Azure Databricks, Azure Data Lake, Azure SQL Data Warehouse) can share the same data lake (if any) with HBase and have good ways to source data from it.
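
For example (purely illustrative account and container names), an HBase HDInsight cluster and a separate analytics cluster could both address the same ADLS Gen2 data through abfss:// URIs:

# Minimal sketch, assuming both clusters are configured against the same ADLS Gen2 account.
# Account, container and path names below are placeholders.
hdfs dfs -ls abfss://genomics@shareddatalake.dfs.core.windows.net/variants/
hdfs dfs -du -h abfss://genomics@shareddatalake.dfs.core.windows.net/variants/parquet/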

PAOLT commented 5 years ago

So the key questions would be:
1) There are 3 stages of transformation for genomics files - do they happen in HBase or outside? Does a raw data lake exist before HBase? What does the data processing chain look like?
2) Which tooling do researchers need?
3) Can they share the same HBase cluster plus something different (e.g. Databricks)?

PAOLT commented 5 years ago

Some useful links:

Setting up Backup and Replication for HBase and Phoenix on HDInsight https://github.com/ZoinerTejada/hdinsight-docs/blob/master/hdinsight-hbase-backup-replication.md

Working with the HBase Import and Export Utility https://blogs.msdn.microsoft.com/data_otaku/2016/12/21/working-with-the-hbase-import-and-export-utility/

Copy Activity performance and tuning guide https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance
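
For reference, the HBase-level export/import route described in the second link looks roughly like this (table name and storage paths are placeholders, not OpenCGA's actual table names):

# Hedged sketch of the stock HBase Export/Import MapReduce utilities.
# On the source cluster: dump the table to shared/cloud storage.
hbase org.apache.hadoop.hbase.mapreduce.Export \
  opencga_variants wasb://backups@mystorage.blob.core.windows.net/exports/opencga_variants

# On the destination cluster: the target table must already exist with the same column families.
hbase org.apache.hadoop.hbase.mapreduce.Import \
  opencga_variants wasb://backups@mystorage.blob.core.windows.net/exports/opencga_variants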

martinpeck commented 5 years ago

@PAOLT to have a discussion with @j-coll @imedina

PAOLT commented 5 years ago

I discussed this with @j-coll.

The data flow looks like the following:

Business requirements

How is data synchronized between the two clusters?

From a business perspective it is acceptable to sync the two environments 2-3 times per year. This results in a very large batch, but it is affordable to wait a few days for it. The alternative would be to perform a dual load, which is pretty typical in these kinds of environments. Indeed, it would streamline the data loading process (fresh data available sooner, plus a less critical sync process) at the price of changing the application. Given that the benefits of dual loading are not a requirement, it is fine to go with the batch / occasional approach for the moment.
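
As an illustration only, one way such an occasional large batch could be implemented at the HBase level is via snapshots. Table, snapshot and destination names below are placeholders, and this is a sketch rather than a tested procedure:

# 1) On the source cluster, take a snapshot (cheap, no data copy).
echo "snapshot 'opencga_variants', 'variants_2019_sync1'" | hbase shell

# 2) Ship the snapshot files to the research cluster's storage.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot variants_2019_sync1 \
  -copy-to wasb://hbase@researchstorage.blob.core.windows.net/hbase \
  -mappers 16

# 3) On the research cluster, materialise the snapshot as a read-only copy.
echo "clone_snapshot 'variants_2019_sync1', 'opencga_variants'" | hbase shell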

What platform is going to be used for analytics?

Spark with Parquet files. Two flavours of Spark exist on Azure (besides installing it on dedicated VMs, clearly):

Next steps

1) Verify Databricks peering with the current HBase set-up (see the sketch below).
2) Put the large-batch mechanism for replicating the data under the lens (it might be huge), given that Azure Data Lake is built from the ground up to provide very high I/O throughput.
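
For step 1, assuming the Databricks workspace is deployed with VNet injection into its own VNet (resource names are placeholders and flag spellings may vary by az CLI version), the peering setup would look roughly like this:

# Hedged sketch: two-way peering between the (assumed) Databricks VNet and the HBase/HDInsight VNet.
az network vnet peering create \
  --name hbase-to-databricks \
  --resource-group opencga-rg \
  --vnet-name hbase-vnet \
  --remote-vnet databricks-vnet \
  --allow-vnet-access

az network vnet peering create \
  --name databricks-to-hbase \
  --resource-group opencga-rg \
  --vnet-name databricks-vnet \
  --remote-vnet hbase-vnet \
  --allow-vnet-access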

Data sync v1.pptx

PAOLT commented 5 years ago

Summary of phase 1 is:

The main differences between the two approaches would be the following - a test would be required to validate them, evaluate their impact more clearly, and measure performance.

lawrencegripper commented 5 years ago

Here is @bart-jansen's guide on how to get the basic one-box setup running in Azure for testing: https://github.com/opencb/opencga/tree/azure/opencga-app/app/scripts/azure

PAOLT commented 5 years ago

I tried the above guide on UK West, up to the application install steps - it worked nicely. I created an HDInsight cluster with Azure Data Lake Gen2 (preview).

Useful links for installing the above services:

martinpeck commented 5 years ago

@martinpeck to share details on loading VCF with @PAOLT.
@PAOLT requires help from @j-coll to run the "export tool".

wbari commented 5 years ago

@PAOLT I found this documentation for the export part: http://docs.opencb.org/pages/viewpage.action?pageId=15597876 - maybe it helps.

lawrencegripper commented 5 years ago

@wbari found the following doc on the import/export process; this may be something we can try in the meantime: http://docs.opencb.org/pages/viewpage.action?pageId=15597876

lawrencegripper commented 5 years ago

I believe that this issue will be blocked by the bug seen here: #1069. Until this is resolved, one option would be to test with non-Gen2 storage, as we know this can work. However, this may not give us useful information if the performance is significantly different between the generations.

PAOLT commented 5 years ago

I went through this. I'm still getting runtime errors when doing the transform, but the previous steps worked with some modifications that I'm sharing here below:

Test OpenCGA.docx

lawrencegripper commented 5 years ago

This PR may resolve your runtime issues: https://github.com/opencb/opencga/pull/1079. If you have a chance to pull the changes and re-attempt the import with that build, it would be useful to see whether it resolves them.

PAOLT commented 5 years ago

Solution

The steps below show how to successfully export data from HBase to Azure Data Lake Gen2 by means of the existing export tool.

The tool that has been tried is here

Process to provision services

Services have been provisioned in UK West.

Create a VM

Create a User Assigned Managed Identity

Create a Storage Account:
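
For reference, a hedged az CLI equivalent of the identity and storage account steps could look like this (all names and the region are placeholders, and flag spellings may vary by CLI version):

# Create the user assigned managed identity (placeholder names).
az identity create \
  --resource-group opencga-rg \
  --name opencga-hdi-identity

# ADLS Gen2 = a StorageV2 account with the hierarchical namespace enabled.
az storage account create \
  --resource-group opencga-rg \
  --name opencgadatalake \
  --location ukwest \
  --sku Standard_LRS \
  --kind StorageV2 \
  --hns true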

Create an HDI cluster

Select custom (size, settings, apps)

Basics

Security + Networking

Storage

Cluster size: any

Configure VM

Follow instructions here

Run Setup resources section

Clone OpenCGA and build it with Hadoop

git clone -b azure https://github.com/opencb/opencga.git
cd opencga
mvn clean install -DskipTests -Dstorage-hadoop -Popencga-storage-hadoop-deps -Phdp-2.6.5 -DOPENCGA.STORAGE.DEFAULT_ENGINE=hadoop

Continue with instructions here

Test the application

The following commands allow you to test the application with a VCF file.

Install catalog (remember password for later use):
sudo /opt/opencga/bin/opencga-admin.sh catalog install --secret-key SeCrEtKeY

cd to the application directory:
cd /opt/opencga

Start daemon for debugging:
sudo /opt/opencga/bin/opencga-admin.sh catalog daemon --start

Open a separate terminal or use screen, and cd to the application directory:
cd /opt/opencga

Create a user (in the new terminal):
sudo /opt/opencga/bin/opencga-admin.sh users create -u test --email test@gel.ac.uk --name "John Doe" --user-password testpwd

Login to get session token:
sudo /opt/opencga/bin/opencga.sh users login -u test

Create project:
sudo /opt/opencga/bin/opencga.sh projects create --id reference_grch37 -n "Reference studies GRCh37" --organism-scientific-name "Homo sapiens" --organism-assembly "GRCh37"

Create a study within your project:
sudo /opt/opencga/bin/opencga.sh studies create --id 1kG_phase3 -n "1000 Genomes Project - Phase 3" --project reference_grch37

wget a vcf genome file:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

Link vcf genome file (with absolute path to vcf file):
sudo /opt/opencga/bin/opencga.sh files link -i /home/test/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -s 1kG_phase3

Transform -> view progress in DAEMON terminal:
sudo /opt/opencga/bin/opencga.sh variant index --file ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --transform -o outDir

Load -> view progress in DAEMON terminal:
sudo /opt/opencga/bin/opencga.sh variant index --file ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --load -o outDir

Query:
sudo /opt/opencga/bin/opencga-analysis.sh variant query --sample HG00096 --limit 100

Test the export tool

The export tool is stored here

Run the following from the HDI head node:

cd opencga
export HADOOP_USER_CLASSPATH_FIRST=true
hbase_conf=$(hbase classpath | tr ":" "\n" | grep "/conf" | tr "\n" ":")
export HADOOP_CLASSPATH=${hbase_conf}:$PWD/libs/avro-1.7.7.jar:$PWD/libs/jackson-databind-2.6.6.jar:$PWD/libs/jackson-core-2.6.6.jar
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:$PWD/libs/jackson-annotations-2.6.6.jar

Run the export tool:

yarn jar ~/opencga/opencga-storage-hadoop-core-1.4.0-rc3-dev-jar-with-dependencies.jar \
  org.opencb.opencga.storage.hadoop.variant.io.VariantExporterDriver \
  opencga_test_reference_grch37_variants \
  study test@reference_grch37:1kG_phase3 \
  --of parquet_gz --output my.variants.parquet
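
To sanity-check the output and, optionally, copy it off the cluster to the data lake, something along these lines can be used (the abfss:// destination is a placeholder):

# Inspect the exported Parquet on the cluster's default filesystem.
hdfs dfs -ls my.variants.parquet
hdfs dfs -du -h my.variants.parquet

# Copy it to ADLS Gen2 (placeholder container/account/path).
hadoop distcp my.variants.parquet \
  abfss://exports@opencgadatalake.dfs.core.windows.net/1kG_phase3/my.variants.parquet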

martinpeck commented 5 years ago

@j-coll to test this solution once we have a larger/more significant set of data to test this with. Blocked until that's the case.

PAOLT commented 5 years ago

The above procedure was tested successfully against the Azure branch as last committed on Jan 31st. I tried to reproduce it on the Azure branch as committed at the end of that day, but encountered some issues - most specifically, the transform command

sudo /opt/opencga/bin/opencga.sh variant index --file ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --transform -o outDir

started correctly but, once it reached 65%, it stopped working without any observed feedback in the catalog daemon's standard output. Any attempt to restart it with --re didn't produce any related output in the standard output either.