martinpeck opened 5 years ago
Different approaches can be considered depending on the use case:
Different workloads might be generated by different user bases, depending on how their reference use cases differ:
Based on the above differentiation, it is important to understand to what extent those categories require the same tooling (HBase).
Provided that researchers need HBase for at least part of their job, can their HBase-related use case be supported by the same HBase cluster used for clinical testing?
If it can:
If it cannot, a new cluster should be provisioned, and the following aspects should be evaluated:
This process has the benefit of:
It is highly likely that researchers will need a different toolset, e.g. Spark or equivalent analytics platforms. Azure analytics platforms (e.g. HDInsight, Azure Databricks, Azure Data Lake, Azure SQL Data Warehouse) can share the same data lake (if any) with HBase and have good ways to source data from there.
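As a rough illustration of that sharing (a sketch only; the storage account, container, and key below are hypothetical placeholders, not from this thread), a Spark session on HDInsight or Databricks could point at the same ADLS Gen2 filesystem that backs the HBase side:

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark job (HDInsight or Databricks) reading from the
# same ADLS Gen2 account that backs the HBase cluster. Account, container
# and key are hypothetical placeholders.
spark = SparkSession.builder.appName("shared-data-lake-demo").getOrCreate()

# Authenticate against the shared storage account (account-key auth shown
# for brevity; service principals are the more common production choice).
spark.conf.set(
    "fs.azure.account.key.mygenomicslake.dfs.core.windows.net",
    "<storage-account-key>")

# Read files that the HBase/OpenCGA side has written to the shared lake.
df = spark.read.parquet(
    "abfss://variants@mygenomicslake.dfs.core.windows.net/exports/")
df.printSchema()
```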
So the key questions would be:
1) There are 3 stages of transformation for genomics files: do they happen in HBase or outside? Does a raw data lake exist before HBase? What does the data processing chain look like?
2) What tooling do researchers need?
3) Can they share the same HBase cluster plus something different (e.g. Databricks)?
Some useful links:
Setting up Backup and Replication for HBase and Phoenix on HDInsight https://github.com/ZoinerTejada/hdinsight-docs/blob/master/hdinsight-hbase-backup-replication.md
Working with the HBase Import and Export Utility https://blogs.msdn.microsoft.com/data_otaku/2016/12/21/working-with-the-hbase-import-and-export-utility/
Copy Activity performance and tuning guide https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance
@PAOLT to have a discussion with @j-coll @imedina
I discussed this with @j-coll.
The data flow looks like the following:
Business requirements
How is data synchronized between the two clusters?
From a business perspective it is acceptable to sync the two environments 2-3 times per year. This results in a very large batch, but it is affordable to wait a few days. The alternative would be to perform a dual-load, which is fairly typical in these kinds of environments. Indeed, it would streamline the data loading process (fresh data available sooner + a less critical sync process) at the price of changing the application. Given that the benefits of dual-loading are not a requirement, it is good to go with the batch/occasional approach for the moment.
What platform is going to be used for analytics?
Spark with Parquet files. Two flavours of Spark exist on Azure (besides installing on dedicated VMs, clearly):
Next steps
1) Verify Databricks peering with the current HBase set-up. 2) Scrutinize the large-batch mechanism for replicating the data (the batch might be huge), given that Azure Data Lake is built from the ground up to provide very high IO throughput.
Summary of phase 1 is:
the strategy might be to go with a data lake shared between the 2 clusters (HBase for clinical tests + Spark for researchers; Spark might be either HDInsight Spark or Databricks)
the data lake can be based on ADL Store v1, BLOB or ADL Store v2 (preview), the latter being the convergence of ADL v1 and BLOB. Given the scale of the solution, ADL would be the way to go (IO throughput).
ADL v2 has a new driver (ABFS) that connects the HDFS APIs directly to the ADL v2 REST APIs. This new driver is documented here https://jira.apache.org/jira/browse/HADOOP-15407 and here https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-abfs-driver. It could be used with the existing export solution
In addition, a new HBase connector for Spark has been released (here). This connector would allow writing a batch job, running on an ephemeral cluster, that pulls data from HBase and pushes Parquet files to the data lake (see the sketch after the list below).
The main differences between the two approaches would be the following; a test would be required to validate them, evaluate their impact more clearly, and measure performance.
ABFS would require minimal changes to the existing solution (exact scope and effort TBD), whereas the HBase driver solution would require writing a brand new solution
the HBase driver solution would make more concurrent use of resources (pulling data from HBase and processing it on another cluster, vs. processing on the HBase cluster itself)
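A minimal sketch of the second approach, assuming the Hortonworks Spark-HBase connector (shc). The table name, column family, and column mapping below are hypothetical placeholders; OpenCGA's real HBase schema is considerably more complex and would need its own catalog definition:

```python
import json
from pyspark.sql import SparkSession

# Sketch of an ephemeral-cluster batch job: pull rows from HBase through
# the shc connector and write them to the data lake as Parquet.
# Table name, column family and columns are hypothetical placeholders.
spark = SparkSession.builder.appName("hbase-to-parquet").getOrCreate()

catalog = json.dumps({
    "table": {"namespace": "default", "name": "opencga_variants"},
    "rowkey": "key",
    "columns": {
        "key":    {"cf": "rowkey", "col": "key", "type": "string"},
        "allele": {"cf": "v",      "col": "a",   "type": "string"},
    },
})

# Read from HBase via the shc data source...
df = (spark.read
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .options(catalog=catalog)
      .load())

# ...and push Parquet files to the shared data lake (placeholder URI).
df.write.mode("overwrite").parquet(
    "abfss://variants@mygenomicslake.dfs.core.windows.net/parquet/")
```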
Here is @bart-jansen's guide on how to get the basic one-box setup running in Azure for testing: https://github.com/opencb/opencga/tree/azure/opencga-app/app/scripts/azure
Tried the above guide on UK West up to the application install steps; it worked nicely. Created an HDInsight cluster with Azure Data Lake Gen2 (preview).
Useful links for installing the above services:
@martinpeck to share details on loading VCF with @PAOLT
@PAOLT requires help from @j-coll to run the "export tool"
@PAOLT I found this documentation for the export part; it may help: http://docs.opencb.org/pages/viewpage.action?pageId=15597876
@wbari found the following doc on the import/export process; this may be something we can try in the meantime: http://docs.opencb.org/pages/viewpage.action?pageId=15597876
I believe that this issue will be blocked by the bug seen in #1069. Until that is resolved, one option would be to test with non-Gen2 storage, as we know that can work. However, this may not give us useful information if performance differs significantly between the generations.
I went through this. I'm still getting runtime errors when doing the transform, but the previous steps worked with some modifications that I'm sharing below:
This PR: https://github.com/opencb/opencga/pull/1079 may resolve your runtime issues. If you have a chance to pull the changes and re-attempt the import with that build, it would be useful to see whether it resolves them.
The steps below show how to successfully export data from HBase to Azure Data Lake Gen2 by means of the existing export tool.
The tool that was tried is here
Services have been provisioned in UK West
Select custom (size, settings, apps)
Basics
Security + Networking
Storage
Cluster size: any
Follow instructions here
Run the "Setup resources" section
Clone OpenCGA and build it with Hadoop
git clone -b azure https://github.com/opencb/opencga.git
cd opencga
mvn clean install -DskipTests -Dstorage-hadoop -Popencga-storage-hadoop-deps -Phdp-2.6.5 -DOPENCGA.STORAGE.DEFAULT_ENGINE=hadoop
Continue with instructions here
sudo reboot
The following statements allow testing the application with a VCF file
Install catalog (remember password for later use)
sudo /opt/opencga/bin/opencga-admin.sh catalog install --secret-key SeCrEtKeY
cd to the application directory
cd /opt/opencga
Start daemon for debugging
sudo /opt/opencga/bin/opencga-admin.sh catalog daemon --start
Open a separate terminal or use screen, and cd to the application directory
cd /opt/opencga
Create a user (in the new terminal)
sudo /opt/opencga/bin/opencga-admin.sh users create -u test --email test@gel.ac.uk --name "John Doe" --user-password testpwd
Login to get session token
sudo /opt/opencga/bin/opencga.sh users login -u test
Create project
sudo /opt/opencga/bin/opencga.sh projects create --id reference_grch37 -n "Reference studies GRCh37" --organism-scientific-name "Homo sapiens" --organism-assembly "GRCh37"
Create a study within your project
sudo /opt/opencga/bin/opencga.sh studies create --id 1kG_phase3 -n "1000 Genomes Project - Phase 3" --project reference_grch37
wget a VCF genome file
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Link the VCF genome file (using the absolute path to the VCF file)
sudo /opt/opencga/bin/opencga.sh files link -i /home/test/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -s 1kG_phase3
Transform -> view progress in DAEMON terminal
sudo /opt/opencga/bin/opencga.sh variant index --file ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --transform -o outDir
Load -> view progress in DAEMON terminal
sudo /opt/opencga/bin/opencga.sh variant index --file ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --load -o outDir
Query:
sudo /opt/opencga/bin/opencga-analysis.sh variant query --sample HG00096 --limit 100
The export tool is stored here
Run the following from the HDI head node.
cd opencga
# Ensure user-supplied jars take precedence over the cluster's bundled versions
export HADOOP_USER_CLASSPATH_FIRST=true
# Extract the HBase configuration directories from the HBase classpath
hbase_conf=$(hbase classpath | tr ":" "\n" | grep "/conf" | tr "\n" ":")
# Put the HBase config and the required Avro/Jackson jars on the Hadoop classpath
export HADOOP_CLASSPATH=${hbase_conf}:$PWD/libs/avro-1.7.7.jar:$PWD/libs/jackson-databind-2.6.6.jar:$PWD/libs/jackson-core-2.6.6.jar
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:$PWD/libs/jackson-annotations-2.6.6.jar
Run the export tool with
yarn jar ~/opencga/opencga-storage-hadoop-core-1.4.0-rc3-dev-jar-with-dependencies.jar org.opencb.opencga.storage.hadoop.variant.io.VariantExporterDriver opencga_test_reference_grch37_variants study test@reference_grch37:1kG_phase3 --of parquet_gz --output my.variants.parquet
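As a quick sanity check (a sketch only, assuming Spark is available on the cluster and that the --output path above is relative to the HDFS home directory), the exported Parquet can be inspected with PySpark:

```python
from pyspark.sql import SparkSession

# Sketch: verify the export by loading the Parquet output produced by
# VariantExporterDriver. The path matches the --output value used above.
spark = SparkSession.builder.appName("verify-export").getOrCreate()

df = spark.read.parquet("my.variants.parquet")
df.printSchema()    # inspect the exported variant schema
print(df.count())   # row count as a basic sanity check
```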
@j-coll to test this solution once we have a larger/more significant set of data to test with. Blocked until that's the case.
The above procedure was tested successfully on the Azure branch as committed on Jan 31st. I tried to reproduce it on the Azure branch committed at the end of that day, but I encountered some issues; most specifically, the transform statement
sudo /opt/opencga/bin/opencga.sh variant index --file ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --transform -o outDir
started correctly, but once it reached 65% it stopped working without any observable feedback in the catalog daemon's standard output. Any attempt to restart it with --re didn't produce any related output in the scrolling standard output.
Many clinical genomics projects have associated researchers who study all the genome data collected. In this scenario we can identify two main use cases:
We need to investigate how to make the different big-data snapshots available to researchers in an optimal way.
Option 1 - Replication. OpenCGA implements export/import functionality. Data could be exported every 6 months and imported into a second installation in read-only mode. By doing this there are two copies of the data in different installations, with different users and goals.
Option 2 - Only Spark. OpenCGA can export data in Apache Parquet format to be analysed with Spark. OpenCB Oskar (https://github.com/opencb/oskar) implements a genomic-analysis Spark library and notebooks for researchers to analyse data. This option is easier to set up and less data is replicated, but researchers will need to query the OpenCGA installation for some queries, and there could be some problems between users.
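A minimal sketch of what Option 2 could look like for a researcher, assuming plain PySpark over the exported Parquet (the column names chromosome and start follow OpenCGA's variant model, but should be checked against the actual export schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Sketch of a researcher-side analysis over the exported Parquet files.
# Column names are assumptions based on OpenCGA's variant data model.
spark = SparkSession.builder.appName("variant-analysis").getOrCreate()

variants = spark.read.parquet("my.variants.parquet")

# Example: count variants per chromosome...
variants.groupBy("chromosome").count().orderBy("chromosome").show()

# ...or restrict to a region of interest on chromosome 22.
region = variants.filter(
    (F.col("chromosome") == "22")
    & F.col("start").between(16050000, 16060000))
region.show(10)
```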