martinpeck opened 5 years ago
Different approaches can be considered depending on the use case:
Different workloads might be generated by different user bases, depending on how their reference use cases differ:
Based on the above differentiation, it is important to understand to what extent those categories require the same tooling (HBase).
Provided that researchers need HBase for at least part of their job, can their HBase-related use case be supported by the same HBase cluster used for clinical testing?
If it can:
If it cannot, a new cluster should be provisioned, and the following aspects should be evaluated:
This process has the benefit of:
It is highly likely that researchers will need a different toolset, e.g. Spark or equivalent analytics platforms. Azure analytics platforms (e.g. HDInsight, Azure Databricks, Azure Data Lake, Azure SQL Data Warehouse) can share the same data lake (if any) with HBase and have good ways to source data from there.
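As a rough illustration of that sharing (a sketch only; the storage account, container, and key below are hypothetical placeholders, not from this thread), a Spark session on HDInsight or Databricks could point at the same ADLS Gen2 filesystem that backs the HBase side:

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark job (HDInsight or Databricks) reading from the
# same ADLS Gen2 account that backs the HBase cluster. Account, container
# and key are hypothetical placeholders.
spark = SparkSession.builder.appName("shared-data-lake-demo").getOrCreate()

# Authenticate against the shared storage account (account-key auth shown
# for brevity; service principals are the more common production choice).
spark.conf.set(
    "fs.azure.account.key.mygenomicslake.dfs.core.windows.net",
    "<storage-account-key>")

# Read files that the HBase/OpenCGA side has written to the shared lake.
df = spark.read.parquet(
    "abfss://variants@mygenomicslake.dfs.core.windows.net/exports/")
df.printSchema()
```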
So the key questions would be:
1) There are 3 stages of transformation for genomics files: do they happen in HBase or outside? Does a raw data lake exist before HBase? What does the data processing chain look like?
2) What tooling do researchers need?
3) Can they share the same HBase cluster plus something different (e.g. Databricks)?
Some useful links:
Setting up Backup and Replication for HBase and Phoenix on HDInsight https://github.com/ZoinerTejada/hdinsight-docs/blob/master/hdinsight-hbase-backup-replication.md
Working with the HBase Import and Export Utility https://blogs.msdn.microsoft.com/data_otaku/2016/12/21/working-with-the-hbase-import-and-export-utility/
Copy Activity performance and tuning guide https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance
@PAOLT to have a discussion with @j-coll @imedina
I discussed this with @j-coll.
The data flow looks like the following:
Business requirements
How is data synchronized between the two clusters?
From a business perspective it is acceptable to sync the two environments 2-3 times per year. This results in a very large batch, but it is affordable to wait a few days. The alternative would be to perform a dual-load, which is fairly typical in these kinds of environments. Indeed, it would streamline the data loading process (fresh data available sooner + a less critical sync process) at the price of changing the application. Given that the benefits of dual-loading are not a requirement, it is good to go with the batch/occasional approach for the moment.
What platform is going to be used for analytics?
Spark with Parquet files. Two flavours of Spark exist on Azure (besides installing on dedicated VMs, clearly):
Next steps
1) Verify Databricks peering with the current HBase set-up. 2) Scrutinize the large-batch mechanism for replicating the data (the batch might be huge), given that Azure Data Lake is built from the ground up to provide very high IO throughput.
Summary of phase 1 is:
the strategy might be to go with a data lake shared between the 2 clusters (HBase for clinical tests + Spark for researchers; Spark might be either HDInsight Spark or Databricks)
the data lake can be based on ADL Store v1, BLOB or ADL Store v2 (preview), the latter being the convergence of ADL v1 and BLOB. Given the scale of the solution, ADL would be the way to go (IO throughput).
ADL v2 has a new driver (ABFS) that connects the HDFS APIs directly to the ADL v2 REST APIs. This new driver is documented here https://jira.apache.org/jira/browse/HADOOP-15407 and here https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-abfs-driver. It could be used with the existing export solution
In addition, a new HBase connector for Spark has been released (here). This connector would allow writing a batch job, running on an ephemeral cluster, that pulls data from HBase and pushes Parquet files to the data lake (see the sketch after the list below).
The main differences between the two approaches would be the following; a test would be required to validate them, evaluate their impact more clearly, and measure performance.
ABFS would require minimal changes to the existing solution (exact scope and effort TBD), whereas the HBase driver solution would require writing a brand new solution
the HBase driver solution would make more concurrent use of resources (pulling data from HBase and processing it on another cluster, vs. processing on the HBase cluster itself)
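A minimal sketch of the second approach, assuming the Hortonworks Spark-HBase connector (shc). The table name, column family, and column mapping below are hypothetical placeholders; OpenCGA's real HBase schema is considerably more complex and would need its own catalog definition:

```python
import json
from pyspark.sql import SparkSession

# Sketch of an ephemeral-cluster batch job: pull rows from HBase through
# the shc connector and write them to the data lake as Parquet.
# Table name, column family and columns are hypothetical placeholders.
spark = SparkSession.builder.appName("hbase-to-parquet").getOrCreate()

catalog = json.dumps({
    "table": {"namespace": "default", "name": "opencga_variants"},
    "rowkey": "key",
    "columns": {
        "key":    {"cf": "rowkey", "col": "key", "type": "string"},
        "allele": {"cf": "v",      "col": "a",   "type": "string"},
    },
})

# Read from HBase via the shc data source...
df = (spark.read
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .options(catalog=catalog)
      .load())

# ...and push Parquet files to the shared data lake (placeholder URI).
df.write.mode("overwrite").parquet(
    "abfss://variants@mygenomicslake.dfs.core.windows.net/parquet/")
```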
Here is @bart-jansen's guide on how to get the basic one-box setup running in Azure for testing: https://github.com/opencb/opencga/tree/azure/opencga-app/app/scripts/azure
Tried the above guide on UK West up to the application install steps; it worked nicely. Created an HDInsight cluster with Azure Data Lake Gen2 (preview).
Useful links for installing the above services:
@martinpeck to share details on loading VCF with @PAOLT
@PAOLT requires help from @j-coll to run the "export tool"
@PAOLT I found this documentation for the export part; it may help: http://docs.opencb.org/pages/viewpage.action?pageId=15597876
@wbari found the following doc on the import/export process; this may be something we can try in the meantime: http://docs.opencb.org/pages/viewpage.action?pageId=15597876
I believe that this issue will be blocked by the bug seen in #1069. Until that is resolved, one option would be to test with non-Gen2 storage, as we know that can work. However, this may not give us useful information if performance differs significantly between the generations.
I went through this. I'm still getting runtime errors when doing the transform, but the previous steps worked with some modifications that I'm sharing below:
This PR: https://github.com/opencb/opencga/pull/1079 may resolve your runtime issues. If you have a chance to pull the changes and re-attempt the import with that build, it would be useful to see whether it resolves them.
The steps below show how to successfully export data from HBase to Azure Data Lake Gen2 by means of the existing export tool.
The tool that was tried is here
Services have been provisioned in UK West
Select custom (size, settings, apps)
Basics
Security + Networking
Storage
Cluster size: any
Follow instructions here
Run the "Setup resources" section
Clone OpenCGA and build it with Hadoop
git clone -b azure https://github.com/opencb/opencga.git
cd opencga
mvn clean install -DskipTests -Dstorage-hadoop -Popencga-storage-hadoop-deps -Phdp-2.6.5 -DOPENCGA.STORAGE.DEFAULT_ENGINE=hadoop
Continue with instructions here
sudo reboot
The following statements allow testing the application with a VCF file
Install catalog (remember password for later use)
sudo /opt/opencga/bin/opencga-admin.sh catalog install --secret-key SeCrEtKeY
cd to the application directory
cd /opt/opencga
Start daemon for debugging
sudo /opt/opencga/bin/opencga-admin.sh catalog daemon --start
Open a separate terminal or use screen, and cd to the application directory
cd /opt/opencga
Create a user (in the new terminal)
sudo /opt/opencga/bin/opencga-admin.sh users create -u test --email test@gel.ac.uk --name "John Doe" --user-password testpwd
Login to get session token
sudo /opt/opencga/bin/opencga.sh users login -u test
Create project
sudo /opt/opencga/bin/opencga.sh projects create --id reference_grch37 -n "Reference studies GRCh37" --organism-scientific-name "Homo sapiens" --organism-assembly "GRCh37"
Create a study within your project
sudo /opt/opencga/bin/opencga.sh studies create --id 1kG_phase3 -n "1000 Genomes Project - Phase 3" --project reference_grch37
wget a VCF genome file
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Link the VCF genome file (using the absolute path to the VCF file)
sudo /opt/opencga/bin/opencga.sh files link -i /home/test/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -s 1kG_phase3
Transform -> view progress in DAEMON terminal
sudo /opt/opencga/bin/opencga.sh variant index --file ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --transform -o outDir
Load -> view progress in DAEMON terminal
sudo /opt/opencga/bin/opencga.sh variant index --file ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --load -o outDir
Query:
sudo /opt/opencga/bin/opencga-analysis.sh variant query --sample HG00096 --limit 100
The export tool is stored here
Run the following from the HDI head node.
cd opencga
# Ensure user-supplied jars take precedence over the cluster's bundled versions
export HADOOP_USER_CLASSPATH_FIRST=true
# Extract the HBase configuration directories from the HBase classpath
hbase_conf=$(hbase classpath | tr ":" "\n" | grep "/conf" | tr "\n" ":")
# Put the HBase config and the required Avro/Jackson jars on the Hadoop classpath
export HADOOP_CLASSPATH=${hbase_conf}:$PWD/libs/avro-1.7.7.jar:$PWD/libs/jackson-databind-2.6.6.jar:$PWD/libs/jackson-core-2.6.6.jar
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:$PWD/libs/jackson-annotations-2.6.6.jar
Run the export tool with
yarn jar ~/opencga/opencga-storage-hadoop-core-1.4.0-rc3-dev-jar-with-dependencies.jar org.opencb.opencga.storage.hadoop.variant.io.VariantExporterDriver opencga_test_reference_grch37_variants study test@reference_grch37:1kG_phase3 --of parquet_gz --output my.variants.parquet
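As a quick sanity check (a sketch only, assuming Spark is available on the cluster and that the --output path above is relative to the HDFS home directory), the exported Parquet can be inspected with PySpark:

```python
from pyspark.sql import SparkSession

# Sketch: verify the export by loading the Parquet output produced by
# VariantExporterDriver. The path matches the --output value used above.
spark = SparkSession.builder.appName("verify-export").getOrCreate()

df = spark.read.parquet("my.variants.parquet")
df.printSchema()    # inspect the exported variant schema
print(df.count())   # row count as a basic sanity check
```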
@j-coll to test this solution once we have a larger/more significant set of data to test with. Blocked until that's the case.
The above procedure was tested successfully on the Azure branch as committed on Jan 31st. I tried to reproduce it on the Azure branch committed at the end of that day, but I encountered some issues; most specifically, the transform statement
sudo /opt/opencga/bin/opencga.sh variant index --file ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --transform -o outDir
started correctly, but once it reached 65% it stopped working without any observable feedback in the catalog daemon's standard output. Any attempt to restart it with --re didn't produce any related output in the scrolling standard output.
Many clinical genomics projects have associated researchers who study all the genome data collected. In this scenario we can identify two main use cases:
We need to investigate how to make the different big-data snapshots available to researchers in an optimal way.
Option 1 - Replication. OpenCGA implements export/import functionality. Data could be exported every 6 months and imported into a second installation in read-only mode. By doing this there are two copies of the data in different installations, with different users and goals.
Option 2 - Only Spark. OpenCGA can export data in Apache Parquet format to be analysed with Spark. OpenCB Oskar (https://github.com/opencb/oskar) implements a genomic-analysis Spark library and notebooks for researchers to analyse data. This option is easier to set up and less data is replicated, but researchers will need to query the OpenCGA installation for some queries, and there could be some problems between users.
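A minimal sketch of what Option 2 could look like for a researcher, assuming plain PySpark over the exported Parquet (the column names chromosome and start follow OpenCGA's variant model, but should be checked against the actual export schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Sketch of a researcher-side analysis over the exported Parquet files.
# Column names are assumptions based on OpenCGA's variant data model.
spark = SparkSession.builder.appName("variant-analysis").getOrCreate()

variants = spark.read.parquet("my.variants.parquet")

# Example: count variants per chromosome...
variants.groupBy("chromosome").count().orderBy("chromosome").show()

# ...or restrict to a region of interest on chromosome 22.
region = variants.filter(
    (F.col("chromosome") == "22")
    & F.col("start").between(16050000, 16060000))
region.show(10)
```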