kaizen-ai / kaizenflow

KaizenFlow is a framework for Bayesian reasoning and AI/ML stream computing
GNU General Public License v3.0

Spring2024_HBase_Secondary_Index #800

Open Minato132 opened 4 months ago

Minato132 commented 4 months ago

Write code to implement a secondary index on a preselected database in order to measure the efficiency gains of a secondary index over the primary index in Apache HBase.
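For context, since HBase has no native secondary indexes, the standard pattern is to maintain a second table whose row key is the indexed value. A minimal sketch of that idea over Thrift with HappyBase (all host, table, and column names here are hypothetical, and both tables are assumed to already exist with a column family `cf`):

```python
import happybase

connection = happybase.Connection(host="localhost", port=9090)
users = connection.table("users")               # primary table, keyed by user id
city_index = connection.table("users_by_city")  # index table, keyed by city|user id

def put_user(user_id: bytes, name: bytes, city: bytes) -> None:
    # Every write to the primary table is mirrored into the index table.
    users.put(user_id, {b"cf:name": name, b"cf:city": city})
    city_index.put(city + b"|" + user_id, {b"cf:user_id": user_id})

def users_in_city(city: bytes):
    # A query by city becomes a cheap prefix scan on the index table
    # instead of a full scan of the primary table.
    for _key, data in city_index.scan(row_prefix=city + b"|"):
        yield users.row(data[b"cf:user_id"])
```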

Minato132 commented 4 months ago

I want to update you on some complications and my methodologies for working around them. Each method I tried is described in a separate comment below.

Minato132 commented 4 months ago

Method 1: HBase with HappyBase

Here I looked into implementing secondary indexing without the use of Apache Phoenix. This seemed like a promising start: I was able to find someone who has created a distribution of HBase with a working Thrift server: https://github.com/dajobe/hbase-docker

My plan was to use HappyBase as the Python library for talking to HBase and building this secondary index.

Unfortunately this was a naive approach, as HappyBase does not support secondary indexing. The reason is a limitation of the Thrift API: it lets me talk to HBase, but only allows full table scans and basic query operations. The API does not let me change how a GET operation is handled, aside from basic filtering options.
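To illustrate the ceiling of what the Thrift API exposes here, a short sketch with HappyBase (host, table, and column names are hypothetical):

```python
import happybase

# Connect to the Thrift server exposed by the dajobe/hbase container
# (9090 is the default Thrift port).
connection = happybase.Connection(host="localhost", port=9090)
table = connection.table("users")

# A point GET by row key -- the only indexed lookup HBase provides.
row = table.row(b"user_001")

# Looking rows up by any other column forces a full table scan,
# at best narrowed by a server-side filter string.
for key, data in table.scan(
    filter="SingleColumnValueFilter('cf', 'city', =, 'binary:Boston')"
):
    print(key, data)

# Nothing in this API creates or consults a secondary index,
# which is why this method was ruled out.
connection.close()
```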

Thus, this method cannot be supported.

Minato132 commented 4 months ago

Method 2: HBase with Coprocessor

According to the HBase documentation, you can write a set of Java classes to create and manage a coprocessor that controls how GET operations are handled in HBase.

This method, however, is extremely difficult: not only would I have to develop in Java, but I would also have to manipulate how HBase interacts with Hadoop.

Thus, this method is equally out of the question.

Minato132 commented 4 months ago

Method 3: HBase with Apache Phoenix

The great developers at Apache have released a tool that handles secondary indexing and streamlines the query process: Apache Phoenix.

However, setting up this service is difficult, not because of the nature of Apache Phoenix but because of the integration with Python. Phoenix as a standalone system does not allow connections through an API. You have to manually set up a Query Server, which requires that HBase run in pseudo-distributed mode, meaning HBase and Hadoop have to be configured so that my single machine acts as a one-node cluster.

Even before getting into containerizing such a service, the setup alone is difficult, as there is not much documentation on how to get a Phoenix Query Server running.

There is, however, someone online who has created such a service: https://github.com/milnomada/docker-hbase-phoenix
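For reference, if the Python connection did work, it would in principle look like the following, using the phoenixdb package (the URL is the Query Server's default port and, like the table and column names, an assumption here):

```python
import phoenixdb

# Connect to a (hypothetical) Phoenix Query Server; 8765 is its default port.
conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()

# Phoenix exposes HBase through SQL, including secondary indexes.
cursor.execute("CREATE INDEX IF NOT EXISTS city_idx ON users (city)")

# This query can now be served from the index instead of a full scan.
cursor.execute("SELECT name, city FROM users WHERE city = 'Boston'")
for row in cursor.fetchall():
    print(row)

conn.close()
```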

The main issue is that even with the server up and running, protocol buffer support is not built into the system. This means that the only way to get a connection from Python is to use yet another Apache service, Zeppelin, to create a protocol buffer and finally connect Python to the system.

Apache Zeppelin is a service that cannot simply be started and automatically configured in a Docker setup; it is a program that you must operate through a GUI.

Again, the approach here is too complicated.

Minato132 commented 4 months ago

Method 4: HBase with Phoenix, but without the Phoenix Query Server

Instead of using the Phoenix Query Server to establish a connection through Python for running queries, the system will just use Phoenix as a standalone tool. That means that in order to run queries, a bash shell must be opened inside the container, and that is where queries and commands will be run.

In this method, I can take advantage of Thrift to load data in through Python, but the queries themselves must be run in the Phoenix shell.
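A minimal sketch of the data-loading half over Thrift with HappyBase (names are hypothetical); the matching queries would then be typed into the Phoenix shell inside the container:

```python
import happybase

# Load rows over Thrift; querying happens separately in the Phoenix shell.
connection = happybase.Connection(host="localhost", port=9090)
table = connection.table("users")

# Batch the puts so they go out in a single Thrift round trip.
with table.batch() as batch:
    batch.put(b"user_001", {b"cf:name": b"Alice", b"cf:city": b"Boston"})
    batch.put(b"user_002", {b"cf:name": b"Bob", b"cf:city": b"Chicago"})

connection.close()
```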

Working with HBase in this manner will be a lot easier, as all I need to do is modify the dajobe/hbase image introduced in Method 1 by writing a bash script that injects Phoenix, along with the necessary Python packages, into the system.

This is the method that I have chosen to continue with, as all my research into the previous methods has led me to either dead ends or needless complications.

Edit: It seems that the Phoenix server cannot be run inside the container, as it is unable to connect to the HBase instance inside the system. I will try to work around this, but at the moment it does not look promising.

Minato132 commented 4 months ago

Method 5: HBase without Containerization

Here I have decided to do a manual installation of HBase without containerization, meaning that the entire installation and process will be done locally, until I can find a way to get Phoenix to work in a container.

If this is the method I proceed with, then the documentation will contain full instructions for the process, and the video will serve as an additional demo.