thegreyd / dic17

Repo for Data Intensive Computation 2017, NCState
GNU General Public License v3.0

Technologies #1

Open thegreyd opened 6 years ago

thegreyd commented 6 years ago

Akanksha

Sid

Sachin

Todo:

AkankshaNitw commented 6 years ago

Hive seems like a better option than Google BigQuery. See the side-by-side comparison: https://db-engines.com/en/system/Google+BigQuery%3BHive

AkankshaNitw commented 6 years ago
  1. Apache Hadoop- It is an open-source software framework for distributed storage and processing of big data using the MapReduce programming model. It is composed of four modules: Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce. http://hadoop.apache.org/

  2. Apache Spark- It is a fast, general-purpose engine for large-scale data processing. The project claims speeds up to 100x faster than Hadoop MapReduce for in-memory workloads. It runs on Hadoop, Mesos, standalone, or in the cloud, and can access diverse data sources including HDFS, Cassandra, HBase, and S3. https://spark.apache.org/

  3. Apache Kafka- It is a distributed streaming platform used for building real-time streaming data pipelines and applications. Use cases include website activity tracking and news article recommendation. https://kafka.apache.org/

  4. NoSQL DBs- An approach to database design that uses a key-value, document, column-family, or graph data model for large, distributed datasets. Used when requirements for performance and scalability outweigh the need for the immediate, strict data consistency that an RDBMS provides.

  5. Hadoop vs NoSQL- Both are good at handling large, distributed datasets. Hadoop is preferred for batch processing, while NoSQL databases are better suited to real-time processing.
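
For item 1, here is a rough sketch of what the MapReduce programming model looks like, using word count as the example. This is plain single-process Python, not Hadoop's API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the `shuffle` step stands in for the grouping that a real framework does across machines between the map and reduce stages.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of text records."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, only for demonstration.
    docs = ["big data big compute", "data pipelines"]
    print(word_count(docs))
```

The point of splitting the work this way is that each `map_phase` call only sees one record and each `reduce_phase` call only sees one key, so both stages can be spread over many machines without sharing state.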