kevhen / CryptoCrawler


CryptoCrawler - Crawls information about cryptocurrencies from the web, analyzes it, and presents it in a web dashboard. A project from a course at the University of Media, Stuttgart.


Table of Contents

Documentation & Presentation

Documentation

Presentation

Dashboard

Architecture

Microservice Architecture

Hosting

Setup AWS

VM Setup

Server Setup

Prepare & mount EBS Drive

Prepare Docker

Microservices

Microservice 1: MongoDB

Description:

Optimize:

We run lots of queries based on "timestamp", so create an index on this field.
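A minimal sketch of creating such an index from Python, assuming the `pymongo` driver is used. The database and collection names in the usage comment are hypothetical placeholders, not names taken from this project.

```python
def ensure_timestamp_index(collection):
    """Create an ascending index on the "timestamp" field.

    `collection` is expected to be a pymongo Collection (or any object
    exposing a compatible `create_index` method).
    """
    # 1 means ascending order, the same value as pymongo.ASCENDING.
    return collection.create_index([("timestamp", 1)])


# Usage, assuming pymongo is installed and MongoDB listens on the default port.
# "cryptocrawler" and "tweets" are hypothetical names:
#
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   ensure_timestamp_index(client["cryptocrawler"]["tweets"])
```

Creating an index that already exists with the same definition is a no-op in MongoDB, so a call like this can safely run on every startup.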

Also run on the host (adjusts the kernel's transparent huge pages settings):

echo 1 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Microservice 2: Twitter Stream Listener

Description:

Microservice 3: Crypto Price Crawler

Description:

Microservice 4: Crypto Api Wrapper

Description:

Access the API:

URL: AWS public DNS name or container name, followed by `:8060`

GET: /price

URL-Parameters:

| Parameter | Values | Default | Description |
| --- | --- | --- | --- |
| `coin` | ETH, BTC, IOT | BTC | Defines the coin for which the prices will be retrieved |
| `currency` | EUR, USD | EUR | Defines the currency in which the coin price will be returned |
| `from` | timestamp in ms | - | Start of the requested timespan |
| `to` | timestamp in ms | current time | End of the requested timespan |
| `step` | day, hour, minute | day | Step size between two returned price values |

Example: `http://********:8060/price?coin=BTC&currency=EUR&from=1516974329398&to=1516974379822&step=day`
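The query string for `/price` can be assembled programmatically. The following is a minimal sketch using only the Python standard library. The function name and the base URL `http://localhost:8060` are illustrative assumptions, and because the parameter table lists no default for `from`, the sketch always sends it.

```python
from urllib.parse import urlencode


def build_price_url(base_url, from_ms, to_ms=None,
                    coin="BTC", currency="EUR", step="day"):
    """Build a /price request URL; timestamps are in milliseconds."""
    params = {"coin": coin, "currency": currency, "from": int(from_ms)}
    if to_ms is not None:
        # When omitted, the API is documented to default to the current time.
        params["to"] = int(to_ms)
    params["step"] = step
    return f"{base_url.rstrip('/')}/price?{urlencode(params)}"


# Hypothetical host; replace it with the AWS public DNS name or container name.
url = build_price_url("http://localhost:8060", 1516974329398, 1516974379822)
print(url)
```

The resulting URL can then be fetched with any HTTP client, for example `urllib.request.urlopen(url)`.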

GET: /tweets

URL-Parameters:

| Parameter | Values | Default | Description |
| --- | --- | --- | --- |
| `amount` | amount of tweets as an int | 20 | Defines the amount of tweets that are retrieved |
| `topics` | ethereum,bitcoin,iota | bitcoin | Comma-separated list of topics for which the tweets are returned |
| `from` | timestamp in s | - | Start of the requested timespan |
| `to` | timestamp in s | current time | End of the requested timespan |

Example: `http://********:8060/tweets?amount=30&topics=ethereum,bitcoin&from=1516974329&to=1516974379`
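Note that `/tweets` expects timestamps in seconds, while `/price` expects milliseconds. The sketch below converts a timezone-aware `datetime` to epoch seconds and builds a `/tweets` URL. The helper names and the base URL `http://localhost:8060` are illustrative assumptions, not part of the service.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def to_epoch_seconds(moment):
    """Convert a timezone-aware datetime to whole seconds since the Unix epoch."""
    return int(moment.timestamp())


def build_tweets_url(base_url, from_s, to_s=None, amount=20, topics=("bitcoin",)):
    """Build a /tweets request URL; timestamps are in seconds."""
    params = {"amount": amount, "topics": ",".join(topics), "from": int(from_s)}
    if to_s is not None:
        params["to"] = int(to_s)
    # safe="," keeps the commas of the topic list unescaped, as in the example URL.
    return f"{base_url.rstrip('/')}/tweets?{urlencode(params, safe=',')}"


start = to_epoch_seconds(datetime(2018, 1, 26, 13, 45, 29, tzinfo=timezone.utc))
tweets_url = build_tweets_url("http://localhost:8060", start, start + 50,
                              amount=30, topics=["ethereum", "bitcoin"])
print(tweets_url)
```

Multiplying the seconds value by 1000 gives the millisecond timestamp that `/price` expects.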

Microservice 5: Jupyter Notebook

Description:

Access Notebook:

Microservice 6: Dashboard

Description:

View Dashboard:

Microservice 7: LDA Topic Identification

Description:

Query:

Microservice 8: Anomaly Detection

Description:

Query:



## Microservice 9: Sentiment Analysis

**Description:**
- Adds sentiment information to all tweets in MongoDB
- Constantly queries MongoDB (every 30 seconds) for tweets without a sentiment/score
- Calculates the sentiment and score and stores them in MongoDB
- Very simple algorithm: it just looks for positive/negative words from [a well-known list for financial sentiment analysis](https://www3.nd.edu/~mcdonald/Word_Lists.html).
- The **Score** value is: (count of positive words) - (count of negative words)
- **Sentiment** can be "neg" (score < 0), "pos" (score > 0) or "neu" (score = 0)
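The scoring rule described above can be sketched as follows. This is an illustration under stated assumptions, not the service's actual code: the two word sets are tiny placeholders standing in for the linked word lists, and the tokenization (lowercased runs of letters) is an assumption.

```python
import re

# Placeholder word sets for illustration only; the real service would load
# the positive/negative words from the linked financial word lists.
POSITIVE_WORDS = {"gain", "profit", "strong"}
NEGATIVE_WORDS = {"loss", "crash", "weak"}


def score_text(text):
    """Return (count of positive words) - (count of negative words)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positives = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negatives = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return positives - negatives


def label_sentiment(score):
    """Map a score to "pos" (> 0), "neg" (< 0) or "neu" (= 0)."""
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neu"


tweet = "Strong gain for Bitcoin despite the crash"
score = score_text(tweet)
print(score, label_sentiment(score))
```

A word-count approach like this ignores negation and context ("not a loss" still counts as negative), which is the trade-off for its simplicity.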

# Useful info & commands

## Docker

**Clean up Docker:**

- `docker system prune -a` (removes all stopped containers, unused networks, unused images and the build cache)

**Attach/Detach Container:**

- `docker attach container_name`
- Detach without closing: `CTRL + p, CTRL + q`
- Bash into container: `docker exec -it container_name /bin/bash`

**Connect to MongoDB in Container from Host:**

- Find the IP address of the MongoDB container: `docker inspect $CONTAINER_NAME | grep IPAddress`
- Use that IP address in your MongoDB client
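As an alternative to `grep`, the JSON printed by `docker inspect` can be parsed directly. The sketch below is a minimal example, assuming the container is attached to at least one Docker network. The function names and the container name `mongo` in the usage comment are hypothetical; 27017 is MongoDB's default port.

```python
import json


def container_ip_addresses(inspect_output):
    """Return {network name: IP address} for the first container
    in the JSON output of `docker inspect`."""
    containers = json.loads(inspect_output)
    networks = containers[0]["NetworkSettings"]["Networks"]
    return {name: settings["IPAddress"] for name, settings in networks.items()}


def mongo_uri(ip_address, port=27017):
    """Build a MongoDB connection URI for the given address."""
    return f"mongodb://{ip_address}:{port}"


# Usage on the Docker host ("mongo" is a hypothetical container name):
#
#   import subprocess
#   output = subprocess.run(["docker", "inspect", "mongo"],
#                           capture_output=True, text=True, check=True).stdout
#   for network, ip in container_ip_addresses(output).items():
#       print(network, mongo_uri(ip))
```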

## Docker Compose

**Build & run containers in background:**

- `docker-compose up -d`

**See output of containers:**

- `docker-compose logs -f` for all output or
- `docker-compose logs -f $CONTAINER_NAME` for the output of specific containers

**Force rebuild of all containers:**

- `docker-compose build --no-cache`

## Maintenance

**Show size of MongoDB Directory:**

- `sudo du -sh /data/mongodb`

**Show the 10 largest files and directories:**

- `du -a / | sort -n -r | head -n 10`

# Issues

**Things that could be improved if we had more time:**

- Data loading for Dash is not efficient. If multiple users connect to Dash at the same time, performance degrades noticeably.