pete-wn / exiletools-indexer

This project has been shut down
https://www.reddit.com/r/pathofexiledev/comments/4xviyw/psa_exiletools_shutting_down_after_prophecy/

v4 New ExileTools Infrastructure #135

Closed pete-wn closed 8 years ago

pete-wn commented 8 years ago

This issue will serve as a placeholder for the plans to implement the new infrastructure required for production deployment of the ExileTools v4 Indexer and its associated production services.

This will give interested parties additional insight into the current production environment and illustrate how the organic growth of ExileTools, along with the adoption of various tools over time, has led to an inefficient infrastructure.

Current Infrastructure Overview

Hardware / Software

pwx

Back End Services include:

  1. MariaDB (only used for Ladder API and league information at this time)
  2. Jenkins (manages execution of Ladder API updates)
  3. ElasticSearch Cluster Master (stats, items, and other indexes)
  4. All Indexer-related pipeline tools (river-watch, etc.)

Front End Services include:

  1. Apache, serves ALL primary web content for exiletools.com as well as the older Ladder API front end
  2. Varnish (all requests to Apache go through Varnish for front end caching)
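
As a rough illustration of the Varnish-in-front-of-Apache arrangement, a minimal VCL along these lines points Varnish at a local Apache backend. The host, port, and layout are assumptions for illustration, not the production config:

vcl 4.0;

# Hypothetical backend: Varnish forwards cache misses to Apache,
# assumed here to be listening on localhost:8080.
backend apache {
    .host = "127.0.0.1";
    .port = "8080";
}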

pwx2

  • Secondary Server
  • 64GB RAM, 16 CPU cores

Back End Services include:

  1. ElasticSearch Cluster Secondary. Kibana is hosted here as well.
  2. Tyk.io API Gateway (handles incoming API requests by firing them off at localhost:9200 for ElasticSearch). Redis and MongoDB are used by Tyk.
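
For context, a Tyk API definition that proxies incoming requests through to the local ElasticSearch HTTP port looks roughly like the sketch below. This is a hedged illustration only: the names, listen path, and auth header are assumptions, not the actual production definition; the only detail taken from above is the http://localhost:9200 target.

{
  "name": "exiletools-index",
  "api_id": "exiletools-index",
  "org_id": "default",
  "use_keyless": false,
  "auth": { "auth_header_name": "Authorization" },
  "version_data": {
    "not_versioned": true,
    "versions": { "Default": { "name": "Default" } }
  },
  "proxy": {
    "listen_path": "/index/",
    "target_url": "http://localhost:9200/",
    "strip_listen_path": true
  },
  "active": true
}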

Front End Services include:

  1. nginx for any incoming API requests. nginx rewrites basic HTTP auth into something Tyk can read and provides a short front-end cache for all API requests (including POSTs) based on checksums of the request payload.
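
A minimal sketch of the kind of nginx configuration described in the item above, assuming Tyk listens locally on port 8080; the cache zone, header name, timings, and port are illustrative guesses rather than the real config. The two key ideas are hashing the request body into the cache key so POST searches can be cached, and forwarding the client's Basic auth credentials in a header the gateway is configured to read:

# Hypothetical short-lived cache for API responses.
proxy_cache_path /var/cache/nginx/api levels=1:2 keys_zone=apicache:50m inactive=60s;

server {
    listen 80;

    location / {
        # Cache POST searches too, keyed on method, URI, and request body.
        proxy_cache         apicache;
        proxy_cache_methods GET HEAD POST;
        proxy_cache_key     "$request_method|$request_uri|$request_body";
        proxy_cache_valid   200 30s;

        # Pass the original Basic auth value on in a header for the gateway
        # (the header name here is an assumption).
        proxy_set_header X-Api-Authorization $http_authorization;

        # Tyk gateway assumed to listen locally on port 8080.
        proxy_pass http://127.0.0.1:8080;
    }
}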

pwx3

  • Secondary Server, ES only
  • 48GB RAM, 12 CPU cores

Back End Services include:

  1. ElasticSearch Cluster member. This is the server's only purpose; adding another 30GB JVM increases the amount of item data that can be held in memory across the shards.

haproxy

  • Small VM running a front-end software load balancer that provides SSL offload and routes traffic to either Varnish on pwx (primary requests) or nginx on pwx2 (ES Index requests). All external inbound traffic passes through this system first.
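
A rough haproxy sketch of that routing, with hypothetical addresses, ports, certificate path, and path rule (the actual routing criteria are not documented here):

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend https_in
    # SSL offload: terminate TLS here and speak plain HTTP to the backends.
    bind *:443 ssl crt /etc/haproxy/certs/exiletools.pem
    # Send index/API traffic to nginx on pwx2, everything else to Varnish on pwx.
    acl is_index path_beg /api /index
    use_backend nginx_pwx2 if is_index
    default_backend varnish_pwx

backend varnish_pwx
    server pwx 192.0.2.10:6081 check

backend nginx_pwx2
    server pwx2 192.0.2.11:80 check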

ElasticSearch

  • Runs on NFS - slow!
  • Three 30GB JVMs spread across three machines, with each index having two shards per JVM, mean the indexes can hold ~40GB of data in memory between them while still supporting full redundancy in the event of node failure (see the settings sketch after this list). This is not enough memory to manage a long-running item index.
  • Internal programs access the ES cluster directly, while external programs must go through haproxy -> nginx -> Tyk -> ES.
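
For reference, that layout maps onto ordinary index settings. One plausible reading of "two shards per JVM" is three primary shards with one replica each, which places two shard copies on every node and survives a single node failure; the index name below is a placeholder:

PUT /example_item_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}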

The Future / Planned Changes

These changes will be deployed into production during or after the rollout of the v4 Indexer Pipeline. I may wait to make full infrastructure stack changes until the Perandus leagues are over. I will announce my plans on Twitter when they are nailed down.

  1. New Backend Machine: 96GB RAM, 24 CPU cores, with 2x 240GB SSDs. This will be the only ElasticSearch machine moving forward, with two JVMs running on it. It will be a single point of failure, but the performance gains are worth it. This machine will also run Kafka and the entire v4 Indexer Pipeline.
  2. New Hybrid Services Machine: Probably a small 16GB VM with 8 cores or so. This machine will run the Tyk API Gateway (2.0), Apache, and a very small MariaDB instance for the ladder (which really needs to be transitioned to ElasticSearch).
  3. New Front-end Machine: Probably a small 16GB VM with 8 cores or so. This machine will run haproxy as well as a Varnish cache for Apache and an nginx cache for Tyk (why nginx? it's just sooo much easier to handle caching for POST requests in nginx).

Other Big Changes

I think that when I move to Tyk 2.0 I will drop the requirement for API keys and authorization headers and just fully open the index; instead, I will simply apply rate limiting by end-user IP address. The main reason I originally asked people to sign up for API keys was so that I had a list of the userbase, but at this point there are something like 500+ API keys out there and fewer than 20 are in active daily use, so the list is not very useful or accurate.
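
Tyk can enforce the rate limits itself; purely as an illustration of per-IP rate limiting in general (not the actual future setup), an nginx front end could do something like the following, with made-up zone name, rate, and burst values:

# Track clients by IP address and allow roughly 5 requests per second each.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=5r/s;

server {
    listen 80;

    location / {
        # Small burst allowance so normal tools aren't penalised for short spikes.
        limit_req zone=per_ip burst=10 nodelay;
        # Hypothetical upstream gateway.
        proxy_pass http://127.0.0.1:8080;
    }
}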

I'm also considering picking up some "new" hardware off ebay instead of just deploying on one primary box. Let's see where this goes first though.

pete-wn commented 8 years ago

The first stage of this is planned for release tonight/tomorrow, prior to the Perandus Flashback events.

It consists of a single dedicated node with two enterprise-grade SSDs running only one ElasticSearch index. I did a significant amount of testing with multiple nodes running via docker and did not see a performance gain that justified it. The downside is that with only one node, a crash can cause serious problems, but this isn't an enterprise-grade system and I've only had ElasticSearch crash once in the past two years.

The upside is that the query performance in testing is significantly faster.

Deployment plan is as follows:

  1. Create new production index
  2. Re-index all data from current master to new index (this may take 5-6+ hours)
  3. Shift Tyk endpoint to new master

No other changes should be necessary for the first stage.
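
For step 2, assuming an ElasticSearch version new enough to have the _reindex API (2.3+), the copy can be done with a single request along these lines; the index names are placeholders, not the real ones:

POST /_reindex
{
  "source": { "index": "old_item_index" },
  "dest":   { "index": "new_item_index" }
}

On older versions the same copy has to be done with a scan/scroll plus bulk writes, which is also a matter of hours for an index of this size.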

thirdy commented 8 years ago

The index seems to be lagging behind significantly. I've been constantly seeing items that have already been sold:

These gloves are still present at the time of this writing:

{
  "_index": "poe20160505",
  "_type": "item",
  "_id": "58ff1ebd84c8b1d69f02eb29b19d50b08c64046fac4fd383df06437a6d759e06",
  "_score": null,
  "_source": {
    "shop": {
      "stash": {
        "xLocation": "3",
        "inventoryID": "Stash3",
        "yLocation": "9",
        "stashName": "~price 1 chaos",
        "stashID": "88b5b2566412ad22ed9d457dfa420f6e5aae694d9a615b1184ea550f6e79cc68"
      },

Search:

{
  "index": "index",
  "sort": [
    "shop.chaosEquiv:asc"
  ],
  "from": 0,
  "size": 20,
  "body": {
    "query": {
      "filtered": {
        "filter": {
          "bool": {
            "must": {
              "query_string": {
                "default_operator": "AND",
                "query": "( attributes.itemType:Gloves modsPseudo.+#%\\ Total\\ to\\ Fire\\ Resistance:[26 TO *] modsPseudo.+#%\\ Total\\ to\\ Cold\\ Resistance:[19 TO *] modsPseudo.+#\\ Total\\ to\\ maximum\\ Life:[70 TO *] requirements.Level:<=54 (shop.hasPrice:false OR shop.hasPrice:true) ) attributes.league:\"Perandus Flashback HC\" shop.verified:YES"
              }
            }
          }
        }
      }
    }
  }
}
pete-wn commented 8 years ago

On my end, the server is processing the API faster than it comes down (i.e. the processing threads are often waiting up to a few seconds for new updates to come in). Items are showing up within ~10s of hitting the API update stream, which matches the refresh interval of the index.
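
The ~10s figure corresponds to the index refresh interval; for reference, that setting looks like the request below (index name is a placeholder):

PUT /example_item_index/_settings
{
  "index": { "refresh_interval": "10s" }
}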

Overall the API stream is pretty relaxed at the moment, with ~10-30 stashes coming through every 3s.

If items aren't all getting updated properly, I can only assume something is going on with the API itself. This jibes with the fact that poe.trade seemed VERY slow to update yesterday as well. Right now the API looks totally fine, but it may have been glitching out at some point over the weekend.

pete-wn commented 8 years ago

I'm going to close this out for now and review in the future.