I think saving the container image and pulling it makes sense, instead of regenerating it every time, especially if regenerating takes significantly longer than pulling.
@zzma It actually doesn't take that long (less than 30s). The script basically pulls an existing mysql 5.7 image and runs my python script to create the tables. I think we won't have to deal with this much once we have a database running, so I will leave it as is for now.
If it helps, we should also create an index on the `node_id` fields if we do a lot of lookups by `node_id`.
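For illustration, a secondary index along these lines would turn `node_id` lookups into index scans; the table and index names here are placeholders, since the actual schema lives in the setup script:

```sql
-- Hypothetical index on one of the tables; identifiers are illustrative.
CREATE INDEX idx_devp2p_hello_node_id ON devp2p_hello (node_id);

-- A lookup like this can then use the index instead of a full table scan:
SELECT * FROM devp2p_hello WHERE node_id = '<hex node id>';
```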
Some thoughts on the tables:
- Keep the `neighbors` and `node_id_hash` tables as is. `neighbors` uses all fields as a composite primary key and counts the number of occurrences for each entry on an exact match. `node_id_hash` uses `node_id` as the primary key.
- Combine the `devp2p_hello` and `eth_status` tables into one. If any of the fields (except total difficulty and best hash) changes, we consider it a new record.
- Add an `id` field as the primary key.
- Rename the `first_ts` and `last_ts` fields to `first_received_at` and `last_received_at`.
- Create an index on the `node_id` field. For `node_id`, I'm considering prefix indexing.
According to https://dev.mysql.com/doc/refman/5.5/en/column-indexes.html, a fulltext index is only supported by MyISAM. We could index all 128 characters of `node_id`, but I'm not sure that would be necessary. Should I consider only the first 16 characters or something?
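A minimal sketch of the prefix-index option, assuming `node_id` is stored as a 128-character hex string (table and index names are again placeholders):

```sql
-- Index only the first 16 characters of the hex-encoded id (64 bits).
-- Prefix collisions are possible but should be rare, and MySQL re-checks
-- the full node_id value on the rows the prefix matches, so equality
-- lookups stay correct.
CREATE INDEX idx_node_id_prefix ON devp2p_hello (node_id(16));
```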
I will create a separate PR for the data insert and search, but here is my current plan:

- Insert: use `nodeId` as the key. For each `nodeId`, get the entry with MAX(id) and update the `last_received_at` and counter fields.
- This should cover a node that replies `too many peers`, then receives new information later, then replies `too many peers` again on the next scan.
- Search: for each `node_id`, return the entry with MAX(id) (see the SQL sketch after this list).
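A rough sketch of what that could look like in SQL; the table and column names (`devp2p_hello`, `counter`) are assumptions based on the discussion above, not the actual schema:

```sql
-- Search: the latest record per node, i.e. the row with MAX(id)
-- for each node_id.
SELECT t.*
FROM devp2p_hello AS t
JOIN (
    SELECT node_id, MAX(id) AS max_id
    FROM devp2p_hello
    GROUP BY node_id
) AS latest
  ON t.node_id = latest.node_id AND t.id = latest.max_id;

-- Insert path: bump the counter and timestamp on the newest row only.
-- (`?` is a bind parameter supplied by the client.)
UPDATE devp2p_hello
SET counter = counter + 1,
    last_received_at = NOW()
WHERE node_id = ?
ORDER BY id DESC
LIMIT 1;
```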
Rename `node_id_hash` to `node_properties` and move the counter fields to it.
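In MySQL 5.7 syntax, that rename and column move could look something like this; the counter column names are made up for the example:

```sql
RENAME TABLE node_id_hash TO node_properties;

-- Move the per-node counters onto the renamed table; the column names
-- here are illustrative, not the ones in the actual schema.
ALTER TABLE node_properties
    ADD COLUMN hello_counter  INT UNSIGNED NOT NULL DEFAULT 0,
    ADD COLUMN status_counter INT UNSIGNED NOT NULL DEFAULT 0;
```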
@SimonSK I think renaming `node_id_hash` to `node_meta_info` would be more appropriate.
This creates a mysql docker container that hosts our node database. The script assumes docker (and docker-compose), python2, and pip2 are installed. I prefer isolating the database in a container, as I don't feel comfortable messing with the mysql server already hosted on the machine. The actual database directory is stored directly on the host (or on the NFS, if that's what we set it to) and attached to the container.
How to use:

1. Edit `config.py` and `ethnodes/.env`.
2. Run `sudo ./setup.sh`.
On second thought, maybe I should just save the container image, push it to a private repo, and pull it from there.
The database contains 4 different tables: `neighbors`, `node_id_hash`, `devp2p_hello`, and `eth_status` (a rough schema sketch for two of them follows the list).

- `neighbors`: node addresses from the UDP `NEIGHBORS` replies go here.
- `node_id_hash`: mapping between `nodeid` and its sha3 (keccak256). I haven't decided if I want to keep looking into the XOR distance, but I would like to keep this information for now.
- `devp2p_hello`: DEVp2p Hello contents go here. Only the remote ephemeral port, counters, and timestamp fields get overwritten. If any of the other fields changes, a new row is added.
- `eth_status`: Ethereum Status contents go here. Only the dao-fork check, counter, and timestamp fields get overwritten. If any of the other fields changes, a new row is added.
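For reference, a rough sketch of what the first two tables could look like; the column names and types are guesses, not the actual DDL from the setup script:

```sql
CREATE TABLE neighbors (
    node_id  CHAR(128) NOT NULL,              -- hex-encoded node id
    ip       VARCHAR(39) NOT NULL,            -- IPv4 or IPv6 address
    udp_port SMALLINT UNSIGNED NOT NULL,
    tcp_port SMALLINT UNSIGNED NOT NULL,
    counter  INT UNSIGNED NOT NULL DEFAULT 1, -- occurrences of this exact tuple
    -- every address field participates in the composite primary key,
    -- so an exact match bumps counter instead of adding a row
    PRIMARY KEY (node_id, ip, udp_port, tcp_port)
);

CREATE TABLE node_id_hash (
    node_id      CHAR(128) NOT NULL, -- hex-encoded node id (primary key)
    node_id_sha3 CHAR(64)  NOT NULL, -- keccak256 of the id, kept for XOR distance
    PRIMARY KEY (node_id)
);
```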