jeff1evesque / ist-664

Syracuse IST-664 Final Project with Chris Wilson (team member)

Insert dataset(s) into mongodb #12

Closed jeff1evesque closed 6 years ago

jeff1evesque commented 6 years ago

We need to add logic to allow our dataset(s) to be stored into mongodb.

jeff1evesque commented 6 years ago

Before writing Python logic for the insert, we tried entering the commands manually against port 27019:

>>> client = MongoClient('xxx-xxx-xxx-xxx:27019')
>>> db = client.test_database
>>> posts = db.posts
>>> post_id = posts.insert_one({'first': 'jeff', 'last': 'levesque'}).inserted_id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 693, in insert_one
    session=session),
  File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 607, in _insert
    bypass_doc_val, session)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 595, in _insert_one
    acknowledged, _insert_command, session)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/mongo_client.py", line 1248, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/mongo_client.py", line 1201, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 592, in _insert_command
    _check_write_command_response(result)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/helpers.py", line 217, in _check_write_command_response
    _raise_last_write_error(write_errors)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/helpers.py", line 199, in _raise_last_write_error
    raise WriteError(error.get("errmsg"), error.get("code"), error)
pymongo.errors.WriteError: can't create user databases on a --configsvr instance

The error shows that port 27019 belongs to a config server (--configsvr), which does not accept user databases. We therefore inspected the listening ports on the mongos instance using netstat -nltup, and noticed that port 27017 is also used by mongo. We then attempted the insert on this port:

>>> from pymongo import MongoClient
>>> client = MongoClient('xxx-xxx-xxx-xxx:27017')
>>> db = client.test_database
>>> posts = db.posts
>>> post_id = posts.insert_one({'first': 'jeff', 'last': 'levesque'}).inserted_id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 693, in insert_one
    session=session),
  File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 607, in _insert
    bypass_doc_val, session)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 595, in _insert_one
    acknowledged, _insert_command, session)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/mongo_client.py", line 1248, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/mongo_client.py", line 1201, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 590, in _insert_command
    retryable_write=retryable_write)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/pool.py", line 579, in command
    unacknowledged=unacknowledged)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/network.py", line 142, in command
    unpacked_docs = reply.unpack_response(codec_options=codec_options)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/message.py", line 1418, in unpack_response
    self.raw_response(cursor_id)
  File "/usr/local/lib/python3.5/dist-packages/pymongo/message.py", line 1398, in raw_response
    error_object)
pymongo.errors.OperationFailure: database error: error creating initial database config information :: caused by :: socket exception [CONNECT_ERROR] for rs1/ip-172-31-34-158.ec2.internal:27018,ip-172-31-38-98.ec2.internal:27018,ip-172-31-40-241.ec2.internal:27018
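
The two failures above come from talking to different kinds of nodes: 27019 answered as a config server, while 27017 could not reach the rs1 members on 27018. A minimal sketch for checking what kind of node an address is before writing to it is shown below. The helper names classify_node and probe are hypothetical. It relies on mongos replying to isMaster with msg 'isdbgrid' and on replica set members reporting setName; the configsvr field is an assumption that holds for newer server versions and may be absent on older ones.

```python
def classify_node(hello):
    """Guess a node's role from its isMaster reply (a plain dict)."""
    if hello.get('msg') == 'isdbgrid':
        # mongos routers identify themselves this way
        return 'mongos'
    if 'configsvr' in hello:
        # assumption: only newer config servers include this field
        return 'config server'
    if 'setName' in hello:
        return 'replica set member'
    return 'standalone mongod'


def probe(address):
    """Connect to `address` and classify it; needs a reachable server."""
    # imported lazily so classify_node can be used without pymongo installed
    from pymongo import MongoClient
    client = MongoClient(address, serverSelectionTimeoutMS=2000)
    return classify_node(client.admin.command('ismaster'))
```

For example, probe('xxx-xxx-xxx-xxx:27017') should return 'mongos' if that port really is the router, which is the only endpoint application inserts should target in a sharded cluster.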
jeff1evesque commented 6 years ago

At first glance, the insert statements seem to be working from a local instance:

root@ubuntu-xenial:/home/vagrant# python3 insert.py
post_id: 5bda655f076129444aa25e1f
root@ubuntu-xenial:/home/vagrant# cat insert.py
from pymongo import MongoClient
client = MongoClient('xxx.xxx.xxx.xxx:27017')
db = client.test_database
mycol = db.col_1
post_id = mycol.insert_one({'first': 'jeff', 'last': 'levesque'}).inserted_id
print('post_id: {}'.format(post_id))
root@ubuntu-xenial:/home/vagrant# python3 select.py
collection names: []
root@ubuntu-xenial:/home/vagrant# cat select.py
from pymongo import MongoClient
client = MongoClient('xxx.xxx.xxx.xxx:27017')
print('collection names: {}'.format(client.dh.collection_names()))

However, on the mongos machine, the logs show the balancer repeatedly acquiring and releasing a distributed lock:

2018-11-01T02:27:04.413+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' acquired, ts : 5bda6478065e8d3e63923c2d
2018-11-01T02:27:04.415+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' unlocked.
2018-11-01T02:27:09.842+0000 [LockPinger] cluster xxx.xxx.xxx.xxx:27019 pinged successfully at Thu Nov  1 02:27:09 2018 by distributed lock pinger 'xxx.xxx.xxx.xxx:27019/xxx.xxx.xxx.xxx:27017:1541038299:1804289383', sleeping for 30000ms
2018-11-01T02:27:10.417+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' acquired, ts : 5bda647e065e8d3e63923c2e
2018-11-01T02:27:10.418+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' unlocked.
2018-11-01T02:27:16.420+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' acquired, ts : 5bda6484065e8d3e63923c2f
2018-11-01T02:27:16.422+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' unlocked.
2018-11-01T02:27:22.424+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' acquired, ts : 5bda648a065e8d3e63923c30
2018-11-01T02:27:22.425+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' unlocked.
2018-11-01T02:27:28.428+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' acquired, ts : 5bda6490065e8d3e63923c31
2018-11-01T02:27:28.429+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' unlocked.
2018-11-01T02:27:34.431+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' acquired, ts : 5bda6496065e8d3e63923c32
2018-11-01T02:27:34.433+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' unlocked.
2018-11-01T02:27:39.843+0000 [LockPinger] cluster xxx.xxx.xxx.xxx:27019 pinged successfully at Thu Nov  1 02:27:39 2018 by distributed lock pinger 'xxx.xxx.xxx.xxx:27019/xxx.xxx.xxx.xxx:27017:1541038299:1804289383', sleeping for 30000ms
2018-11-01T02:27:40.435+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' acquired, ts : 5bda649c065e8d3e63923c33
2018-11-01T02:27:40.436+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' unlocked.
2018-11-01T02:27:46.439+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' acquired, ts : 5bda64a2065e8d3e63923c34
2018-11-01T02:27:46.440+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' unlocked.
2018-11-01T02:27:52.442+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' acquired, ts : 5bda64a8065e8d3e63923c35
2018-11-01T02:27:52.444+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' unlocked.
2018-11-01T02:27:58.446+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' acquired, ts : 5bda64ae065e8d3e63923c36
2018-11-01T02:27:58.447+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' unlocked.
2018-11-01T02:28:04.450+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' acquired, ts : 5bda64b4065e8d3e63923c37
2018-11-01T02:28:04.451+0000 [Balancer] distributed lock 'balancer/xxx.xxx.xxx.xxx:27017:1541038299:1804289383' unlocked.
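
The log lines above look consistent with routine balancer rounds (the lock is taken and released within a few milliseconds, roughly every six seconds) rather than with anything triggered by the insert. A small sketch for measuring this from a log excerpt follows; lock_hold_times is a hypothetical helper, and it assumes each line starts with an ISO timestamp as in the excerpt.

```python
from datetime import datetime


def lock_hold_times(lines):
    """Seconds between each balancer 'acquired' line and the next 'unlocked'."""
    holds = []
    acquired_at = None
    for line in lines:
        if '[Balancer]' not in line:
            continue  # skip LockPinger and other entries
        stamp = datetime.strptime(line.split(' ', 1)[0],
                                  '%Y-%m-%dT%H:%M:%S.%f%z')
        if 'acquired' in line:
            acquired_at = stamp
        elif 'unlocked' in line and acquired_at is not None:
            holds.append((stamp - acquired_at).total_seconds())
            acquired_at = None
    return holds
```

Hold times that stay in the millisecond range suggest the balancer found nothing to migrate during that round.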
jeff1evesque commented 6 years ago

Our earlier test with the custom select.py was incorrectly defined: it listed the collections of client.dh rather than querying client.test_database. With some adjustments, we have verified that our replicated mongodb is capable of writing, then reading the corresponding data:

root@ubuntu-xenial:/home/vagrant# python3 select.py
collection names: {'last': 'levesque', '_id': ObjectId('5bda64220761294433e3370a'), 'first': 'jeff'}
collection names: {'last': 'levesque', '_id': ObjectId('5bda653c076129444056cf69'), 'first': 'jeff'}
collection names: {'last': 'levesque', '_id': ObjectId('5bda655f076129444aa25e1f'), 'first': 'jeff'}
root@ubuntu-xenial:/home/vagrant# cat select.py
from pymongo import MongoClient
client = MongoClient('xxx.xxx.xxx.xxx:27017')

db = client.test_database
mycol = db.col_1.find()
for col in mycol:
    print('collection names: {}'.format(col))
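
The earlier select.py returned an empty list because attribute access on a MongoClient hands back a handle for any database name, including a mistyped one such as dh. A minimal sketch of a guard against that is shown below; get_existing_database is a hypothetical helper, and it assumes a pymongo version that provides list_database_names().

```python
def get_existing_database(client, name):
    """Return client[name], but fail loudly if no such database exists yet."""
    existing = client.list_database_names()
    if name not in existing:
        raise LookupError(
            'database {!r} not found; existing: {}'.format(name, sorted(existing))
        )
    return client[name]
```

With a real client, get_existing_database(client, 'dh') would raise instead of silently reporting no collections.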
jeff1evesque commented 6 years ago

I'll probably test the above scripts tomorrow, and see if the reddit data is stored into the mongo shard.

jeff1evesque commented 6 years ago

6c422ad: after execution of upload.py:

root@ubuntu-xenial:/home/vagrant/ist-664# python3 upload.py
Traceback (most recent call last):
  File "upload.py", line 49, in <module>
    post_id = col.insert_many(data).inserted_id
AttributeError: 'InsertManyResult' object has no attribute 'inserted_id'
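
The traceback arises because insert_many returns an InsertManyResult, which exposes inserted_ids (plural, a list) rather than the inserted_id attribute of insert_one. A minimal sketch of a batched upload that uses the correct attribute follows. The names batches and insert_in_batches and the batch size of 1000 are assumptions for illustration, not the contents of upload.py.

```python
from itertools import islice


def batches(documents, size):
    """Yield lists of at most `size` documents from any iterable."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_batches(col, documents, size=1000):
    """Insert documents batch by batch and return every generated _id."""
    ids = []
    for batch in batches(documents, size):
        # InsertManyResult carries a list: inserted_ids, not inserted_id
        ids.extend(col.insert_many(batch).inserted_ids)
    return ids
```

Here col is expected to be a pymongo collection such as db[collection]; len(insert_in_batches(col, data)) then gives the number of documents written.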
jeff1evesque commented 6 years ago

d473d98: the following indicates many documents were inserted:

root@ubuntu-xenial:/home/vagrant/ist-664# python3 check.py
document count: 147288
root@ubuntu-xenial:/home/vagrant/ist-664# cat check.py
from pymongo import MongoClient
from config import (
    mongos_endpoint,
    mongos_port,
    database,
    collection
)

client = MongoClient('{}:{}'.format(
    mongos_endpoint,
    mongos_port
))

# database + collection
db = client[database]
col = db[collection]

print('document count: {}'.format(col.count_documents({})))

Each document can be queried using find({}), but with this many documents the output is very long. For this reason, the corresponding output is not shown above.
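
To inspect a few documents without flooding the terminal, the result can be truncated before printing. The sketch below uses a hypothetical preview helper that works on any iterable, so a pymongo cursor such as col.find({}) can be passed in directly.

```python
from itertools import islice


def preview(documents, n=3):
    """Return at most the first `n` items from an iterable of documents."""
    return list(islice(documents, n))
```

With the check.py setup, preview(col.find({}), 5) would fetch only the first five documents; col.find({}).limit(5) is an equivalent server-side alternative.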