jeff1evesque / ist-652

Syracuse IST-652 Final Project

Pipe scraped data to machine-learning #21

Open jeff1evesque opened 6 years ago

jeff1evesque commented 6 years ago

We need to pipe our scraped Wikipedia and Twitter data to our machine-learning application.

jeff1evesque commented 6 years ago

Before we send our data payload to our endpoint, we need to restructure the data into a form the endpoint accepts. Each dataset instance will therefore need to be converted into a structure similar to the one below:

{
    "properties": {
            "session_name": "sample_svm_title",
            "collection": "svm-2",
            "dataset_type": "file_upload",
            "session_type": "data_new",
            "model_type": "svm",
            "stream": "True"
    },
    "dataset": [{
            "dependent-variable": "dep-variable-1",
            "independent-variables": [{
                "indep-variable-1": 23.45,
                "indep-variable-2": 98.01,
                "indep-variable-4": 325,
                "indep-variable-5": 54.64,
                "indep-variable-6": 0.002,
                "indep-variable-7": 23,
                "indep-variable-3": 0.432
            }]
        },
        {
            "dependent-variable": "dep-variable-4",
            "independent-variables": [{
                    "indep-variable-1": 22.1,
                    "indep-variable-2": 95.96,
                    "indep-variable-4": 342,
                    "indep-variable-5": 66.67,
                    "indep-variable-6": 0.001,
                    "indep-variable-7": 32,
                    "indep-variable-3": 0.743
                },
                {
                    "indep-variable-1": 20.71,
                    "indep-variable-2": 99.33,
                    "indep-variable-4": 342,
                    "indep-variable-5": 75.67,
                    "indep-variable-6": 0.001,
                    "indep-variable-7": 30,
                    "indep-variable-3": 0.648
                }
            ]
        },
        {
            "dependent-variable": "dep-variable-5",
            "independent-variables": [{
                    "indep-variable-1": 23.27,
                    "indep-variable-2": 95.03,
                    "indep-variable-4": 295,
                    "indep-variable-5": 55.83,
                    "indep-variable-6": 0.001,
                    "indep-variable-7": 27,
                    "indep-variable-3": 0.488
                },
                {
                    "indep-variable-1": 23.27,
                    "indep-variable-2": 95.03,
                    "indep-variable-4": 295,
                    "indep-variable-5": 55.83,
                    "indep-variable-6": 0.001,
                    "indep-variable-7": 27,
                    "indep-variable-3": 0.488
                },
                {
                    "indep-variable-1": 19.99,
                    "indep-variable-2": 97.78,
                    "indep-variable-4": 303,
                    "indep-variable-5": 58.88,
                    "indep-variable-6": 0.001,
                    "indep-variable-7": 29,
                    "indep-variable-3": 0.638
                }
            ]
        },
        {
            "dependent-variable": "dep-variable-3",
            "independent-variables": [{
                "indep-variable-1": 22.67,
                "indep-variable-2": 101.21,
                "indep-variable-4": 427,
                "indep-variable-5": 75.45,
                "indep-variable-6": 0.002,
                "indep-variable-7": 26,
                "indep-variable-3": 0.832
            }]
        }
    ]
}
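
A minimal sketch of that restructuring, assuming the scraped data is already available as `(label, features)` pairs. The function name `build_payload`, its parameters, and the sample values are hypothetical; only the key names and the `properties` values are taken from the payload above.

```python
import json


def build_payload(observations, session_name, collection, model_type="svm"):
    """Group (label, features) pairs into the payload structure shown above.

    `observations` is an iterable of (dependent_variable, feature_dict) pairs;
    this input shape is an assumption, not something the scraper guarantees.
    """
    grouped = {}
    for label, features in observations:
        # every observation sharing a label lands in the same list
        grouped.setdefault(label, []).append(dict(features))

    return {
        "properties": {
            "session_name": session_name,
            "collection": collection,
            "dataset_type": "file_upload",
            "session_type": "data_new",
            "model_type": model_type,
            "stream": "True",
        },
        "dataset": [
            {"dependent-variable": label, "independent-variables": rows}
            for label, rows in grouped.items()
        ],
    }


# hypothetical sample values, for illustration only
example = build_payload(
    [
        ("dep-variable-1", {"indep-variable-1": 23.45, "indep-variable-2": 98.01}),
        ("dep-variable-4", {"indep-variable-1": 22.1, "indep-variable-2": 95.96}),
        ("dep-variable-4", {"indep-variable-1": 20.71, "indep-variable-2": 99.33}),
    ],
    session_name="sample_svm_title",
    collection="svm-2",
)
print(json.dumps(example, indent=4))
```

Grouping by label keeps one `dataset` entry per dependent variable, with each observation appended to its `independent-variables` list, which matches the shape of the sample payload.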
jeff1evesque commented 6 years ago

Our current run.py execution yields the following traceback:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 137, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 67, in create_connection
    for res in socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM):
  File "/usr/lib/python3.5/socket.py", line 732, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 787, in _validate_conn
    conn.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 217, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 146, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f454a51b438>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 376, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 610, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 273, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='True', port=8585): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f454a51b438>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 61, in <module>
    run(*argv[1:])
  File "run.py", line 57, in run
    port=port
  File "/home/ubuntu/ist-652/utility/wikipedia_scraper.py", line 78, in wikipedia_scraper
    data={'user[login]': username, 'user[password]': password}
  File "/usr/lib/python3/dist-packages/requests/api.py", line 107, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 437, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='True', port=8585): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f454a51b438>: Failed to establish a new connection: [Errno -2] Name or service not known',))
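
Note that the traceback reports `host='True'`, which suggests that a positional argument passed through `run(*argv[1:])` ended up in the endpoint parameter. The actual signature of `run` is not shown in this issue, so the following is only a sketch of how named arguments could avoid that kind of mix-up. All flag names are hypothetical; the default port of 8585 comes from the traceback.

```python
import argparse


def parse_args(argv=None):
    """Parse named command line arguments (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Scrape data and send it to the machine-learning endpoint."
    )
    parser.add_argument("--endpoint", required=True, help="hostname or IP address")
    parser.add_argument("--port", type=int, default=8585)
    parser.add_argument("--username")
    parser.add_argument("--password")
    # a boolean switch cannot be mistaken for a hostname, unlike a positional 'True'
    parser.add_argument("--stream", action="store_true")
    return parser.parse_args(argv)
```

With named flags, an invocation such as `python3 run.py --endpoint 11.11.11.11 --port 8585 --stream` no longer depends on argument order.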
jeff1evesque commented 6 years ago

We need to test the following snippet:

                # get access token
                login = requests.post(
                    'https://{}:{}/login'.format(endpoint, port),
                    headers={'Content-Type': 'application/json'},
                    data={'user[login]': username, 'user[password]': password}
                )
                token = login.json()['access_token']
                print('token: {}'.format(token))

If we cannot get the above working in a reasonable amount of time, we can temporarily omit it, since the current endpoint allows anonymous requests. However, anonymous access may not remain available long term. Additionally, we need to determine whether the incoming port rules need to be adjusted, for both the http and https protocols.
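
One way to keep the login optional is to treat a failed login as "no token" and continue anonymously. This is a rough sketch, not the implementation in `wikipedia_scraper.py`: the helper name `fetch_token` is hypothetical, and `post` is assumed to be `requests.post` or any callable with the same signature.

```python
def fetch_token(post, endpoint, port, username, password):
    """Return an access token, or None when the login fails for any reason.

    `post` is expected to behave like `requests.post` (an assumption).
    """
    try:
        # Request shape copied from the snippet above; whether the server
        # expects a form-encoded or a JSON body is not confirmed here.
        response = post(
            'https://{}:{}/login'.format(endpoint, port),
            headers={'Content-Type': 'application/json'},
            data={'user[login]': username, 'user[password]': password},
        )
        return response.json()['access_token']
    except (OSError, ValueError, KeyError):
        # OSError covers connection and SSL failures raised by requests,
        # ValueError a non-JSON body, KeyError a missing 'access_token'.
        return None
```

A caller could then run `token = fetch_token(requests.post, endpoint, port, username, password)` and send anonymous requests whenever `token` is `None`.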

jeff1evesque commented 6 years ago

When running the following test.py script:

import requests

username = 'jeff1evesque'
password = 'xxxxxxxxxx'
endpoint = '11.11.11.11'
port = 8585

login = requests.post(
    'https://{}:{}/login'.format(endpoint, port),
    headers={'Content-Type': 'application/json'},
    data={'user[login]': username, 'user[password]': password}
)
token = login.json()['access_token']

print('token: {}'.format(token))

We receive the following traceback:

root@ubuntu-xenial:/vagrant# python3 test.py
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 787, in _validate_conn
    conn.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 252, in connect
    ssl_version=resolved_ssl_version)
  File "/usr/lib/python3/dist-packages/urllib3/util/ssl_.py", line 305, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/lib/python3.5/ssl.py", line 377, in wrap_socket
    _context=self)
  File "/usr/lib/python3.5/ssl.py", line 752, in __init__
    self.do_handshake()
  File "/usr/lib/python3.5/ssl.py", line 988, in do_handshake
    self._sslobj.do_handshake()
  File "/usr/lib/python3.5/ssl.py", line 633, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 376, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 589, in urlopen
    raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 11, in <module>
    data={'user[login]': username, 'user[password]': password}
  File "/usr/lib/python3/dist-packages/requests/api.py", line 107, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 447, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)
jeff1evesque commented 6 years ago

One probable cause is that the corresponding machine-learning application uses a self-signed certificate. Additionally, all http requests are redirected to https. Several possible solutions exist, if implementing the API is still desired:
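
One option, sketched here under assumptions, is to trust the server's self-signed certificate explicitly rather than disabling verification. The `verify` argument of `requests` accepts either a boolean or a path to a certificate file. The helper name `verify_setting` and the certificate path in the usage note are hypothetical.

```python
import os


def verify_setting(ca_bundle_path=None):
    """Return a value suitable for the `verify` argument of `requests` calls.

    If a certificate file exists at `ca_bundle_path`, trust it; otherwise
    fall back to the default system certificate store (verify=True).
    """
    if ca_bundle_path and os.path.isfile(ca_bundle_path):
        return ca_bundle_path
    return True
```

The result could be passed as `requests.post(url, data=..., verify=verify_setting('/path/to/server-cert.pem'))`. Verification may still fail if the certificate was not issued for the hostname or IP address in the URL. Passing `verify=False` would also silence the error, but it removes protection against impersonation of the endpoint, so it is best limited to local testing.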

jeff1evesque commented 6 years ago

This is a two-fold problem:

Therefore, we'll temporarily merge the changes in this issue, and return to it if time permits. In the meantime, we'll develop an else condition that generates a csv file containing the article name and the predicted article category.
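
The csv fallback could look roughly like the following sketch. The function name `write_predictions`, the column headers, and the output filename in the usage note are assumptions, since the issue only states that the file holds the article name and its predicted category.

```python
import csv


def write_predictions(rows, handle):
    """Write (article, predicted_category) pairs as csv to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(['article', 'predicted_category'])
    for article, category in rows:
        writer.writerow([article, category])
```

A caller would open the target file with `newline=''`, as the `csv` module recommends, for example `with open('predictions.csv', 'w', newline='') as f: write_predictions(rows, f)`.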