Refactoring the data collection stage

Scraping
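- [ ] Write the AWS Lambda function (data scraping script)
  - [x] Test locally that the scraping script works
  - [x] Test the AWS Lambda Docker image locally
    - [ ] The script itself is confirmed to work; still need to think more about how to modularize the code and where to draw the module boundaries
    - [ ] Lambda's maximum execution time is 15 minutes. With 183 book categories and the top 60 books collected per category, that is 10,980 page redirects (183 x 60). Once network time, random waits, and the time to read the data and send it to RDS are factored in, 15 minutes is nowhere near enough
    - [ ] The categories probably need to be distributed across multiple Lambdas, but how to split them is still undecided...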
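
One way to reason about the split is to work backwards from the 15-minute limit. The sketch below is only an estimate under stated assumptions: `SECONDS_PER_BOOK` and `SAFETY_FACTOR` are guessed values that are not in the notes above, and the category ids are placeholders.

```python
LAMBDA_TIMEOUT_S = 15 * 60   # Lambda's maximum execution time in seconds
NUM_CATEGORIES = 183         # from the notes above
BOOKS_PER_CATEGORY = 60      # top 60 books per category
SECONDS_PER_BOOK = 3.0       # ASSUMPTION: page load + random wait + DB write
SAFETY_FACTOR = 0.8          # ASSUMPTION: plan to use only 80% of the timeout


def categories_per_invocation(seconds_per_book=SECONDS_PER_BOOK):
    """Return how many whole categories fit in one invocation's time budget."""
    budget = LAMBDA_TIMEOUT_S * SAFETY_FACTOR
    per_category = BOOKS_PER_CATEGORY * seconds_per_book
    return max(1, int(budget // per_category))


def split_categories(category_ids, chunk_size):
    """Split the category list into consecutive chunks of at most chunk_size."""
    return [category_ids[i:i + chunk_size]
            for i in range(0, len(category_ids), chunk_size)]


category_ids = list(range(NUM_CATEGORIES))  # placeholder ids
chunk_size = categories_per_invocation()
chunks = split_categories(category_ids, chunk_size)

total_visits = NUM_CATEGORIES * BOOKS_PER_CATEGORY
print("total page visits:", total_visits)
print("serial runtime (min):", total_visits * SECONDS_PER_BOOK / 60)
print("categories per invocation:", chunk_size)
print("invocations needed:", len(chunks))
```

With the assumed 3 seconds per book, a serial run would take about 549 minutes, and the budget works out to 4 categories per invocation, i.e. 46 invocations. Measuring the real per-book time first would make this split less of a guess.
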
DB setting
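- [x] Set up a ~relational DB~ DynamoDB and collect the data there
  - ~Would it be better to set up MySQL as a Docker container on AWS Lambda, or to use AWS RDS?~ -> I apparently had not understood Lambda properly.. what gets dockerized is the function's execution environment..!
  - ~Deploying a MySQL container to Lambda would require AWS ECR (the container registry) anyway, and that is billed by the storage used~
  - container : lambda + ECR
  - solution : lambda + RDS
  - RDS is free for only 12 months, while DynamoDB is free, so switched to DynamoDB
  - docker lambda + DynamoDB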
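
As a rough illustration of the "docker lambda + DynamoDB" choice, the sketch below writes one category's scraped books through the boto3 `Table.batch_writer()` context manager. The key names (`category_id`, `rank`), the book fields (`title`, `url`), and the table name in the comment are all hypothetical; the actual schema is not specified in this issue.

```python
def to_item(category_id, rank, book):
    """Build one DynamoDB item. All attribute names here are hypothetical."""
    return {
        "category_id": str(category_id),  # assumed partition key
        "rank": rank,                     # assumed sort key (1..60)
        "title": book["title"],
        "url": book["url"],
    }


def save_books(table, category_id, books):
    """Write the books of one category and return how many were queued.

    `table` is expected to behave like a boto3 DynamoDB Table resource:
    batch_writer() buffers put_item calls and flushes them in batches.
    """
    with table.batch_writer() as batch:
        for rank, book in enumerate(books, start=1):
            batch.put_item(Item=to_item(category_id, rank, book))
    return len(books)


# Usage inside the Lambda handler (needs AWS credentials, so left commented out):
#   import boto3
#   table = boto3.resource("dynamodb").Table("books")  # hypothetical table name
#   save_books(table, 12, [{"title": "...", "url": "..."}])
```

Passing the table in as an argument keeps the function easy to test locally with a stand-in object before deploying.
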
trigger setting
cloud - local network connection, security settings

aws lambda blueprints

basic structure

import json

print('Loading function')

def lambda_handler(event, context):
    #print("Received event: " + json.dumps(event, indent=2))
    print("value1 = " + event['key1'])
    print("value2 = " + event['key2'])
    print("value3 = " + event['key3'])
    return event['key1']  # Echo back the first key value
    #raise Exception('Something went wrong')
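
A handler like this can be exercised locally by calling it with a plain dict. The snippet below repeats the handler so it runs on its own; the sample event values are placeholders.

```python
def lambda_handler(event, context):
    # Same body as the blueprint above, repeated so this snippet runs alone.
    print("value1 = " + event['key1'])
    print("value2 = " + event['key2'])
    print("value3 = " + event['key3'])
    return event['key1']


# Sample event with the three keys the handler reads (placeholder values).
test_event = {"key1": "value1", "key2": "value2", "key3": "value3"}

# The handler never touches `context`, so None is enough for a local call.
result = lambda_handler(test_event, None)
print(result)
```
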

create a microservice that interacts with a DDB table

simple backend(read, write to DynamoDB) + restful API endpoint (Amazon API Gateway)

import boto3
import json

print('Loading function')
dynamo = boto3.client('dynamodb')

def respond(err, res=None):
    return {
        'statusCode': '400' if err else '200',
        'body': str(err) if err else json.dumps(res),  # Python 3 exceptions have no .message
        'headers': {
            'Content-Type': 'application/json',
        },
    }

def lambda_handler(event, context):
    '''Demonstrates a simple HTTP endpoint using API Gateway. You have full access to the request and response payload, including headers and status code.

    To scan a DynamoDB table, make a GET request with the TableName as a query string parameter. 
    To put, update, or delete an item, make a POST, PUT, or DELETE request respectively, passing in the payload to the DynamoDB API as a JSON body.
    '''
    #print("Received event: " + json.dumps(event, indent=2))

    operations = {
        'DELETE': lambda dynamo, x: dynamo.delete_item(**x),
        'GET': lambda dynamo, x: dynamo.scan(**x),
        'POST': lambda dynamo, x: dynamo.put_item(**x),
        'PUT': lambda dynamo, x: dynamo.update_item(**x),
    }

    operation = event['httpMethod']
    if operation in operations:
        payload = event['queryStringParameters'] if operation == 'GET' else json.loads(event['body'])
        return respond(None, operations[operation](dynamo, payload))
    else:
        return respond(ValueError('Unsupported method "{}"'.format(operation)))
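
To make the event handling concrete, here is a small sketch of the two event shapes the handler distinguishes and how the payload is picked from each. Only the fields the handler reads are shown; the table name `books` and the attributes `isbn` and `title` are hypothetical.

```python
import json


def extract_payload(event):
    """Mirror the blueprint: GET reads the query string, others read the JSON body."""
    if event["httpMethod"] == "GET":
        return event["queryStringParameters"]
    return json.loads(event["body"])


# Hypothetical table and attribute names, for illustration only.
get_event = {
    "httpMethod": "GET",
    "queryStringParameters": {"TableName": "books"},
    "body": None,
}
post_event = {
    "httpMethod": "POST",
    "queryStringParameters": None,
    "body": json.dumps({
        "TableName": "books",
        # The low-level DynamoDB client expects typed attribute values.
        "Item": {"isbn": {"S": "0000000000"}, "title": {"S": "example"}},
    }),
}

scan_kwargs = extract_payload(get_event)   # would be passed to dynamo.scan(**...)
put_kwargs = extract_payload(post_event)   # would be passed to dynamo.put_item(**...)
print(scan_kwargs)
print(put_kwargs)
```
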

batch job

submits an AWS Batch job and returns the jobId

import json
import boto3

print('Loading function')

batch = boto3.client('batch')

def lambda_handler(event, context):
    # Log the received event
    print("Received event: " + json.dumps(event, indent=2))
    # Get parameters for the SubmitJob call
    # http://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitJob.html
    jobName = event['jobName']
    jobQueue = event['jobQueue']
    jobDefinition = event['jobDefinition']
    # containerOverrides and parameters are optional
    if event.get('containerOverrides'):
        containerOverrides = event['containerOverrides']
    else:
        containerOverrides = {}
    if event.get('parameters'):
        parameters = event['parameters']
    else:
        parameters = {}

    try:
        # Submit a Batch Job
        response = batch.submit_job(jobQueue=jobQueue, jobName=jobName, jobDefinition=jobDefinition,
                                    containerOverrides=containerOverrides, parameters=parameters)
        # Log response from AWS Batch
        print("Response: " + json.dumps(response, indent=2))
        # Return the jobId
        jobId = response['jobId']
        return {
            'jobId': jobId
        }
    except Exception as e:
        print(e)
        message = 'Error submitting Batch Job'
        print(message)
        raise Exception(message)

returns the current status of an AWS Batch job

import json
import boto3

print('Loading function')

batch = boto3.client('batch')

def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2)) # Log the received event
    jobId = event['jobId'] # Get jobId from the event

    try:
        # Call DescribeJobs
        response = batch.describe_jobs(jobs=[jobId])
        # Log response from AWS Batch
        print("Response: " + json.dumps(response, indent=2))
        # Return the jobStatus
        jobStatus = response['jobs'][0]['status']
        return jobStatus
    except Exception as e:
        print(e)
        message = 'Error getting Batch Job status'
        print(message)
        raise Exception(message)
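
The two Batch blueprints are usually combined by submitting once and then checking the status repeatedly. The sketch below shows one possible polling loop for a local script; `wait_for_job`, `poll_seconds`, and `max_polls` are hypothetical names and values, and the client is passed in so the function can be tried without AWS access.

```python
import time

# Terminal AWS Batch job states; every other state means the job is still in progress.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_job(batch, job_id, poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Poll describe_jobs until the job finishes or max_polls is exhausted.

    `batch` is expected to behave like boto3.client('batch').
    """
    for _ in range(max_polls):
        response = batch.describe_jobs(jobs=[job_id])
        status = response["jobs"][0]["status"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("job {} did not finish after {} polls".format(job_id, max_polls))


# Usage (needs AWS credentials, so left commented out):
#   import boto3
#   status = wait_for_job(boto3.client("batch"), "<jobId returned by the submit function>")
```
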

Selenium in Lambda

available options

1. on-premise (literally on my own laptop, i.e. the local environment)
2. EC2 - kinda on-premise but in the cloud environment
3. ECS - Elastic Container Service
- independence
- docker file
- advantage of running it on AWS? integration with other services
4. Lambda
- monthly free tier
- max execution time = 15 min
- need to split the whole scraping task into subtasks and run them in parallel
  - or fall back to option 3 (ECS)
- by default, Lambda gives you access to Python's built-in (standard library) modules,
- or you can build your own package https://docs.aws.amazon.com/lambda/latest/dg/python-package.html
- or use Lambda Layers
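
For the Lambda option, running subtasks in parallel could be done with a small dispatcher that invokes a worker function asynchronously, once per chunk of categories. This is a sketch under assumptions: the worker function name, the payload keys (`chunk_index`, `category_ids`), and the chunk contents are hypothetical. The client is passed in and is expected to behave like `boto3.client('lambda')`.

```python
import json


def dispatch_chunks(lambda_client, function_name, chunks):
    """Invoke the worker Lambda asynchronously once per chunk.

    Returns the number of invocations that were accepted (HTTP 202).
    """
    accepted = 0
    for index, chunk in enumerate(chunks):
        payload = {"chunk_index": index, "category_ids": chunk}  # hypothetical keys
        response = lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous: returns without waiting for the result
            Payload=json.dumps(payload).encode("utf-8"),
        )
        if response["StatusCode"] == 202:
            accepted += 1
    return accepted


# Usage (needs AWS credentials, so left commented out):
#   import boto3
#   chunks = [[0, 1, 2, 3], [4, 5, 6, 7]]           # placeholder category ids
#   dispatch_chunks(boto3.client("lambda"), "book-scraper-worker", chunks)
```

Each worker then only has to finish its own chunk within the 15-minute limit, and the handler would read `event["category_ids"]` to know which categories to scrape.
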

what is needed

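Add the Dockerfile to the project, build it, and push the image to Docker Hub
- instead of repeating the build steps every time -> push to Docker Hub and rewrite a slimmer Dockerfile based on that image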

[x] lambda with docker
- [x] Verify in a local test that the Chrome WebDriver and Selenium work
- [x] Test that the data is fetched correctly
- [x] Verify that the deployed Lambda works
[x] load data to aws RDS
- [x] Create the RDS instance
- [x] save returned scraped data to RDS - connection

seoyeong200 / Book-data-Pipeline

Refactor the data collection stage #2
