heehehe / job-trend

[DE4E] Project to build and apply a data extraction pipeline for developer job postings
https://job-trend.streamlit.app

Modify Airflow DAG #44

Open heehehe opened 4 months ago

heehehe commented 4 months ago

Resolve #13

Part resolved with Firefox - get_url_list is currently failing with the WebDriverException below 🥲. I think it is better to split that out and handle it separately, so I am requesting review first! --> Chrome cannot be used because the driver and Chrome versions do not match, but switching to Firefox resolves it!!

```
*** Found local files:
***   * /opt/airflow/logs/dag_id=job_trend_daily/run_id=scheduled__2024-03-08T00:00:00+00:00/task_id=wanted.get_url_list/attempt=3.log
[2024-03-09, 03:31:16 UTC] {taskinstance.py:1979} INFO - Dependencies all met for dep_context=non-requeueable deps ti=
[2024-03-09, 03:31:16 UTC] {taskinstance.py:1979} INFO - Dependencies all met for dep_context=requeueable deps ti=
[2024-03-09, 03:31:16 UTC] {taskinstance.py:2193} INFO - Starting attempt 3 of 3
[2024-03-09, 03:31:16 UTC] {taskinstance.py:2214} INFO - Executing on 2024-03-08 00:00:00+00:00
[2024-03-09, 03:31:16 UTC] {standard_task_runner.py:60} INFO - Started process 423 to run task
[2024-03-09, 03:31:16 UTC] {standard_task_runner.py:87} INFO - Running: ['***', 'tasks', 'run', 'job_trend_daily', 'wanted.get_url_list', 'scheduled__2024-03-08T00:00:00+00:00', '--job-id', '19', '--raw', '--subdir', 'DAGS_FOLDER/deploy_daily.py', '--cfg-path', '/tmp/tmp9p2seab7']
[2024-03-09, 03:31:16 UTC] {standard_task_runner.py:88} INFO - Job 19: Subtask wanted.get_url_list
[2024-03-09, 03:31:16 UTC] {task_command.py:423} INFO - Running on host 43e3f6697550
[2024-03-09, 03:31:17 UTC] {taskinstance.py:2510} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='admin'
AIRFLOW_CTX_DAG_ID='job_trend_daily'
AIRFLOW_CTX_TASK_ID='wanted.get_url_list'
AIRFLOW_CTX_EXECUTION_DATE='2024-03-08T00:00:00+00:00'
AIRFLOW_CTX_TRY_NUMBER='3'
AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-03-08T00:00:00+00:00'
[2024-03-09, 03:31:17 UTC] {taskinstance.py:2728} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 444, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 200, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 217, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/crawling.py", line 670, in get_url_list
    driver = self.driver()
  File "/home/airflow/.local/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 45, in __init__
    super().__init__(
  File "/home/airflow/.local/lib/python3.8/site-packages/selenium/webdriver/chromium/webdriver.py", line 50, in __init__
    self.service.start()
  File "/home/airflow/.local/lib/python3.8/site-packages/selenium/webdriver/common/service.py", line 102, in start
    self.assert_process_still_running()
  File "/home/airflow/.local/lib/python3.8/site-packages/selenium/webdriver/common/service.py", line 115, in assert_process_still_running
    raise WebDriverException(f"Service {self._path} unexpectedly exited. Status code was: {return_code}")
selenium.common.exceptions.WebDriverException: Message: Service /home/airflow/.cache/selenium/chromedriver/linux64/122.0.6261.111/chromedriver unexpectedly exited. Status code was: 127
[2024-03-09, 03:31:17 UTC] {taskinstance.py:1149} INFO - Marking task as FAILED. dag_id=job_trend_daily, task_id=wanted.get_url_list, execution_date=20240308T000000, start_date=20240309T033116, end_date=20240309T033117
[2024-03-09, 03:31:17 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 19 for task wanted.get_url_list (Message: Service /home/airflow/.cache/selenium/chromedriver/linux64/122.0.6261.111/chromedriver unexpectedly exited. Status code was: 127
; 423)
[2024-03-09, 03:31:17 UTC] {local_task_job_runner.py:234} INFO - Task exited with return code 1
[2024-03-09, 03:31:17 UTC] {taskinstance.py:3309} INFO - 0 downstream tasks scheduled from follow-on schedule check
```
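
For reference, a minimal sketch of what the Firefox switch could look like. It assumes Selenium 4 and a Firefox binary installed in the Airflow image. `firefox_arguments` and `make_firefox_driver` are hypothetical helper names for illustration, not the code in crawling.py, and running headless is an assumption based on the worker container having no display.

```python
from typing import List


def firefox_arguments(headless: bool = True) -> List[str]:
    """Return Firefox command-line flags (hypothetical helper).

    Assumption: the Airflow worker has no display, so headless is the default.
    """
    return ["-headless"] if headless else []


def make_firefox_driver(headless: bool = True):
    """Start a Firefox WebDriver session in place of webdriver.Chrome().

    Hypothetical helper; assumes Firefox itself is installed in the image.
    """
    # Imported lazily so the module still imports where selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for arg in firefox_arguments(headless):
        options.add_argument(arg)
    return webdriver.Firefox(options=options)
```

A caller such as get_url_list would then use `driver = make_firefox_driver()` and call `driver.quit()` in a `finally` block, so a failed crawl does not leave a browser process behind in the worker.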