Last week jacobfan said he would look into the recommendation-accuracy problem on the 好药师 side today. Can you give us a concrete time?
@brook-peng I'll get to it tonight and figure out what is causing the batch server to fail.
In /home/pocoweb/log/supervisor_celeryd.log you will find log entries like the following:
[2014-12-11 13:03:56,326: ERROR/MainProcess] Task apps.apis.recommender.tasks.write_log[b45251ce-5c27-4feb-80bd-6b0faff65874] raised unexpected: InvalidStringData("strings in documents must be valid UTF-8: 'http://www.ehaoyao.com/search/\\xb6\\xc5\\xc0\\xd9\\xcb\\xb9?cts=613'",)
Traceback (most recent call last):
File "/home/pocoweb/cube/apps/poco/env/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/pocoweb/cube/apps/poco/env/local/lib/python2.7/site-packages/celery/app/trace.py", line 437, in __protected_call__
return self.run(*args, **kwargs)
File "/home/pocoweb/cube/apps/poco/poco/apps/apis/recommender/tasks.py", line 69, in write_log
_write_log(site_id, content)
File "/home/pocoweb/cube/apps/poco/poco/apps/apis/recommender/tasks.py", line 57, in _write_log
mongo_client.writeLogToMongo(site_id, content)
File "/home/pocoweb/cube/apps/poco/poco/common/mongo_client.py", line 506, in writeLogToMongo
c_raw_logs.insert(content)
File "/home/pocoweb/cube/apps/poco/env/local/lib/python2.7/site-packages/pymongo/collection.py", line 402, in insert
gen(), check_keys, self.uuid_subtype, client)
InvalidStringData: strings in documents must be valid UTF-8: 'http://www.ehaoyao.com/search/\xb6\xc5\xc0\xd9\xcb\xb9?cts=613'
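For reference, the failure is easy to reproduce in isolation. A minimal sketch, assuming a local mongod and the pymongo 2.x API visible in the traceback (the database and collection names below are made up):

from pymongo import MongoClient

# Python 2.7 / pymongo 2.x sketch; 'test' and 'raw_logs' are hypothetical names
coll = MongoClient('localhost', 27017)['test']['raw_logs']
# the GBK bytes from the log are not valid UTF-8, so BSON encoding rejects them
coll.insert({'url': 'http://www.ehaoyao.com/search/\xb6\xc5\xc0\xd9\xcb\xb9?cts=613'})
# raises InvalidStringData: strings in documents must be valid UTF-8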
The URL http://www.ehaoyao.com/search/\xb6\xc5\xc0\xd9\xcb\xb9?cts=613 in that message is not UTF-8 encoded but GBK:
# echo '00000000: b6c5 c0d9 cbb9' | xxd -r | iconv -f gbk -t utf8
杜蕾斯
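The same check in Python 2, with the byte string taken from the log above:

raw = '\xb6\xc5\xc0\xd9\xcb\xb9'
print raw.decode('gbk')   # -> 杜蕾斯
raw.decode('utf-8')       # -> UnicodeDecodeError: invalid start byte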
I checked, and the 好药师 pages themselves are all UTF-8 encoded, yet the refer and url values that come in through the API have become GBK.
So there are two possible directions for a fix. For now I still have not found how the problematic URL gets written into the log.
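One server-side direction, sketched with a hypothetical helper; the GBK fallback follows the analysis above, and the hook point (_write_log, just before mongo_client.writeLogToMongo) comes from the traceback:

def to_unicode(value):
    # Hypothetical helper: recursively convert byte strings in the log
    # document to unicode so pymongo never sees invalid UTF-8.
    if isinstance(value, str):
        try:
            return value.decode('utf-8')
        except UnicodeDecodeError:
            # assume GBK per the analysis above; 'replace' keeps the
            # celery task alive on anything unexpected
            return value.decode('gbk', 'replace')
    if isinstance(value, dict):
        return dict((k, to_unicode(v)) for k, v in value.iteritems())
    if isinstance(value, list):
        return [to_unicode(v) for v in value]
    return value

# in _write_log: content = to_unicode(content), then writeLogToMongo(site_id, content)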
@brook-peng Hi 小刀, we have opened a pull request; once review is complete we can deploy the fix.
OK, thanks.
@jacobfan
After changing the temporary path, only one task's calculation failed:
Traceback (most recent call last):
File "/cube/apps/poco/poco/services/batch/server.py", line 155, in _execJob
callable()
File "/cube/apps/poco/poco/services/batch/server.py", line 202, in do_hive_based_calculations
connection, SITE_ID, self.getWorkDir(), backfilled_raw_logs_path)
File "/home/pocoweb/cube/apps/poco/poco/services/batch/statistics/hive_based_calculations.py", line 633, in hive_based_calculations
do_calculations(connection, site_id, work_dir, backfilled_raw_logs_path, client)
File "/home/pocoweb/cube/apps/poco/poco/services/batch/statistics/hive_based_calculations.py", line 615, in do_calculations
calc_click_rec_buy(site_id, connection, client)
File "/home/pocoweb/cube/apps/poco/poco/services/batch/statistics/hive_based_calculations.py", line 52, in wrapped_function
result = function(*arg, **kws)
File "/home/pocoweb/cube/apps/poco/poco/services/batch/statistics/hive_based_calculations.py", line 453, in calc_click_rec_buy
client.execute("INSERT OVERWRITE TABLE rec_buy "
File "pylib/hive_service/ThriftHive.py", line 63, in execute
self.recv_execute()
File "pylib/hive_service/ThriftHive.py", line 84, in recv_execute
raise result.ex
HiveServerException: HiveServerException(errorCode=9, message='Query returned non-zero code: 9, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask', SQLState='08S01')
2014-12-14 03:47:43|cd23526b-7e5b-4da5-b8a9-f6aefe4f2fc9|INFO|root|FlowEnd: HiveBasedStatisticsFlow
I googled that error code, 08S01; some reports point to network problems, others to memory configuration (http://gtzygtzygtzy.blog.163.com/blog/static/124198088201364734321/). Our logs show a broken pipe, so it may be a network problem:
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:576:2014-12-14 03:47:43,160 INFO exec.ScriptOperator (ScriptOperator.java:processLine(518)) - ErrorStreamProcessor calling reporter.progress()
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:578:2014-12-14 03:47:43,163 ERROR exec.ScriptOperator (ScriptOperator.java:processOp(343)) - Error in writing to script: Broken pipe
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:579:2014-12-14 03:47:43,163 INFO exec.ScriptOperator (ScriptOperator.java:run(565)) - StreamThread ErrorProcessor done
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:580:2014-12-14 03:47:43,164 INFO exec.ScriptOperator (ScriptOperator.java:displayBrokenPipeInfo(230)) - The script did not consume all input data. This is considered as an error.
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:582:2014-12-14 03:47:43,172 FATAL ExecReducer (ExecReducer.java:reduce(267)) - org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"75047","reducesinkkey1":1.418286216575E9},"value":{"_col0":"75047","_col1":1.418286216575E9,"_col2":"0","_col3":"V","_col4":"ViewItem","_col5":"","_col6":"2348","_col7":0.0,"_col8":0,"_col9":"0"},"alias":0}
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:611:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"75047","reducesinkkey1":1.418286216575E9},"value":{"_col0":"75047","_col1":1.418286216575E9,"_col2":"0","_col3":"V","_col4":"ViewItem","_col5":"","_col6":"2348","_col7":0.0,"_col8":0,"_col9":"0"},"alias":0}
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:616:Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"75047","reducesinkkey1":1.418286216575E9},"value":{"_col0":"75047","_col1":1.418286216575E9,"_col2":"0","_col3":"V","_col4":"ViewItem","_col5":"","_col6":"2348","_col7":0.0,"_col8":0,"_col9":"0"},"alias":0}
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:641:2014-12-14 03:47:43,514 ERROR exec.ExecDriver (SessionState.java:printError(343)) - Ended Job = job_local_0001 with errors
I am not sure this happens on every run; I'll watch it for another day.
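For context on the "did not consume all input data" line: ScriptOperator means the query pipes rows through an external script (a Hive TRANSFORM), and that script exited before draining stdin, which Hive then reports as a broken pipe. A minimal sketch of the shape such a script must have (the columns here are made up):

import sys

# Hive writes tab-separated rows to stdin; exiting before reading them all
# is exactly the "Error in writing to script: Broken pipe" in the log above
for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    # per-row processing would go here
    sys.stdout.write('\t'.join(fields) + '\n')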
Status: RUNNING(16 hours 23 seconds since the beginning of calculation.)
The server time is wrong: at the instant the batch calculation starts there is already a 16-hour offset.
@sunliwen This morning it showed 7 hours of computation, so the 16h you saw was probably genuine running time.
@jacobfan I looked at the records where batch raises exceptions. All kinds of records can trigger it, but what they have in common is that their req_id is 0. I don't know whether that is related.
[pocoweb@en83:/cube/services/hive/logs/hive-server]$ pwd
/cube/services/hive/logs/hive-server
[pocoweb@en83:/cube/services/hive/logs/hive-server]$ grep ' ExecReducer (ExecReducer.java:reduce(267)) ' www-data_201412*
@lisztli Status: RUNNING(16 hours 0 seconds since the beginning of calculation.)
I just triggered another calculation, and again it shows 16 hours.
@sunliwen
The current error in /cube/log/supervisord_batch_server.log is:
HiveServerException: HiveServerException(errorCode=9, message='Query returned non-zero code: 9, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask', SQLState='08S01')
/cube/services/hive/logs/hive-server/www-data_20141222054949_650ba98d-4c0d-47b3-a6b2-029fe485ede5.log
The Hive exception is visible at the bottom of that file.
The 16 hours show up because TIME_ZONE in the Django settings is UTC while the system is configured as 'Asia/Shanghai'; presumably the 8-hour offset is applied once when the start time is recorded and once when the elapsed time is displayed, compounding to 16 hours. A pretty basic mistake.
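A minimal sketch of the change, assuming the standard Django settings module:

# settings.py -- make Django agree with the system clock
TIME_ZONE = 'Asia/Shanghai'  # was 'UTC'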
I made the change on the server and verified that both the 1-day and 2-day calculations complete successfully.
I have now set the calculation window to 100 days, which may take a few hours. If it finishes successfully by morning, this issue can be closed. Let's wait for the result.
OK, let's see how it goes. Thanks for the effort.
When calculating 100 days, only one computation step failed. #105 records the batch problems seen in the 100-day run. The remaining failed step still needs handling.
小刀 from 好药师 reports that the batch server calculations have problems:
This may be one of the causes of the currently inaccurate recommendations ( #90 ).