sunliwen / poco

Poco v1.6

[好药师] Batch Server calculation has problems #92

Closed jacobfan closed 5 years ago

jacobfan commented 9 years ago

好药师's "小刀" reported that the batch server's calculations are wrong:

This may be one of the causes of the currently inaccurate recommendation results ( #90 ).


brook-peng commented 9 years ago

Last week, jacobfan said he would analyze the recommendation-accuracy problem on the 好药师 side today. Could you give us a definite time?

aoeuidht commented 9 years ago

@brook-peng I will work on it tonight and find out what is making the batch server fail.

aoeuidht commented 9 years ago

In /home/pocoweb/log/supervisor_celeryd.log you can find log entries like the following:

[2014-12-11 13:03:56,326: ERROR/MainProcess] Task apps.apis.recommender.tasks.write_log[b45251ce-5c27-4feb-80bd-6b0faff65874] raised unexpected: InvalidStringData("strings in documents must be valid UTF-8: 'http://www.ehaoyao.com/search/\\xb6\\xc5\\xc0\\xd9\\xcb\\xb9?cts=613'",)
Traceback (most recent call last):
  File "/home/pocoweb/cube/apps/poco/env/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/pocoweb/cube/apps/poco/env/local/lib/python2.7/site-packages/celery/app/trace.py", line 437, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/pocoweb/cube/apps/poco/poco/apps/apis/recommender/tasks.py", line 69, in write_log
    _write_log(site_id, content)
  File "/home/pocoweb/cube/apps/poco/poco/apps/apis/recommender/tasks.py", line 57, in _write_log
    mongo_client.writeLogToMongo(site_id, content)
  File "/home/pocoweb/cube/apps/poco/poco/common/mongo_client.py", line 506, in writeLogToMongo
    c_raw_logs.insert(content)
  File "/home/pocoweb/cube/apps/poco/env/local/lib/python2.7/site-packages/pymongo/collection.py", line 402, in insert
    gen(), check_keys, self.uuid_subtype, client)
InvalidStringData: strings in documents must be valid UTF-8: 'http://www.ehaoyao.com/search/\xb6\xc5\xc0\xd9\xcb\xb9?cts=613'

The URL in it, http://www.ehaoyao.com/search/\xb6\xc5\xc0\xd9\xcb\xb9?cts=613, is not UTF-8 encoded but GBK:

# echo '00000000: b6c5 c0d9 cbb9' | xxd -r | iconv -f gbk -t utf8
杜蕾斯

I checked, and the 好药师 web pages are all UTF-8, but the refer and url values coming in through the API are GBK-encoded. So there are two possible fixes:

  1. Ask the sender of the requests to check the encoding on their side
  2. Force a conversion at the entry point, though this approach is something of a trick

I have not yet found out how the problematic URLs are being written into the log.
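If we went with option 2, the entry-point conversion could look like the sketch below. This is a minimal Python 3 illustration, not the actual poco code: the names `ensure_utf8` and `sanitize_content` are hypothetical helpers, and the real insertion point would presumably be inside `mongo_client.writeLogToMongo` before `c_raw_logs.insert(content)`.

```python
def ensure_utf8(value):
    """Return a proper unicode string; fall back to GBK for bytes
    that are not valid UTF-8 (the case seen in the failing log line)."""
    if isinstance(value, bytes):
        try:
            return value.decode('utf-8')
        except UnicodeDecodeError:
            return value.decode('gbk', errors='replace')
    return value

def sanitize_content(content):
    """Recursively normalize every string in a log document
    before it is handed to the Mongo insert."""
    if isinstance(content, dict):
        return {ensure_utf8(k): sanitize_content(v) for k, v in content.items()}
    if isinstance(content, list):
        return [sanitize_content(v) for v in content]
    return ensure_utf8(content)

# The GBK bytes from the failing log line decode to the expected query term:
raw_url = b'http://www.ehaoyao.com/search/\xb6\xc5\xc0\xd9\xcb\xb9?cts=613'
print(sanitize_content({b'url': raw_url}))
```

The downside of this trick, as noted above, is that it papers over the real question of why the client sends GBK in the first place, and `errors='replace'` would silently mangle any bytes that are neither UTF-8 nor GBK.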

aoeuidht commented 9 years ago

@brook-peng Hi 小刀, we have opened a pull request; once review is done we can deploy the fix for this problem.

brook-peng commented 9 years ago

OK, thanks.

aoeuidht commented 9 years ago

@jacobfan

After changing the temporary path, only one task produced an error during calculation:

Traceback (most recent call last):
  File "/cube/apps/poco/poco/services/batch/server.py", line 155, in _execJob
    callable()
  File "/cube/apps/poco/poco/services/batch/server.py", line 202, in do_hive_based_calculations
    connection, SITE_ID, self.getWorkDir(), backfilled_raw_logs_path)
  File "/home/pocoweb/cube/apps/poco/poco/services/batch/statistics/hive_based_calculations.py", line 633, in hive_based_calculations
    do_calculations(connection, site_id, work_dir, backfilled_raw_logs_path, client)
  File "/home/pocoweb/cube/apps/poco/poco/services/batch/statistics/hive_based_calculations.py", line 615, in do_calculations
    calc_click_rec_buy(site_id, connection, client)
  File "/home/pocoweb/cube/apps/poco/poco/services/batch/statistics/hive_based_calculations.py", line 52, in wrapped_function
    result = function(*arg, **kws)
  File "/home/pocoweb/cube/apps/poco/poco/services/batch/statistics/hive_based_calculations.py", line 453, in calc_click_rec_buy
    client.execute("INSERT OVERWRITE TABLE rec_buy "
  File "pylib/hive_service/ThriftHive.py", line 63, in execute
    self.recv_execute()
  File "pylib/hive_service/ThriftHive.py", line 84, in recv_execute
    raise result.ex
HiveServerException: HiveServerException(errorCode=9, message='Query returned non-zero code: 9, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask', SQLState='08S01')
2014-12-14 03:47:43|cd23526b-7e5b-4da5-b8a9-f6aefe4f2fc9|INFO|root|FlowEnd: HiveBasedStatisticsFlow

I googled that error code, 08S01; some reports point to network problems, others to memory configuration (http://gtzygtzygtzy.blog.163.com/blog/static/124198088201364734321/). Judging from our logs, a broken pipe occurred, so it may be a network issue:

www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:576:2014-12-14 03:47:43,160 INFO  exec.ScriptOperator (ScriptOperator.java:processLine(518)) - ErrorStreamProcessor calling reporter.progress()
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:578:2014-12-14 03:47:43,163 ERROR exec.ScriptOperator (ScriptOperator.java:processOp(343)) - Error in writing to script: Broken pipe
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:579:2014-12-14 03:47:43,163 INFO  exec.ScriptOperator (ScriptOperator.java:run(565)) - StreamThread ErrorProcessor done
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:580:2014-12-14 03:47:43,164 INFO  exec.ScriptOperator (ScriptOperator.java:displayBrokenPipeInfo(230)) - The script did not consume all input data. This is considered as an error.
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:582:2014-12-14 03:47:43,172 FATAL ExecReducer (ExecReducer.java:reduce(267)) - org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"75047","reducesinkkey1":1.418286216575E9},"value":{"_col0":"75047","_col1":1.418286216575E9,"_col2":"0","_col3":"V","_col4":"ViewItem","_col5":"","_col6":"2348","_col7":0.0,"_col8":0,"_col9":"0"},"alias":0}
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:611:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"75047","reducesinkkey1":1.418286216575E9},"value":{"_col0":"75047","_col1":1.418286216575E9,"_col2":"0","_col3":"V","_col4":"ViewItem","_col5":"","_col6":"2348","_col7":0.0,"_col8":0,"_col9":"0"},"alias":0}
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:616:Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"75047","reducesinkkey1":1.418286216575E9},"value":{"_col0":"75047","_col1":1.418286216575E9,"_col2":"0","_col3":"V","_col4":"ViewItem","_col5":"","_col6":"2348","_col7":0.0,"_col8":0,"_col9":"0"},"alias":0}
www-data_20141214034747_d478d3a4-2287-4022-b0ae-5c49dc20bc4c.log:641:2014-12-14 03:47:43,514 ERROR exec.ExecDriver (SessionState.java:printError(343)) - Ended Job = job_local_0001 with errors

I am not sure whether this happens on every run; I will watch it for another day.

sunliwen commented 9 years ago

Status: RUNNING(16 hours 23 seconds since the beginning of calculation.)

The server time is wrong: at the instant the batch calculation starts there is already a 16-hour discrepancy.

aoeuidht commented 9 years ago

@sunliwen This morning it showed 7 hours of computation, so what you saw was probably a real 16-hour run.

@jacobfan I looked at the records at the points where batch raises exceptions. All kinds of records can trigger it, but what they share is that their req_id is always 0; I am not sure whether that is related.

[pocoweb@en83:/cube/services/hive/logs/hive-server]$ pwd
/cube/services/hive/logs/hive-server
[pocoweb@en83:/cube/services/hive/logs/hive-server]$ grep ' ExecReducer (ExecReducer.java:reduce(267)) ' www-data_201412*

sunliwen commented 9 years ago

@lisztli Status: RUNNING(16 hours 0 seconds since the beginning of calculation.)

I just triggered another calculation, and again it displays 16 hours.

jacobfan commented 9 years ago

@sunliwen

The current error in /cube/log/supervisord_batch_server.log is:

HiveServerException: HiveServerException(errorCode=9, message='Query returned non-zero code: 9, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask', SQLState='08S01')

/cube/services/hive/logs/hive-server/www-data_20141222054949_650ba98d-4c0d-47b3-a6b2-029fe485ede5.log

The Hive exception can be seen at the very bottom of this file.

sunliwen commented 9 years ago

The 16-hour display is because TIME_ZONE in the Django settings is UTC while the system is configured as 'Asia/Shanghai'. A pretty basic mistake.
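This mismatch would also explain the literal "16 hours N seconds" figure, not just an 8-hour offset: subtracting a naive Asia/Shanghai start time from a naive UTC "now" gives roughly minus 8 hours, which Python normalizes to `timedelta(days=-1, seconds=57600)`, and 57600 seconds is exactly 16 hours. A minimal sketch of that failure mode (my reconstruction for illustration, not the actual poco code):

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # Asia/Shanghai, no DST

# Start time taken from the system clock (Asia/Shanghai), stored naive:
start = datetime.now(CST).replace(tzinfo=None)
# "Now" computed under TIME_ZONE = 'UTC', also naive:
now = datetime.now(timezone.utc).replace(tzinfo=None)

elapsed = now - start  # about -8 hours instead of ~0
# Python normalizes the negative delta to days=-1, seconds=57600,
# so code that reads only the hours/seconds fields reports
# "16 hours N seconds" the moment the calculation begins.
print(elapsed.days, elapsed.seconds)
```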

sunliwen commented 9 years ago

I made the change on the server, and test calculations over both 1-day and 2-day windows completed successfully.

I have now set the calculation window to 100 days, which may take several hours to produce a result. If it succeeds by daybreak, this issue can be closed. Let's wait for the result.

brook-peng commented 9 years ago

OK, let's see how it works. Thanks for the effort.

sunliwen commented 9 years ago

When calculating over 100 days, only one calculation step failed. #105 records the batch problems seen during the 100-day calculation. The one remaining failed calculation still needs to be handled.