lidoapps / jhc-code

Automatically exported from code.google.com/p/jhc-code

Two newly found URIs that cause exceptions #7

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
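The new crawler has turned up two more exceptions. Processing either of the two URIs below throws an exception. We need to add new unit tests, and the source code may need to change.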

2012-06-01T16:38:55.938Z    -5      39186 
http://item.taobao.com/item.htm?id=14788502928&spm=null LLR 
http://ershou.taobao.com/item.htm?id=14788502928 text/html #007 
20120601163855815+95 sha1:FQRMYQDUGAPCQUMD6CJ6XQPYSL2TTFZT - 
err=java.lang.NullPointerException
 java.lang.NullPointerException
    at cn.jhc.heritrix.writer.extractor.TaobaoItemExtractor.extractMarketPrice(TaobaoItemExtractor.java:38)
    at cn.jhc.heritrix.writer.extractor.ExtractorFacade.extractItem(ExtractorFacade.java:21)
    at cn.jhc.heritrix.writer.JdbcWriterProcessor.innerProcess(JdbcWriterProcessor.java:71)
    at org.archive.crawler.framework.Processor.process(Processor.java:109)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:306)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:154)
2012-06-01T16:39:00.957Z    -5      39136 
http://item.taobao.com/item.htm?id=14852929673&spm=null LLR 
http://ershou.taobao.com/item.htm?id=14852929673 text/html #003 
20120601163900850+79 sha1:FMZHV3RCGNT5DPGYIBHC7XYSCJYAOG2N - 
err=java.lang.NullPointerException
 java.lang.NullPointerException
    at cn.jhc.heritrix.writer.extractor.TaobaoItemExtractor.extractMarketPrice(TaobaoItemExtractor.java:38)
    at cn.jhc.heritrix.writer.extractor.ExtractorFacade.extractItem(ExtractorFacade.java:21)
    at cn.jhc.heritrix.writer.JdbcWriterProcessor.innerProcess(JdbcWriterProcessor.java:71)
    at org.archive.crawler.framework.Processor.process(Processor.java:109)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:306)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:154)

Original issue reported on code.google.com by luyanfe...@gmail.com on 1 Jun 2012 at 11:32

GoogleCodeExporter commented 9 years ago
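These are auction items from the second-hand market. They appear to have no shop information, and their page format differs from that of normal items. Ideally the source code should filter these auction items out. Judging by the URL alone, the only visible difference seems to be that spm is empty. I don't know what SPM means.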

Original comment by jjl7855...@gmail.com on 3 Jun 2012 at 2:27

GoogleCodeExporter commented 9 years ago
SPM is probably a code string that every shop should have. If spm is empty, the seller has no shop.
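
As a rough illustration of the spm guess above, the check could look like the sketch below. It is only a heuristic built on that guess: the names `SpmCheck` and `hasEmptySpm` are hypothetical and not part of jhc-code, and treating a missing parameter or the literal string `null` (as it appears in the logged URIs) as "empty" is an assumption.

```java
import java.net.URI;

// Hypothetical helper, not part of jhc-code.
class SpmCheck {
    /**
     * Returns true when the URI carries no usable spm query parameter:
     * the parameter is missing, empty, or the literal string "null".
     */
    static boolean hasEmptySpm(String uri) {
        String query = URI.create(uri).getRawQuery();
        if (query == null) {
            return true; // no query string at all
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            if (name.equals("spm")) {
                return value.isEmpty() || value.equals("null");
            }
        }
        return true; // query present, but no spm parameter
    }
}
```

Many ordinary item URLs may also lack an spm value, so a check like this would probably produce false positives if used alone.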

Original comment by jjl7855...@gmail.com on 3 Jun 2012 at 2:34

GoogleCodeExporter commented 9 years ago
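The cause of this error is probably that the overall parameter format differs from that of ordinary items, especially the price: the price of an auction item is not fixed. In this situation it would be best to filter out auction items, but how to filter them is the open question.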

Original comment by jjl7855...@gmail.com on 3 Jun 2012 at 2:40

GoogleCodeExporter commented 9 years ago
If the page contains an input element whose CSS class is paimai-button, the item is an auction item and can be filtered out.
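
A minimal sketch of that check, assuming the raw page HTML is available as a string. `PaimaiButtonCheck` and `hasPaimaiButton` are hypothetical names, and the regex scan is a stand-in for whatever HTML parsing the project actually uses, which this thread does not show.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jhc-code.
class PaimaiButtonCheck {
    // Any <input ...> tag, case-insensitive.
    private static final Pattern INPUT_TAG =
            Pattern.compile("<input\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // A quoted class attribute inside a single tag.
    private static final Pattern CLASS_ATTR =
            Pattern.compile("\\sclass\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns true if some input element lists paimai-button among its classes. */
    static boolean hasPaimaiButton(String html) {
        if (html == null) {
            return false;
        }
        Matcher tag = INPUT_TAG.matcher(html);
        while (tag.find()) {
            Matcher cls = CLASS_ATTR.matcher(tag.group());
            if (cls.find()) {
                // Compare whole class tokens so "my-paimai-button" does not match.
                for (String token : cls.group(1).trim().split("\\s+")) {
                    if (token.equals("paimai-button")) {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}
```

A regex scan is fragile against unusual markup such as unquoted attributes, so a real HTML parser would likely be the sturdier choice.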

Original comment by jjl7855...@gmail.com on 3 Jun 2012 at 3:10

GoogleCodeExporter commented 9 years ago
Alternatively, if the page contains a div with id="tbid-offer" and class="tbid-key", that alone tells you the item is an auction item.
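
The div-based test could be sketched as follows, again assuming the page HTML is at hand as a string. `TbidOfferCheck`, `attributes`, and `hasTbidOffer` are hypothetical names, and the attribute parsing only handles quoted values.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jhc-code.
class TbidOfferCheck {
    // Captures the attribute text of each opening <div ...> tag.
    private static final Pattern DIV_TAG =
            Pattern.compile("<div\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // name="value" or name='value' pairs.
    private static final Pattern ATTR =
            Pattern.compile("([\\w-]+)\\s*=\\s*[\"']([^\"']*)[\"']");

    /** Collects quoted attributes of one tag into a map keyed by lower-case name. */
    static Map<String, String> attributes(String tagBody) {
        Map<String, String> attrs = new HashMap<>();
        Matcher m = ATTR.matcher(tagBody);
        while (m.find()) {
            attrs.put(m.group(1).toLowerCase(), m.group(2).trim());
        }
        return attrs;
    }

    /** Returns true if some div has id tbid-offer and the class token tbid-key. */
    static boolean hasTbidOffer(String html) {
        if (html == null) {
            return false;
        }
        Matcher div = DIV_TAG.matcher(html);
        while (div.find()) {
            Map<String, String> attrs = attributes(div.group(1));
            String classes = attrs.getOrDefault("class", "");
            boolean hasKeyClass = Arrays.asList(classes.split("\\s+")).contains("tbid-key");
            if ("tbid-offer".equals(attrs.get("id")) && hasKeyClass) {
                return true;
            }
        }
        return false;
    }
}
```

Reading the attributes into a map keeps the check independent of the order in which id and class appear in the tag.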

Original comment by jjl7855...@gmail.com on 3 Jun 2012 at 3:17

GoogleCodeExporter commented 9 years ago
The extraction code needs to be modified to make this check and simply discard auction items. Let's try that first.
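
One possible shape for that change is a guard placed in front of the extractor, sketched below. The real signatures of `TaobaoItemExtractor` and `ExtractorFacade.extractItem` are not shown in this thread, so `AuctionGuard`, `extractUnlessAuction`, and both function parameters are hypothetical stand-ins.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical wrapper, not part of jhc-code.
class AuctionGuard {
    /**
     * Runs the extractor only when the page is not recognised as an auction.
     * Returns an empty Optional for null input, for auction pages, and when the
     * extractor itself yields null, so the caller can drop the URI instead of
     * hitting a NullPointerException later.
     */
    static <T> Optional<T> extractUnlessAuction(String html,
                                                Predicate<String> isAuction,
                                                Function<String, T> extractor) {
        if (html == null || isAuction.test(html)) {
            return Optional.empty();
        }
        return Optional.ofNullable(extractor.apply(html));
    }
}
```

Passing the auction test in as a predicate means either markup check proposed in this thread could be plugged in without changing the guard.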

Original comment by luyanfe...@gmail.com on 4 Jun 2012 at 2:07

GoogleCodeExporter commented 9 years ago
Teacher, where is the code that extracts the item URLs? I couldn't find it.

Original comment by jjl7855...@gmail.com on 4 Jun 2012 at 2:18