platonai / exotic-amazon

A complete solution to crawl Amazon at scale, completely and accurately.

Found 'Not a data file' error #15

Closed sskmtm closed 1 year ago

sskmtm commented 1 year ago

I ran into a baffling error.

While running a task that crawls Amazon profile pages, the following error occurred.

I can't tell which file it is complaining about: 'Not a data file'

15:26:13.984 [r-worker-2] INFO  a.p.pulsar.common.ResourceLoader - Find resource regex-normalize.xml | jar:file:/Users/kust/.m2/repository/ai/platon/pulsar/pulsar-filter/1.9.9/pulsar-filter-1.9.9.jar!/regex-normalize.xml
15:26:14.314 [r-worker-2] INFO  a.p.pulsar.common.ResourceLoader - Find resource protocol-plugins.txt | jar:file:/Users/kust/.m2/repository/ai/platon/pulsar/pulsar-protocol/1.9.9/pulsar-protocol-1.9.9.jar!/protocol-plugins.txt
15:26:14.742 [r-worker-2] INFO  a.p.p.crawl.protocol.ProtocolFactory - Supported protocols: crowd, browser
15:26:14.769 [r-worker-2] INFO  a.p.pulsar.common.ResourceLoader - Find resource prefix-urlfilter.txt | jar:file:/Users/kust/.m2/repository/ai/platon/pulsar/pulsar-filter/1.9.9/pulsar-filter-1.9.9.jar!/prefix-urlfilter.txt
15:26:14.773 [r-worker-2] INFO  a.p.pulsar.common.ResourceLoader - Find resource suffix-urlfilter.txt | jar:file:/Users/kust/.m2/repository/ai/platon/pulsar/pulsar-filter/1.9.9/pulsar-filter-1.9.9.jar!/suffix-urlfilter.txt
15:26:14.779 [r-worker-2] INFO  a.p.pulsar.common.ResourceLoader - Find resource regex-urlfilter.txt | jar:file:/Users/kust/.m2/repository/ai/platon/pulsar/pulsar-filter/1.9.9/pulsar-filter-1.9.9.jar!/regex-urlfilter.txt
15:26:14.783 [r-worker-2] INFO  a.p.pulsar.common.ResourceLoader - Find resource automaton-urlfilter.txt | jar:file:/Users/kust/.m2/repository/ai/platon/pulsar/pulsar-filter/1.9.9/pulsar-filter-1.9.9.jar!/automaton-urlfilter.txt
15:26:14.881 [r-worker-2] INFO  a.p.p.c.parse.html.PrimerHtmlParser - className: PrimerHtmlParser defaultCharEncoding: utf-8 parseFilters: AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor, AmazonJdbcSinkSQLExtractor
15:26:14.908 [r-worker-2] INFO  a.p.pulsar.common.ResourceLoader - Find resource parse-plugins.xml | jar:file:/Users/kust/.m2/repository/ai/platon/pulsar/pulsar-parse/1.9.9/pulsar-parse-1.9.9.jar!/parse-plugins.xml
15:26:14.912 [r-worker-2] INFO  a.p.p.c.parse.html.PrimerHtmlParser - className: PrimerHtmlParser defaultCharEncoding: utf-8
15:26:14.913 [r-worker-2] INFO  a.p.pulsar.crawl.parse.ParserFactory - Active parsers: 
          ----------Params Table----------               
                     Name   Value                    
                text/html: ai.platon.pulsar.crawl.parse.html.PrimerHtmlParser
 application/x-javascript: 
                 text/xml: ai.platon.pulsar.parse.tika.TikaParser
           text/aspdotnet: ai.platon.pulsar.crawl.parse.html.PrimerHtmlParser
      application/rss+xml: ai.platon.pulsar.parse.tika.TikaParser
                        *: ai.platon.pulsar.parse.tika.TikaParser
    application/xhtml+xml: ai.platon.pulsar.crawl.parse.html.PrimerHtmlParser

15:26:14.918 [r-worker-2] INFO  a.p.pulsar.crawl.parse.PageParser - maxParseTime: PT1M maxParsedLinks: 200 groupMode: BY_HOST ignoreExternalLinks: false maxUrlLength: 1024 defaultAnchorLenMin: 2 defaultAnchorLenMax: 200
15:26:14.963 [r-worker-2] INFO  a.p.pulsar.persist.gora.GoraStorage - Backend data store: FileBackendPageStore realSchema: FileBackendPageStore
15:26:14.964 [r-worker-2] INFO  a.p.p.p.AutoDetectStorageProvider - Storage is created: class ai.platon.pulsar.persist.gora.FileBackendPageStore realSchema: FileBackendPageStore
15:26:15.101 [r-worker-2] WARN  a.p.pulsar.crawl.StreamingCrawler - Unexpected exception
java.io.IOException: Not a data file.
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:102)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:89)
    at ai.platon.pulsar.persist.gora.FileBackendPageStore.readAvro(FileBackendPageStore.kt:104)
    at ai.platon.pulsar.persist.gora.FileBackendPageStore.readAvro(FileBackendPageStore.kt:88)
    at ai.platon.pulsar.persist.gora.FileBackendPageStore.get(FileBackendPageStore.kt:40)
    at ai.platon.pulsar.persist.gora.FileBackendPageStore.get(FileBackendPageStore.kt:30)
    at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
    at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
    at ai.platon.pulsar.persist.WebDb.getOrNull(WebDb.kt:72)
    at ai.platon.pulsar.persist.WebDb.getOrNull$default(WebDb.kt:66)
    at ai.platon.pulsar.crawl.component.LoadComponent.createPageShell(LoadComponent.kt:264)
    at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred0(LoadComponent.kt:210)
    at ai.platon.pulsar.crawl.component.LoadComponent.loadWithRetryDeferred(LoadComponent.kt:107)
    at ai.platon.pulsar.crawl.component.LoadComponent.loadDeferred(LoadComponent.kt:94)
    at ai.platon.pulsar.context.support.AbstractPulsarContext.loadDeferred$suspendImpl(AbstractPulsarContext.kt:326)
    at ai.platon.pulsar.context.support.AbstractPulsarContext.loadDeferred(AbstractPulsarContext.kt)
    at ai.platon.pulsar.session.AbstractPulsarSession.loadAndCacheDeferred(AbstractPulsarSession.kt:207)
    at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred$suspendImpl(AbstractPulsarSession.kt:197)
    at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred(AbstractPulsarSession.kt)
    at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred$suspendImpl(AbstractPulsarSession.kt:190)
    at ai.platon.pulsar.session.AbstractPulsarSession.loadDeferred(AbstractPulsarSession.kt)
    at ai.platon.pulsar.crawl.StreamingCrawler.loadWithEventHandlers(StreamingCrawler.kt:520)
    at ai.platon.pulsar.crawl.StreamingCrawler.loadUrl(StreamingCrawler.kt:416)
    at ai.platon.pulsar.crawl.StreamingCrawler.runUrlTask(StreamingCrawler.kt:405)
    at ai.platon.pulsar.crawl.StreamingCrawler.access$runUrlTask(StreamingCrawler.kt:68)
    at ai.platon.pulsar.crawl.StreamingCrawler$runWithStatusCheck$2.invokeSuspend(StreamingCrawler.kt:379)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
Caused by: java.io.EOFException: null
    at org.apache.avro.io.BinaryDecoder$InputStreamByteSource.readRaw(BinaryDecoder.java:827)
    at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:349)
    at org.apache.avro.io.BinaryDecoder.readFixed(BinaryDecoder.java:302)
    at org.apache.avro.io.Decoder.readFixed(Decoder.java:150)
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:100)
    ... 32 common frames omitted
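
The 'Not a data file.' message is thrown by Avro's DataFileStream when a container file does not start with the Avro magic header, so an empty or truncated file in the page store triggers it (which matches the underlying EOFException above). A minimal Kotlin sketch that reproduces the same failure; the file path is hypothetical:

    import org.apache.avro.file.DataFileReader
    import org.apache.avro.generic.GenericDatumReader
    import org.apache.avro.generic.GenericRecord
    import java.io.File

    fun main() {
        // Hypothetical path; in this issue the page store lives under $HOME/.pulsar/data/store
        val file = File(System.getProperty("user.home"), ".pulsar/data/store/example.avro")

        try {
            // DataFileReader validates the Avro magic bytes first; an empty or
            // truncated file fails with java.io.IOException: Not a data file.
            DataFileReader(file, GenericDatumReader<GenericRecord>()).use { reader ->
                reader.forEach { println(it) }
            }
        } catch (e: java.io.IOException) {
            println("Corrupted or non-Avro file ${file.name}: ${e.message}")
        }
    }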
platonai commented 1 year ago

We suggest you use MongoStore instead of FileBackendPageStore.

a.p.pulsar.persist.gora.GoraStorage - Backend data store: FileBackendPageStore realSchema: FileBackendPageStore

  1. On Windows/Linux, PulsarR detects the MongoDB service automatically and uses it if it is running.
  2. If the MongoDB service is not detected automatically, add this line to force PulsarR to use MongoDB (see the sketch after this list): System.setProperty(CapabilityTypes.STORAGE_DATA_STORE_CLASS, AppConstants.MONGO_STORE_CLASS)
  3. FileBackendPageStore is designed for testing purposes, not for a production environment. If you still want to use it, there is a workaround: delete $HOME/.pulsar/data/store
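
A minimal sketch of option 2, assuming the two constants live in ai.platon.pulsar.common.config and that the property must be set before any Pulsar context or session is created:

    import ai.platon.pulsar.common.config.AppConstants
    import ai.platon.pulsar.common.config.CapabilityTypes

    fun main() {
        // Assumed import location for CapabilityTypes/AppConstants; set the property
        // before any Pulsar context/session is created, otherwise the storage backend
        // has already been auto-detected.
        System.setProperty(CapabilityTypes.STORAGE_DATA_STORE_CLASS, AppConstants.MONGO_STORE_CLASS)

        // ... create the session and submit the crawl tasks as usual
    }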
sskmtm commented 1 year ago

OK, the problem has been solved by deleting $HOME/.pulsar/data/store.
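
For reference, the same cleanup can be scripted; a minimal Kotlin sketch, assuming the default store location from the comment above:

    import java.io.File

    fun main() {
        // Default FileBackendPageStore location, per the maintainer's comment above
        val store = File(System.getProperty("user.home"), ".pulsar/data/store")
        val cleared = store.deleteRecursively()
        println(if (cleared) "Cleared ${store.path}; a fresh store is created on the next run" else "Could not clear ${store.path}")
    }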