nathell / soupscraper

Gimme, I've got a dying soup

Parser error - unexpected character in JSON #11

Open soulflyman opened 4 years ago

soulflyman commented 4 years ago

Hi, I don't know whether this is a problem with soupscraper or with the skyscraper framework. Maybe this information will help improve the framework.

When scraping starwars.soup.io I get the following error.

Console output:

fetchedException in thread "main" clojure.lang.ExceptionInfo: Handler threw an error {:since "568380106", :date #inst "2015-04-16T00:00:00.000-00:00", :content "LIVE NOW!&nbsp;&nbsp;&nbsp; \n<a href=\"https://www.youtube.com/watch?v=4UY64GfyovE\">https://www.youtube.com/watch?v=4UY64GfyovE</a>", :date-from-header "2015-04-16", :type :video, :pages-only nil, :skyscraper.core/response {:body #object["[B" 0x1789f152 "[B@1789f152"], :headers {"Vary" "Accept-Encoding", "Link" "<https://www.soup.io/wp-json/>; rel=\"https://api.w.org/\", <https://www.soup.io/>; rel=shortlink", "CF-Cache-Status" "MISS", "Transfer-Encoding" "chunked", "Date" "Fri, 24 Jul 2020 10:17:45 GMT", "cf-request-id" "0421ed3db40000dfcf2fbb0200000001", "Expect-CT" "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"", "Cache-Control" "max-age=2678400", "Server" "cloudflare", "Content-Type" "text/html; charset=UTF-8", "Connection" "keep-alive", "X-Powered-By" ["PHP/7.4.8" "PleskLin"], "CF-RAY" "5b7ce4a92aa0dfcf-FRA"}}, :earliest nil, :id "568367933", :skyscraper.core/cache-key "soup/starwars/tv/568367933", :skyscraper.core/stage skyscraper.core/process-handler, :skyscraper.core/current-processor {:name :tv, :process-fn #object[soupscraper.core$fn__18660 0x613e4596 "soupscraper.core$fn__18660@613e4596"], :parse-fn #object[soupscraper.core$parse_json 0x252748de "soupscraper.core$parse_json@252748de"], :cache-template "soup/:soup/tv/:id"}, :url "https://starwars.soup.io/tv/show?id=568367933", :skyscraper.traverse/call-protocol :sync, :post #object[org.jsoup.nodes.Element 0x6db9be4c "<div id=\"post568367933\" class=\"post post_video author-member  source-local f_nsfw f_nsfw f_post_nsfw f_blog_nsfw\" onmouseover=\"SOUP.Public.post_mouseover($(this), event);\" onmouseout=\"SOUP.Public.post_mouseout($(this), event);\"> \n <div class=\"meta\"> \n  <div class=\"icons\"> \n   <div class=\"icon type\">\n    <a 
href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\" title=\"LIVE NOW!&nbsp;&nbsp;&nbsp; https://www.youtube.com/watch?v=4UY64GfyovE\"></a>\n   </div> \n   <div class=\"icon author\"> \n    <span class=\"user_container  user890628\" onmouseover=\"if(window.SOUP) SOUP.Public.bubble(this, { 'classname': 'user' })\"><a class=\"url\" href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\"><img src=\"https://asset.soup.io/asset/2968/0814_0de7_32-square.jpeg\" alt=\"Dennkost\" title=\"Dennkost\" class=\"photo fn\" width=\"32\" height=\"32\"></a>\n     <!--shared _user_bubble.html --> \n     <div class=\"hidden bubble\"> \n      <h4><a href=\"https://Dennkost.soup.io\">Dennkost</a></h4> \n      <div class=\"attribution\">\n       <a href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\">over 5 years ago</a>\n      </div> \n     </div> </span> \n   </div> \n  </div> \n </div> \n <div class=\"content-container\"> \n  <!--soup _post_content.html --> \n  <!--soup _post_full.html --> \n  <div class=\"content \"> \n   <div class=\"embed\"> \n   </div> \n   <a class=\"tv_promo\" href=\"/tv#568367933/LIVE-NOW-https-www-youtube-com-watch\">Play fullscreen</a> \n   <div class=\"body\">\n     LIVE NOW!&nbsp;&nbsp;&nbsp; \n    <a href=\"https://www.youtube.com/watch?v=4UY64GfyovE\">https://www.youtube.com/watch?v=4UY64GfyovE</a> \n   </div> \n  </div> \n  <!--soup _post_actions.html --> \n  <ul class=\"actionbar\"> \n   <li class=\"first permalink\"><a href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\" title=\"Permalink\">#</a></li> \n   <li class=\"repost\"><span class=\"inner\">&nbsp;</span></li> \n   <li class=\"last react\"><a href=\"#nojs\" onclick=\"SOUP.Public.open_reaction($(this), 'https://www.soup.io/remote/reaction/frame?parent_id=568367933&amp;origin_host=' + location.host); return false\">React</a></li> \n  </ul> \n </div> \n</div>"], 
:processor :tv, :reactions [], :soup "starwars", :reposts [], :skyscraper.traverse/handler skyscraper.core/sync-handler, :num-on-page -3}
    at skyscraper.traverse$throw_handler_error_BANG_.invokeStatic(traverse.clj:250)
    at skyscraper.traverse$throw_handler_error_BANG_.invoke(traverse.clj:247)
    at skyscraper.traverse$wait_BANG_.invokeStatic(traverse.clj:258)
    at skyscraper.traverse$wait_BANG_.invoke(traverse.clj:254)
    at skyscraper.traverse$traverse_BANG_.invokeStatic(traverse.clj:275)
    at skyscraper.traverse$traverse_BANG_.invoke(traverse.clj:270)
    at skyscraper.core$scrape_BANG_.invokeStatic(core.clj:574)
    at skyscraper.core$scrape_BANG_.doInvoke(core.clj:566)
    at clojure.lang.RestFn.applyTo(RestFn.java:139)
    at clojure.core$apply.invokeStatic(core.clj:665)
    at clojure.core$apply.invoke(core.clj:660)
    at soupscraper.core$scrape_BANG_.invokeStatic(core.clj:240)
    at soupscraper.core$scrape_BANG_.invoke(core.clj:239)
    at soupscraper.core$download_soup.invokeStatic(core.clj:336)
    at soupscraper.core$download_soup.invoke(core.clj:328)
    at soupscraper.core$_main.invokeStatic(core.clj:356)
    at soupscraper.core$_main.doInvoke(core.clj:353)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at soupscraper.core.main(Unknown Source)
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (StringReader); line: 1, column: 2]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:712)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:637)
    at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1917)
    at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:773)
    at cheshire.parse$parse.invokeStatic(parse.clj:90)
    at cheshire.parse$parse.invoke(parse.clj:88)
    at cheshire.core$parse_string.invokeStatic(core.clj:208)
    at cheshire.core$parse_string.invoke(core.clj:194)
    at cheshire.core$parse_string.invokeStatic(core.clj:205)
    at cheshire.core$parse_string.invoke(core.clj:194)
    at soupscraper.core$parse_json.invokeStatic(core.clj:155)
    at soupscraper.core$parse_json.invoke(core.clj:154)
    at skyscraper.core$process_handler.invokeStatic(core.clj:434)
    at skyscraper.core$process_handler.invoke(core.clj:428)
    at clojure.lang.Var.invoke(Var.java:388)
    at skyscraper.core$sync_handler.invokeStatic(core.clj:461)
    at skyscraper.core$sync_handler.invoke(core.clj:457)
    at clojure.lang.Var.invoke(Var.java:388)
    at skyscraper.traverse$worker$fn__18380$fn__18389.invoke(traverse.clj:201)
    at skyscraper.traverse$worker$fn__18380.invoke(traverse.clj:201)
    at clojure.core.async$thread_call$fn__6604.invoke(async.clj:484)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

In the log it looks like this:

20-07-24 10:48:37 edward-teach INFO [skyscraper.core:389] - [download] Downloading http://asset.soup.io/asset/14416/2980_43ba.jpeg
20-07-24 10:48:39 edward-teach INFO [skyscraper.core:389] - [download] Downloading http://asset.soup.io/asset/11463/7069_735e.png
20-07-24 10:48:39 edward-teach ERROR [skyscraper.traverse:168] - [worker 0] Handler threw an error
                                                   java.lang.Thread.run                 Thread.java:  834
                     java.util.concurrent.ThreadPoolExecutor$Worker.run     ThreadPoolExecutor.java:  628
                      java.util.concurrent.ThreadPoolExecutor.runWorker     ThreadPoolExecutor.java: 1128
                                                                    ...                                  
                                      clojure.core.async/thread-call/fn                   async.clj:  484
                                          skyscraper.traverse/worker/fn                traverse.clj:  201
                                       skyscraper.traverse/worker/fn/fn                traverse.clj:  201
                                                                    ...                                  
                                           skyscraper.core/sync-handler                    core.clj:  461
                                                                    ...                                  
                                        skyscraper.core/process-handler                    core.clj:  434
                                            soupscraper.core/parse-json                    core.clj:  155
                                             cheshire.core/parse-string                    core.clj:  205
                                             cheshire.core/parse-string                    core.clj:  208
                                                   cheshire.parse/parse                   parse.clj:   90
        com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken  ReaderBasedJsonParser.java:  773
  com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue  ReaderBasedJsonParser.java: 1917
com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar      ParserMinimalBase.java:  637
         com.fasterxml.jackson.core.base.ParserMinimalBase._reportError      ParserMinimalBase.java:  712
                  com.fasterxml.jackson.core.JsonParser._constructError             JsonParser.java: 1840
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
                                                at [Source: (StringReader); line: 1, column: 2]
           location: #object[com.fasterxml.jackson.core.JsonLocation 0x4a00395 "[Source: (StringReader); line: 1, column: 2]"]
    originalMessage: "Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')"
          processor: #object[com.fasterxml.jackson.core.json.ReaderBasedJsonParser 0x5f97278d "com.fasterxml.jackson.core.json.ReaderBasedJsonParser@5f97278d"]

And here is the file that causes the trouble: 568380106.txt (I had to add .txt to upload it to GitHub).
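For what it's worth, the trace above shows parse-json receiving a body whose first character is '<', and the response headers report Content-Type text/html, so the /tv/show request apparently returned an HTML page rather than JSON. Below is a minimal sketch of a defensive parse that would tolerate this. It is written in Python only to keep it self-contained and is not soupscraper's or skyscraper's actual code; the function name `parse_json_or_none` and the sample bodies are made up for illustration.

```python
import json
from typing import Any, Optional


def parse_json_or_none(body: bytes) -> Optional[Any]:
    """Return the decoded JSON value, or None if the body is not JSON.

    Hypothetical helper for illustration; not part of soupscraper.
    """
    text = body.decode("utf-8", errors="replace").lstrip()
    # An HTML page starts with '<', the same character the JSON parser
    # rejected in the trace above, so bail out before parsing.
    if text.startswith("<"):
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Any other malformed body is also treated as "no JSON available".
        return None


# Made-up sample bodies: an HTML page and a small JSON document.
html_body = b"<!DOCTYPE html><html><body>Not found</body></html>"
json_body = b'{"id": "568367933", "type": "video"}'

print(parse_json_or_none(html_body))
print(parse_json_or_none(json_body))
```

With a guard like this, the caller could log and skip the affected post instead of aborting the whole scrape. Whether skipping is the right behaviour for soupscraper is an assumption on my part.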

fadenb commented 4 years ago

I can confirm the same issue when trying to back up soup.gaf.io.