Open soulflyman opened 4 years ago
Hi, I don't know whether this is a problem with soupscraper or with the skyscraper framework. Maybe this information helps to improve the framework.
When scraping starwars.soup.io I get the following error.
console output:
fetchedException in thread "main" clojure.lang.ExceptionInfo: Handler threw an error {:since "568380106", :date #inst "2015-04-16T00:00:00.000-00:00", :content "LIVE NOW! \n<a href=\"https://www.youtube.com/watch?v=4UY64GfyovE\">https://www.youtube.com/watch?v=4UY64GfyovE</a>", :date-from-header "2015-04-16", :type :video, :pages-only nil, :skyscraper.core/response {:body #object["[B" 0x1789f152 "[B@1789f152"], :headers {"Vary" "Accept-Encoding", "Link" "<https://www.soup.io/wp-json/>; rel=\"https://api.w.org/\", <https://www.soup.io/>; rel=shortlink", "CF-Cache-Status" "MISS", "Transfer-Encoding" "chunked", "Date" "Fri, 24 Jul 2020 10:17:45 GMT", "cf-request-id" "0421ed3db40000dfcf2fbb0200000001", "Expect-CT" "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"", "Cache-Control" "max-age=2678400", "Server" "cloudflare", "Content-Type" "text/html; charset=UTF-8", "Connection" "keep-alive", "X-Powered-By" ["PHP/7.4.8" "PleskLin"], "CF-RAY" "5b7ce4a92aa0dfcf-FRA"}}, :earliest nil, :id "568367933", :skyscraper.core/cache-key "soup/starwars/tv/568367933", :skyscraper.core/stage skyscraper.core/process-handler, :skyscraper.core/current-processor {:name :tv, :process-fn #object[soupscraper.core$fn__18660 0x613e4596 "soupscraper.core$fn__18660@613e4596"], :parse-fn #object[soupscraper.core$parse_json 0x252748de "soupscraper.core$parse_json@252748de"], :cache-template "soup/:soup/tv/:id"}, :url "https://starwars.soup.io/tv/show?id=568367933", :skyscraper.traverse/call-protocol :sync, :post #object[org.jsoup.nodes.Element 0x6db9be4c "<div id=\"post568367933\" class=\"post post_video author-member source-local f_nsfw f_nsfw f_post_nsfw f_blog_nsfw\" onmouseover=\"SOUP.Public.post_mouseover($(this), event);\" onmouseout=\"SOUP.Public.post_mouseout($(this), event);\"> \n <div class=\"meta\"> \n <div class=\"icons\"> \n <div class=\"icon type\">\n <a href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\" 
title=\"LIVE NOW! https://www.youtube.com/watch?v=4UY64GfyovE\"></a>\n </div> \n <div class=\"icon author\"> \n <span class=\"user_container user890628\" onmouseover=\"if(window.SOUP) SOUP.Public.bubble(this, { 'classname': 'user' })\"><a class=\"url\" href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\"><img src=\"https://asset.soup.io/asset/2968/0814_0de7_32-square.jpeg\" alt=\"Dennkost\" title=\"Dennkost\" class=\"photo fn\" width=\"32\" height=\"32\"></a>\n <!--shared _user_bubble.html --> \n <div class=\"hidden bubble\"> \n <h4><a href=\"https://Dennkost.soup.io\">Dennkost</a></h4> \n <div class=\"attribution\">\n <a href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\">over 5 years ago</a>\n </div> \n </div> </span> \n </div> \n </div> \n </div> \n <div class=\"content-container\"> \n <!--soup _post_content.html --> \n <!--soup _post_full.html --> \n <div class=\"content \"> \n <div class=\"embed\"> \n </div> \n <a class=\"tv_promo\" href=\"/tv#568367933/LIVE-NOW-https-www-youtube-com-watch\">Play fullscreen</a> \n <div class=\"body\">\n LIVE NOW! 
\n <a href=\"https://www.youtube.com/watch?v=4UY64GfyovE\">https://www.youtube.com/watch?v=4UY64GfyovE</a> \n </div> \n </div> \n <!--soup _post_actions.html --> \n <ul class=\"actionbar\"> \n <li class=\"first permalink\"><a href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\" title=\"Permalink\">#</a></li> \n <li class=\"repost\"><span class=\"inner\"> </span></li> \n <li class=\"last react\"><a href=\"#nojs\" onclick=\"SOUP.Public.open_reaction($(this), 'https://www.soup.io/remote/reaction/frame?parent_id=568367933&origin_host=' + location.host); return false\">React</a></li> \n </ul> \n </div> \n</div>"], :processor :tv, :reactions [], :soup "starwars", :reposts [], :skyscraper.traverse/handler skyscraper.core/sync-handler, :num-on-page -3} at skyscraper.traverse$throw_handler_error_BANG_.invokeStatic(traverse.clj:250) at skyscraper.traverse$throw_handler_error_BANG_.invoke(traverse.clj:247) at skyscraper.traverse$wait_BANG_.invokeStatic(traverse.clj:258) at skyscraper.traverse$wait_BANG_.invoke(traverse.clj:254) at skyscraper.traverse$traverse_BANG_.invokeStatic(traverse.clj:275) at skyscraper.traverse$traverse_BANG_.invoke(traverse.clj:270) at skyscraper.core$scrape_BANG_.invokeStatic(core.clj:574) at skyscraper.core$scrape_BANG_.doInvoke(core.clj:566) at clojure.lang.RestFn.applyTo(RestFn.java:139) at clojure.core$apply.invokeStatic(core.clj:665) at clojure.core$apply.invoke(core.clj:660) at soupscraper.core$scrape_BANG_.invokeStatic(core.clj:240) at soupscraper.core$scrape_BANG_.invoke(core.clj:239) at soupscraper.core$download_soup.invokeStatic(core.clj:336) at soupscraper.core$download_soup.invoke(core.clj:328) at soupscraper.core$_main.invokeStatic(core.clj:356) at soupscraper.core$_main.doInvoke(core.clj:353) at clojure.lang.RestFn.applyTo(RestFn.java:137) at soupscraper.core.main(Unknown Source) Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('<' (code 60)): expected a valid value (JSON 
String, Number, Array, Object or token 'null', 'true' or 'false') at [Source: (StringReader); line: 1, column: 2] at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840) at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:712) at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:637) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1917) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:773) at cheshire.parse$parse.invokeStatic(parse.clj:90) at cheshire.parse$parse.invoke(parse.clj:88) at cheshire.core$parse_string.invokeStatic(core.clj:208) at cheshire.core$parse_string.invoke(core.clj:194) at cheshire.core$parse_string.invokeStatic(core.clj:205) at cheshire.core$parse_string.invoke(core.clj:194) at soupscraper.core$parse_json.invokeStatic(core.clj:155) at soupscraper.core$parse_json.invoke(core.clj:154) at skyscraper.core$process_handler.invokeStatic(core.clj:434) at skyscraper.core$process_handler.invoke(core.clj:428) at clojure.lang.Var.invoke(Var.java:388) at skyscraper.core$sync_handler.invokeStatic(core.clj:461) at skyscraper.core$sync_handler.invoke(core.clj:457) at clojure.lang.Var.invoke(Var.java:388) at skyscraper.traverse$worker$fn__18380$fn__18389.invoke(traverse.clj:201) at skyscraper.traverse$worker$fn__18380.invoke(traverse.clj:201) at clojure.core.async$thread_call$fn__6604.invoke(async.clj:484) at clojure.lang.AFn.run(AFn.java:22) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834)
In the log it looks like this:
20-07-24 10:48:37 edward-teach INFO [skyscraper.core:389] - [download] Downloading http://asset.soup.io/asset/14416/2980_43ba.jpeg 20-07-24 10:48:39 edward-teach INFO [skyscraper.core:389] - [download] Downloading http://asset.soup.io/asset/11463/7069_735e.png 20-07-24 10:48:39 edward-teach ERROR [skyscraper.traverse:168] - [worker 0] Handler threw an error java.lang.Thread.run Thread.java: 834 java.util.concurrent.ThreadPoolExecutor$Worker.run ThreadPoolExecutor.java: 628 java.util.concurrent.ThreadPoolExecutor.runWorker ThreadPoolExecutor.java: 1128 ... clojure.core.async/thread-call/fn async.clj: 484 skyscraper.traverse/worker/fn traverse.clj: 201 skyscraper.traverse/worker/fn/fn traverse.clj: 201 ... skyscraper.core/sync-handler core.clj: 461 ... skyscraper.core/process-handler core.clj: 434 soupscraper.core/parse-json core.clj: 155 cheshire.core/parse-string core.clj: 205 cheshire.core/parse-string core.clj: 208 cheshire.parse/parse parse.clj: 90 com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken ReaderBasedJsonParser.java: 773 com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue ReaderBasedJsonParser.java: 1917 com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar ParserMinimalBase.java: 637 com.fasterxml.jackson.core.base.ParserMinimalBase._reportError ParserMinimalBase.java: 712 com.fasterxml.jackson.core.JsonParser._constructError JsonParser.java: 1840 com.fasterxml.jackson.core.JsonParseException: Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false') at [Source: (StringReader); line: 1, column: 2] location: #object[com.fasterxml.jackson.core.JsonLocation 0x4a00395 "[Source: (StringReader); line: 1, column: 2]"] originalMessage: "Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')" processor: 
#object[com.fasterxml.jackson.core.json.ReaderBasedJsonParser 0x5f97278d "com.fasterxml.jackson.core.json.ReaderBasedJsonParser@5f97278d"]
And here is the file that causes the trouble: 568380106.txt (I had to add .txt to upload it to GitHub).
I can confirm the same issue when trying to back up soup.gaf.io.