taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License
92 stars 32 forks

Unicode pages does not work anymore on 0.5.0 #71

Open nengine opened 9 years ago

nengine commented 9 years ago

I was able to crawl Unicode pages in 0.4.0, but after upgrading to 0.5.0 only some English characters remain in a crawled page. Please let me know if there are any settings I have to change.

nengine commented 9 years ago

@taganaka @tmaier, for some reason, if I use the code below in 0.5.0, non-English Unicode characters show properly:

    def doc
      return @doc if @doc
      @doc = Nokogiri::HTML(@body) if @body && html? rescue nil
    end

However, this one does not. I'm not sure what this function is intended to solve. Any suggestion is appreciated, as I would like to use 0.5.0 without monkey-patching the gem on my server. Thanks a lot.

    def doc
      return @doc if @doc
      @body ||= ''
      @body = @body.encode('utf-8', 'binary', invalid: :replace,
                                              undef: :replace, replace: '')
      @doc = Nokogiri::HTML(@body.toutf8, nil, 'utf-8') if @body && html?
    end

nengine commented 9 years ago

Text inside <title> appears correctly in 0.4.0:

    <title>လူကုန်ကူးခံရသူတွေရဲ့ဘဝခရီး - BBC ပင်မစာမျက်နှာ</title>

Text inside <title> is gone in 0.5.0. Only English text remains.

    <title> - BBC </title>

taganaka commented 9 years ago

I'll take a look at this soon.

Thanks for reporting.

nengine commented 9 years ago

Thank you.

nengine commented 8 years ago

Hi taganaka, please let me know if you have had a chance to look into it.

tmaier commented 7 years ago

I am trying to upgrade to 0.5.1 and saw the same issue. This is a regression of #40.

nengine commented 7 years ago

I don't think this project is maintained anymore.

taganaka commented 7 years ago

Well, it is. I'm still happily using it. Happy to accept PRs too.

nengine commented 7 years ago

OK, great. I didn't see activity for nearly 2 years, so I just thought it was not maintained anymore.
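For anyone picking this up: the symptom reported above is consistent with the `encode('utf-8', 'binary', ...)` call in the 0.5.0 `doc` method quoted earlier in the thread. Declaring the source encoding as `binary` means every byte above 0x7F has no UTF-8 mapping, so `undef: :replace` with `replace: ''` drops it and only ASCII survives. The sketch below reproduces that outside Polipus. The helper names `lossy_transcode` and `relabel_as_utf8` are hypothetical, and the second helper is only one possible direction for a fix, assuming the page really is UTF-8; it is not code from the gem.

```ruby
# Hypothetical helper: the same encode call as in the quoted 0.5.0 `doc`
# method. Treating the input as 'binary' makes every non-ASCII byte an
# undefined conversion, and `replace: ''` silently deletes it.
def lossy_transcode(body)
  body.encode('utf-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

# Hypothetical alternative (an assumption, not Polipus code): relabel the
# bytes as UTF-8 and drop only the sequences that are actually invalid.
# This is only appropriate when the page is known to be UTF-8.
def relabel_as_utf8(body)
  body.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# Title text from this issue, tagged as binary the way a raw HTTP body may be.
raw = 'လူကုန်ကူးခံရသူတွေရဲ့ဘဝခရီး - BBC ပင်မစာမျက်နှာ'.b

puts lossy_transcode(raw).inspect   # only the ASCII part is left
puts relabel_as_utf8(raw)           # the Burmese text is preserved
```

A real fix would presumably need to respect the charset from the response headers or the page's meta tag rather than assume UTF-8 everywhere.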