Open nengine opened 9 years ago
@taganaka @tmaier, for some reason, if I use the code below in 0.5.0, non-English Unicode characters show up properly:
def doc
  return @doc if @doc
  @doc = Nokogiri::HTML(@body) if @body && html? rescue nil
end
However, this one does not. I'm not sure what problem this version of the function is intended to solve. Any suggestion is appreciated, as I would like to use 0.5.0 on my server without monkey patching the gem. Thanks a lot.
def doc
  return @doc if @doc
  @body ||= ''
  @body = @body.encode('utf-8', 'binary', invalid: :replace,
                       undef: :replace, replace: '')
  @doc = Nokogiri::HTML(@body.toutf8, nil, 'utf-8') if @body && html?
end
Text inside <title> with the first version:
<title>လူကုန်ကူးခံရသူတွေရဲ့ဘဝခရီး - BBC ပင်မစာမျက်နှာ</title>
Text inside <title> with the second version:
<title> - BBC </title>
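A minimal sketch of what seems to be going on, using plain Ruby only (no Nokogiri or polipus). It assumes the page body is already valid UTF-8, which may not hold for every crawled page. The sample string is shortened from the title above, and the `force_encoding` plus `scrub` alternative is only an illustration, not the gem's actual code or a confirmed fix.

```ruby
# Sample body; the Burmese fragment is taken from the title quoted above.
body = "<title>ဘဝခရီး - BBC</title>"

# Same call as in the second `doc` method: the bytes are declared to be
# 'binary' and transcoded to UTF-8. Bytes >= 0x80 have no mapping from
# binary to UTF-8, so `undef: :replace` with `replace: ''` drops them all.
stripped = body.encode('utf-8', 'binary', invalid: :replace,
                       undef: :replace, replace: '')

# Hypothetical alternative (an assumption, not the gem's code): treat the
# bytes as UTF-8 and remove only the sequences that are actually invalid.
kept = body.dup.force_encoding('utf-8').scrub('')

puts stripped
puts kept
```

With valid UTF-8 input, `stripped` keeps only the ASCII characters while `kept` is unchanged, which matches the two titles shown above.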
I'll take a look at this soon
Thanks for reporting
Thank you.
Hi taganaka, please let me know if you have had a chance to look into this.
I am trying to upgrade to 0.5.1 and saw the same issue. This is a regression of #40.
I don't think this project is maintained anymore.
Well it is. I'm still happily using it. Happy to accept PR too.
OK, great. I didn't see any activity for nearly 2 years, so I just thought it was not maintained anymore.
I was able to crawl Unicode pages in 0.4.0, but after upgrading to 0.5.0 only the English characters remain in a crawled page. Please let me know if there are any settings I have to change.