postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0
5.45k stars 446 forks

Invalid response for Russian text #464

Closed grigoriy-didorenko closed 5 years ago

grigoriy-didorenko commented 5 years ago

Hi,

I tried to process the same URL (the page is UTF-8 encoded) in three ways:

  1. Only the URL is provided (the response comes back with the Cyrillic text encoded)
  2. URL and contentType=text (works nicely)
  3. [Major] URL and valid prefetched HTML are provided (invalid response)

As a result, I received three completely different response formats. How can I fix the encoding and the other strange symbols so that I get a valid response, as in the second case?

URL: https://www.unian.ua/politics/10634892-zelenskiy-pidpisav-klyuchoviy-dlya-borotbi-iz-korupciyeyu-ukaz-shchodo-elektronnih-poslug.html

Expected Behavior

The same output for the three cases described below: with only the URL, with prefetched HTML, and with contentType defined.

Current Behavior

The same URL produces output in three different formats.

Steps to Reproduce

  1. Run Mercury.parse('https://www.unian.ua/politics/10634892-zelenskiy-pidpisav-klyuchoviy-dlya-borotbi-iz-korupciyeyu-ukaz-shchodo-elektronnih-poslug.html', {contentType: 'text'});
  2. Receive valid text. Russian/Ukrainian characters are displayed correctly.

First problem:

  1. Run with valid prefetched HTML: Mercury.parse('https://www.unian.ua/politics/10634892-zelenskiy-pidpisav-klyuchoviy-dlya-borotbi-iz-korupciyeyu-ukaz-shchodo-elektronnih-poslug.html', {html: prefetchedHtml, contentType: 'text'});
  2. Receive an invalid (encoded?) response. See the sample below.

Second problem:

  1. Run Mercury.parse('https://www.unian.ua/politics/10634892-zelenskiy-pidpisav-klyuchoviy-dlya-borotbi-iz-korupciyeyu-ukaz-shchodo-elektronnih-poslug.html');
  2. Receive invalid text. Russian/Ukrainian characters come back encoded as numeric character references instead of readable text.

Detailed Description

Received response (contentType not defined). I am copying only a few characters because, when the issue is published, the text is automatically decoded and displayed correctly: & #x414;& #x43E;& #x43A;& #x443;& #x43C;& #x435;& #x43D;& #x442;
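
For reference, the sample above looks like a run of HTML numeric character references. The sketch below shows one way such references could be decoded back to readable text after parsing. It assumes the space after each `&` in the sample is there only to stop the issue tracker from rendering it. `decodeNumericEntities` is a hypothetical helper, not part of Mercury, and it does not handle named references such as `&amp;`.

```javascript
// Hypothetical helper (not part of Mercury): decodes hexadecimal (&#x414;)
// and decimal (&#1044;) numeric character references into characters.
function decodeNumericEntities(text) {
  return text
    .replace(/&#x([0-9a-fA-F]+);/g, (_, hex) =>
      String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#([0-9]+);/g, (_, dec) =>
      String.fromCodePoint(parseInt(dec, 10)));
}

// The sample from this issue, with the spaces after "&" removed (assumption).
const sample = '&#x414;&#x43E;&#x43A;&#x443;&#x43C;&#x435;&#x43D;&#x442;';
console.log(decodeNumericEntities(sample)); // prints "Документ"
```

This is a post-processing workaround only. It does not explain why the output differs between contentType settings.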

Received response (prefetched HTML):

/: 707=0G0TBLAO, 4>:CG5@3>25 ?@>2545==O ?>2=>3> 0C48BC @>1>B8 45@602=8E @5TAB@V2.'8B09B5 B0:>65;5=AL:89 2V42V402 V (0EB0@5< (D>B>)�!L>3>4=V 2 #:@0W=V =0@0E>2CTBLAO ?>=04 350 45@602=8E @5TAB@V2, 1V;LHVABL 7 O:8E DC=:FV>=CNBL 01A>;NB=> =55D5:B82=> B0 =5?@>7>@>. &5 T 3>;>2=>N ?@8G8=>N 28=8:=5==O @5945@AB20, :>@C?FVW B0 V=H8E AE5< ?@8 =040==V @V7=>A;C3, =0?@8:;04 C 1C4V25;L=V9 G8 75 $54>@>2.0 9>3> A;>2040;LH5 WE=T C?>@O4:C20==O, ?>H8@5==O T48=>3> V45=B8DV:0B>@0 DV78G=>W >A>18 4;O ?>2'O70==O 40=8E 7 @V7=8E @5TAB@V2 V 2?@>20465==O 5;5:B@>==>W 270T<>4VW 40ABL 7<>3C =5 28ABV9=> >4=C 9 BC A0@2V4:8 V 7=0G=> A?@>AB8B8 2AV 45@602=V ?>A;C38.�@>7>@0 B0 =04V9=0 @>1>B0 45@6@5TAB@V2 1C45 70?>@C:>N 157?5G=>3> @>728B:C 1V7=5AC B0 5D5:B82=>3> C?@02;V==O :@0W=>N 2 FV;>402 @04=8: 3;028 45@6028.>4=>G0A C:07 ?@57845=B0 ?5@5410G0T 70?@>20465==O T48=>3> 251-?>@B0;C 5;5:B@>==8E ?>A;C3, 45 C:@0W=FV 7<>6CBL 70 4>?><>3>N 5;5:B@>==>3> :01V=5BC >B@8ABC? 4> V=D>@4> A515 2 45@602=8E @5TAB@0E (=0?@8:;04, ?@> , 75@B, ?>40B:8 V 4>E>48) B0 ?@V>@8B5B=V 5;5:B@>==V 45@602=V ?>A;C38. "0:, C:07>< ?5@H>G5@3>2> ?5@5410G0TBLAO @50;V70FVO 5;5:B@>==8E ?>A;C3, ?>2O70=8E V7 =0@>465==O< 48B8=8 (?@>5:B �T0;OB:>�), @5TAB@0FVTN WW 6820==O, @5TAB@0FVTN DV78G=>W >A>18 O: ?;0B=8:0 ?>40B:V2 ?V4 G0A ?5@H>3> >D>@@B0 3@>N =0O2=>ABV C 2>4VO 4>:CI>.�0@07V 2>48<> 4>A;V465==O I>4> :>@C?FVW C AD5@V 04A;C3. 65 B>G=> <>6=0 A:070B8, I> F5 6 @ 200 =091V;LH 206;828E ?>A;C3 4;O 3@>1 ?5@H>G5@3>2> 70?@>2048B8 WE=T =040==O 2 @568=;09=. 
>65= C:@0W=5FL 7<>65 10G8B8 =0H ?;0= B0 ?5@51V3 @50;V70FVW F8E 70240=L =0 A?5FV0;L=>2V4>$54>@>2."0:>6 C:07>< 70?>G0B:>2CTBLAO @>1>B0 =04 ?@>2545==O< 5;5:B@>==8E 281>@V2 B0 5;5:B@>==>3> ?5@5?8AC =0A5;5==O 2 #:@0W=V.!5@54 :;NG>28E 70240=L, ?5@5410G5=8E C:07>@5==O =0?>2=5==O 48=>3> 45@602=>3> 45>3@0DVG=3> @5TAB@C, O:89 =0@07V 14 20465==O V==>20FV9=8E 70A>1V2 5;5:B@>==>W V45=B8DV:0FVW.�V4ACB=VABL C 3@>1V2 eID T :;NG>28< 10@T@>< =0 H;OEC @>728B:C 5-?>A;C3 B0 F8D@>2>W 5:>=>A5=8 FL>3> @>:C @>7?>G0B8 2840GC ?0A?>@BV2 C 283;O4V ID-:0@B>: >4@07C 7 5;5:B@>==8< ?V4?8A>@8B5B @>18<> =0 70?@>20465==V B0 20FV9=8E ?V4E>4V2, O: MobileID B0 SmartID, 0 B0:>6 V=H8E 0;LB5@=0B82=8E 70A>1V2. )>1 :>65= C:@0W=5FL 7<>30 H284H5 2V4GCB8 ?5@52038 F8D@>28E ?5@5B2>@5=L C :@0W=V�, 707=0G82 @04=8: ?@57845=B0.

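A note on the prefetched-HTML sample above: the garbled text looks consistent with each Cyrillic character having been reduced to the low byte of its code point, which is what a one-byte-per-character (latin1/"binary") conversion of a string would do. This is an observation about the pattern, not a confirmed diagnosis of the parser's internals. The minimal sketch below reproduces the pattern; `truncateToLowByte` is a hypothetical helper written for illustration.

```javascript
// Hypothetical helper: keeps only the low byte of each character's UTF-16
// code unit, mimicking a one-byte-per-character string-to-bytes conversion.
function truncateToLowByte(text) {
  return Array.from(text, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) & 0xff)).join('');
}

// "проведення повного" appears in the garbled sample as "?@>2545==O ?>2=>3>".
console.log(truncateToLowByte('проведення')); // prints "?@>2545==O"
console.log(truncateToLowByte('повного'));    // prints "?>2=>3>"
```

If this is what happens, the damage is lossy: characters outside the Cyrillic block would collide with plain ASCII, so the original text cannot be recovered reliably from the garbled output and the conversion would need to be avoided upstream.
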
mindfulme commented 4 years ago

Did you get it to work?

mindfulme commented 4 years ago

I see something similar in my parsed HTML: lots of symbols that are supposed to be Russian text.

Screenshot 2020-08-03 at 02 17 33