scrapy / scurl

Performance-focused replacement for Python urllib
Apache License 2.0

icu is commented out in Chromium source #35

Open malloxpb opened 6 years ago

malloxpb commented 6 years ago

Right now, this line is commented out. We should figure out a way to enable icu for this project :)

The older Chromium source has the ICU code commented out, probably because the third-party ICU library is difficult to configure. However, because the IDNA block in the Chromium source is commented out, an IDNA hostname such as ουτοπία.δπθ.gr is not converted to its ASCII form.

At the moment, we handle this by encoding the hostname with Python's encode() function, as can be seen here. However, it would be really nice to improve the performance of SCURL by figuring out how to use ICU in the Chromium source!
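For illustration, the Python-side workaround can be sketched roughly as below. This is a minimal sketch and not the actual SCURL code: the helper name `to_ascii_hostname` is hypothetical, and it assumes the hostname is encoded with Python's built-in "idna" codec, which implements the older IDNA 2003 rules.

```python
def to_ascii_hostname(hostname: str) -> bytes:
    """Convert a Unicode hostname to its ASCII form.

    Hypothetical helper: relies on Python's built-in "idna" codec,
    which encodes each dot-separated label separately.
    """
    return hostname.encode("idna")


# The hostname from this issue; the non-ASCII labels become
# Punycode labels carrying the "xn--" prefix.
encoded = to_ascii_hostname("ουτοπία.δπθ.gr")
labels = encoded.split(b".")

# Labels that are already ASCII pass through unchanged.
plain = to_ascii_hostname("example.com")
```

Doing this in Python works, but it adds a Python-level call for every non-ASCII hostname, which is the overhead that enabling ICU in the Chromium code would avoid.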

lopuhin commented 6 years ago

@nctl144 can you give more details here about why it's commented out at the moment?

malloxpb commented 6 years ago

Right, I forgot to explain this in more detail, sorry about that @lopuhin!

Right now there are a few problems. When ICU is installed on macOS, compiling the Cython extension on branch #34 fails with this error:

third_party/chromium/url/url_canon_icu.cc:140:29: error: no matching function for call to 'ucnv_fromUChars_62'
    int required_capacity = ucnv_fromUChars(converter_, dest, dest_capacity,
                            ^~~~~~~~~~~~~~~
/usr/local/include/unicode/urename.h:627:25: note: expanded from macro 'ucnv_fromUChars'
#define ucnv_fromUChars U_ICU_ENTRY_POINT_RENAME(ucnv_fromUChars)
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/include/unicode/uvernum.h:113:40: note: expanded from macro 'U_ICU_ENTRY_POINT_RENAME'
#define U_ICU_ENTRY_POINT_RENAME(x)    U_DEF2_ICU_ENTRY_POINT_RENAME(x,U_ICU_VERSION_SUFFIX)
                                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/include/unicode/uvernum.h:112:44: note: expanded from macro 'U_DEF2_ICU_ENTRY_POINT_RENAME'
#define U_DEF2_ICU_ENTRY_POINT_RENAME(x,y) U_DEF_ICU_ENTRY_POINT_RENAME(x,y)
                                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/include/unicode/uvernum.h:111:43: note: expanded from macro 'U_DEF_ICU_ENTRY_POINT_RENAME'
#define U_DEF_ICU_ENTRY_POINT_RENAME(x,y) x ## y
                                          ^~~~~~
<scratch space>:213:1: note: expanded from here
ucnv_fromUChars_62

and this is the error on Linux:

third_party/chromium/url/url_canon_icu.cc: In member function 'virtual void url::ICUCharsetConverter::ConvertFromUTF16(const char16*, int, url::CanonOutput*)':
third_party/chromium/url/url_canon_icu.cc:141:67: error: invalid conversion from 'const char16* {aka const short unsigned int*}' to 'const UChar* {aka const char16_t*}' [-fpermissive]
                                             input, input_len, &err);
                                                                   ^
In file included from /usr/local/include/unicode/platform.h:25:0,
                 from /usr/local/include/unicode/ptypes.h:52,
                 from /usr/local/include/unicode/umachine.h:46,
                 from /usr/local/include/unicode/utypes.h:38,
                 from /usr/local/include/unicode/ustring.h:21,
                 from third_party/chromium/url/url_canon_icu.cc:13:
/usr/local/include/unicode/ucnv.h:1250:1: note:   initializing argument 4 of 'int32_t ucnv_fromUChars_62(UConverter*, char*, int32_t, const UChar*, int32_t, UErrorCode*)'
 ucnv_fromUChars(UConverter *cnv,
 ^