Encoding issues - Githubissues

joepie91 commented 9 years ago

This is the canonical issue for the outstanding encoding issues in python-whois. It supersedes all the currently open issues; refer to those individual issues for more background.

Any help on resolving these would be much appreciated. Please leave a note on this thread if you plan on working on the issue, and explain what your suggested approach is. Also feel free to comment on the pending pull requests.

If you plan on submitting a new PR: Please test your solution for all the usecases below. Refactoring code to make it work well is perfectly okay and even desirable, even if it leads to large diffs. I'd like to have the absolute minimum amount of encoding-related code in the library code, to prevent these issues in the future.

Past work on this issue:

Merged: #4, #44
Pending: #59, #64
Known-problematic domains and IPs:

Note: IP WHOIS is not officially supported, but the listed IPs can serve as useful testcases regardless.

vatanim.com.tr
english.hk
bidtheatre.com
sesise.com
gcbcapp.com
rikke.fi
eldinamo.cl
ziddu.com
network.hu
biobiochile.cl
primicias.cl
planalto.gov.br (reported not to occur when reading the data from a file)
179.175.242.131
80.148.135.28
177.169.129.223
80.148.135.28
181.1.55.141
91.6.97.224
179.246.105.144
94.92.58.77

Also make sure to test the .co.th domains that are already in the python-whois test cases.

Usecases to keep in mind

pwhois: pwhois --raw domain.com > test/data/domain.com
- Must not store files differently from existing test data.
- Must work for files in non-Latin languages.
- When not redirecting output, it must print to stdout correctly.
pwhois: pwhois --json domain.com (#88)
- Must return JSON-specification-compliant UTF-8 strings.
pwhois: pwhois domain.com (#28, #88)
- Must work both for printing to stdout, and for optionally redirecting or piping to something else.
module: get_whois (parsed) (#28, #51, #57 )
- Both in standard and normalized variants
module: net.get_whois_raw (raw)
module: parse.parse_raw_whois (parsed)
- Both in standard and normalized variants
test.py: Domain name output (./test.py update)
test.py: Domain name output (./test.py run)
test.py: JSON diff output on error (./test.py run)
normalization: Must not interfere with normalization code.

Issues for which the usecase was unclear: #92

Caveats / Requirements

Must work on both Python 2 and Python 3 equivalently. The different string types across versions are a potential issue, and the socket module will return a different kind of string depending on the version. Currently supported versions are 2.6.x, 2.7.x, 3.3.x, 3.4.x.
Testing network-related issues (eg. encoding from source WHOIS server) can be challenging, because of often very strict rate limits. For the same reason, this can't be an automated test.
Not all WHOIS servers return Unicode or even ASCII. Some older ccTLD servers will return data in their local encoding.

floer32 commented 8 years ago

When you say, "Not all WHOIS servers return Unicode or even ASCII," do you mean, "Not all WHOIS servers return [UTF-8] or even ASCII"? Not trying to nit pick, just trying to make sure I'm not missing something in that statement

floer32 commented 8 years ago

What about something that tries most-common-encodings, then falls back to trying other encodings, and mixed encodings? Like PR #59 but instead of trying just 2, straight-up, use a library that has more complex behavior for dealing with weird cases — complex like a web browser, which actually results in "simple" results in a way. Such "browser" like behavior can be lossy on the edge cases (mojibake-ish) but since there are so many edge cases possible with WHOIS, maybe this is a way to catch the rest we haven't found or simulated yet. ("the internet is edge cases," says a colleague; and especially so in areas like this, since the WHOIS RFC doesn't specify encoding IIRC)

Suitable libraries, if approach is pursued:

ftfy: fixes text for you. ftfy "fixes Unicode that’s broken in various ways." -- I don't have experience with this lib yet, but it seems right on the nose. Seems ready to help, out of box, too
chardet, cchardet, and/or UnicodeDammit -- I do have experience gluing these together, for some finely tuned fall back logic that made sense for the proprietary product I was working on. This way would work too, would require additional original code and glue-code though.

FWIW, ftfy points out that it is very different from chardet.

I know this would be introducing a dependency, and pywhois doesn't really have deps right now. I think this is worth it though. #59-ish solutions will solve some cases, but if we try to solve things like ftfy can, without just using ftfy (or something like it) ... we will slowly wind up re-inventing something similar, I think.

What do you think ... should I give one a try?

Apologies, just saw #64. ^ cchardet is a bit more contentious because it's a C dependency and might have OS portability issues. might. Probably OK, but in any case, I believe ftfy is pure Python...

This is a very, very worthwhile reason to introduce a dependency. The problem is very general, and these libraries offer battle-hardened solutions. It will be a waste to do that work again. I really think using a library should be considered!

floer32 commented 8 years ago

How about this: optional dependencies. try to import the library, fall back to a simple implementation like the one in #59. README can just explain about the optional dependencies. Then it could even be a "package extra" for even smoother optional-installation-of-enhanced-unicode-edge-case-support. Some people will choose it, maybe some won't. It would just be a difference of pip install python-whois (for no extras) vs. pip install python-whois[extras] (get the extra dep, like chardet or ftfy). more info

Actually, come to think of it, this is exactly how BeautifulSoup's UnicodeDammit module works: "Unicode, Dammit’s guesses will get a lot more accurate if you install the chardet or cchardet Python libraries." I could see #64 being modified to be just like this.

I've just realized ftfy works with a smaller range of versions than python-whois. That could still be OK ... again if it's an "extra" and some users will just not-have it and we won't-use it. But maybe this is a reason to prefer chardet or cchardet ... wider support.

I will submit a PR doing somethin like described above, which may also combine/obviate #59 and #64 but w/o introducing mandatory deps

floer32 commented 8 years ago

OK, I've got code changes and all tests have passed w/ no external libs, with chardet, and with cchardet ... now I just need to go back tomorrow and prove it all and show my work, then will submit PR showing all that

floer32 commented 8 years ago

Yikes. I've run into some pretty serious dragons ... if you make everything unicode, you run into problems with the data loaded at the top of parse.py. csv doesn't support unicode. There are blobs of code that might address this without a dependency, but I have not gotten them working right ... kind of a mess.

There is a library out there that supports this, python-unicodecsv. It is compatible with "2.6, 2.7, 3.3, 3.4, 3.5, and pypy 2.4.0." After struggling with csv-and-unicode for a bit now, that seems like the smartest path ... will you consider a PR that adds unicodecsv as a mandatory dependency, or is that not OK no matter what?

floer32 commented 7 years ago

Hey just checking in here. Any thoughts?

tuler commented 7 years ago

Any progress on this? #64 seems like a good approach, but it's closed...

floer32 commented 7 years ago

I think my comment above still takes some action, for us to make progress.

I still think we need chardet, cchardet, and/or python-unicodecsv to deal with this. So I agree with you @tuler on the basic approach from #64, but check out my commits, where the user is given the option to choose between chardet and cchardet. I think we should should follow the example set by BeautifulSoup/UnicodeDammit, using chardet or cchardet to do the heavy lifting ... and at the same time we allow things to work even if neither lib is installed. It's just "progressive enhancement" based on what is installed.

So, I still think that's the right idea. But in my testing I found that this only solves part of the problem. Once we can detect character encodings and decode more data, that data gets further in the python-whois codebase, and reaches code that uses the csv module. As explained above, we probably can't finish better Unicode/encodings support without moving beyond this module's limitations. python-unicodecsv could help with this.

@joepie91 , I know you're hesitant to introduce these dependencies, but I don't think these issues will be solved without pulling out "the big guns." In my experience the WHOIS landscape is wild and messy. How do you feel about these 3 dependencies if we carefully manage to keep all of them optional?

tuler commented 7 years ago

@hangtwenty I agree with you that we need "big guns" for these issues. I've been looking for a good whois library, and was almost giving up (and trying to implement it myself), but I was happy when I found this one. It's the best among the ones I researched, especially because of the whois server "hopping".

But this unicode issue is really big for us because we're in Brazil.

I don't have the same issues with dependencies as @joepie91.

hummus commented 7 years ago

gotta +1 @hangtwenty and @tuler here regarding optional encoding detection. (I have encountered various use cases with those ancient local-encoding-only whois servs)

Introducing optional dependencies seems like a way to keep everyone happy, no?

But regarding adding a non-stdlib csv dependency, maybe this can be avoided since it seems to only be used for static-data and it can be assumed(?) they're all printable ascii so no need to fancy decode just unicode() or .decode("ascii") if you want unicode everywhere?

floer32 commented 7 years ago

@hummus doh, that's a good point. I think you're right. That seems like good news actually, hehe.

joepie91 / python-whois

Encoding issues #97

Past work on this issue:

Known-problematic domains and IPs:

Usecases to keep in mind

Caveats / Requirements