hellock / icrawler

A multi-thread crawler framework with many builtin image crawlers provided.
http://icrawler.readthedocs.io/en/latest/
MIT License
857 stars 174 forks source link

Move to newer html unescape #101

Closed kethan1 closed 3 years ago

kethan1 commented 3 years ago

Move to html.unescape from html_parser.HTMLParser().unescape. html_parser.HTMLParser().unescape has been deprecated for a while and has been removed in 3.9. Fixes #99.

kethan1 commented 3 years ago

html.unescape is only available in 3.4+. So it has support for all the supported Python versions and more. I am going to remove 2.7 from the CI (as it failed).

kethan1 commented 3 years ago

Huh, the CI seems to be failing for the master branch, so I assume it's not my changes that made it fail. I might be wrong as I didn't take an extensive look at the CI.

kethan1 commented 3 years ago

I've tested out this PR and it works reliably on 3.9. Haven't tested with other Python version, but I would assume it would work

ZhiyuanChen commented 3 years ago

Well, I don't think it is a good idea to remove support of Python 2.7 There are quite some options,

  1. try import http, and fallback to import html_parser.HTMLParser as http in case module not found
  2. make separate package for 3.4- and 3.4+ What do you think? @hellock
kethan1 commented 3 years ago

Python 2 is EOL, though I understand why it would be useful to keep support. I will add the fallback and add 2.7 and 3.5 back to the CI.

ZhiyuanChen commented 3 years ago

Yes I am aware Python 2 is EOL, so maybe it's better to provide separate package for python 2?

kethan1 commented 3 years ago

I think this package can still easily support 2.7, but I don't think it should. I don't think providing a separate package is a good idea. Almost nobody uses Python 2 anymore (compared to the number of people who use Python 3). Python 2 is removed or in the process of getting removed in many major Linux OSes. Tons of other projects are dropping or have dropped support for Python 2. If somebody really needs to used Python 2, they can just install an older version of the package.

ZhiyuanChen commented 3 years ago

Thank you very much for your work. I have asked Kai (the owner of this repo) It is time we deprecated support of python2. Could you please revert the recent commit?

kethan1 commented 3 years ago

Okay, will do!

ZhiyuanChen commented 3 years ago

I will merge this probably next week as I have an exam on Tuesday. After I built the one last package that supports Python 2, which contains the fixes in https://github.com/hellock/icrawler/pull/100

kethan1 commented 3 years ago

Okay, sure!

ZhiyuanChen commented 3 years ago

Sorry for the delay, just merged and published new package on PyPi

kethan1 commented 3 years ago

All good! Thanks for merging! It might have been better to release it under v0.7.0 or v1.0.0 as it does contain remove support for some versions.