scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Python 3 support #263

Closed extesy closed 8 years ago

extesy commented 11 years ago

Python 3 is several years old and most packages now support it (even Django!). It would be really nice to support it in Scrapy as well.

artemdevel commented 11 years ago

Scrapy uses Twisted at its core, so Python 3 support depends, at the very least, on Twisted's Python 3 support. The Twisted development team has a project to port Twisted to Python 3 and it is in progress, so I think that as soon as Twisted is ported, Scrapy will have a good chance of being ported as well.

todoit commented 11 years ago

mark

nramirezuy commented 11 years ago

we are waiting for http://www.python.org/dev/peps/pep-3156/

estin commented 11 years ago

for python3 I am developing
https://bitbucket.org/estin/pomp like scrapy but very small, unstable and without hard twisted dependency

coodoing commented 11 years ago

mark. The latest development branch (0.17) does not support py3.

ariddell commented 11 years ago

@nramirezuy there's a reference implementation for pep 3156 here: https://code.google.com/p/tulip/

muelli commented 9 years ago

Is there a list of what parts of Twisted are used? Twisted has a Python 3 migration plan (http://twistedmatrix.com/trac/wiki/Plan/Python3). It might be worthwhile to investigate whether the parts of Twisted that Scrapy uses are already ported.
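One rough way to answer the "which parts of Twisted are used?" question is to scan the source for imports. The sketch below is not taken from the Scrapy or Twisted projects: `twisted_modules` and the sample source are hypothetical names for illustration, and it assumes the scanned code parses under the interpreter running the script (Python 2-only syntax would raise a SyntaxError).

```python
import ast


def twisted_modules(source):
    """Return the set of twisted.* module names imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import twisted.web.client" -> "twisted.web.client"
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # "from twisted.internet import reactor" -> "twisted.internet"
            names = [node.module]
        else:
            continue
        found.update(
            name for name in names
            if name == "twisted" or name.startswith("twisted.")
        )
    return found


# Hypothetical sample source; a real check would read every .py file in the project.
sample = (
    "from twisted.internet import reactor, defer\n"
    "import twisted.web.client\n"
    "import os\n"
)
print(sorted(twisted_modules(sample)))
```

The resulting module names could then be compared by hand against Twisted's own list of ported modules.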

txtsd commented 9 years ago

Can scrapy not be made to work with python 3, now that asyncio is available?

ianozsvald commented 9 years ago

+1 for Python 3.4 support. After a year using Python 3 (mainly sklearn, numpy, Anaconda, matplotlib, networkx etc) this is the first blocker I've had forcing me to downgrade.

The only other Python2.7-only project that I'm lightly using is Apache Spark and 3.4+ support is scheduled for their next release. In their issue tracker I posted some stats for Python 3 adoption - roughly speaking it is ">40%" (accepting the self-selected group of survey participants): https://issues.apache.org/jira/browse/SPARK-4897?focusedCommentId=14303154&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14303154

kmike commented 9 years ago

@ianozsvald we are working on it, it is a priority :)

Scrapy is the worst kind of project to port to Python 3: it depends on Twisted (which is not ported to Python 3 yet, though some subset of it works), and it works at the boundary between the outside world and the Python world, so there are many questions about unicode. The "outside world" Scrapy works with is wild: there is no well-defined encoding we can use to decode or encode data. Encoding rules are sometimes crazy, e.g. browsers (which Scrapy aims to emulate) can use different charsets for different parts of a single URL, such as cp1251 for /path and utf-8 for GET parameter values. I've ported a lot of code to Python 3 (including most of NLTK and tens of other Python packages), but I'm still getting porting details wrong for Scrapy (e.g. https://github.com/scrapy/scrapy/pull/837 is wrong).
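To make the mixed-charset point concrete, here is a minimal sketch using only the standard library's `urllib.parse`. It is an illustration, not Scrapy's URL handling; the host, path segment and query value are made up.

```python
from urllib.parse import quote, unquote

# Hypothetical values: a Cyrillic path segment and a Cyrillic query value.
path_segment = "привет"
query_value = "мир"

# The same URL can carry bytes from two different charsets once percent-encoded.
encoded_path = quote(path_segment, encoding="cp1251")
encoded_query = quote(query_value, encoding="utf-8")
url = "http://example.com/" + encoded_path + "?q=" + encoded_query
print(url)

# Decoding the path with the wrong charset silently produces garbage
# (unquote defaults to utf-8 with errors="replace").
print(unquote(encoded_path))                     # replacement characters
print(unquote(encoded_path, encoding="cp1251"))  # the original text
```

Nothing in the URL itself says which charset each part used, which is why a single str/bytes rule for URLs is hard to pick.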

Some parts of Scrapy are already ported to Python 3. We're running tests for Python 3.3 on Travis to prevent regressions; ~240 tests pass in 3.3, out of ~1000. There is a GSoC project to port Scrapy to Python 3.x; I think we should make good progress this summer.

kmike commented 9 years ago

There is also a Scrapy dependency, mitmproxy (https://github.com/mitmproxy/mitmproxy), which doesn't have Python 3 support yet, but it is used only in tests.

ianozsvald commented 9 years ago

@kmike Hey Mikhail! You are a man of many projects :-) Glad to hear it is being worked on, I didn't get that impression from the early parts of this thread and couldn't see any other porting docs. I quite agree that this project (just like Flask et al.) is going to be hard, dealing with the interface to the outside world is horrid. I certainly didn't know that URLs themselves could have mixed encodings :-( Given the continual migration to Python 3 for personal projects (50/50 according to the survey I linked vs Python 2.7) and >40% for work, the need for scrapy's Py3 support is only going to get stronger. Bonne chance!

pbronez commented 9 years ago

+1 for Python 3 support! Thanks for the hard work you guys are putting into it, hope GSoC goes well.

nuschk commented 9 years ago

:+1: as well, would really love to be able to use python 3 with scrapy! And many thanks for your effort!

vmarkovtsev commented 9 years ago

You can use my patches with ported twisted.web.client.Agent and friends from my fork.

nyov commented 8 years ago

Are there still outside blockers for porting to python3? (twisted libs, etc.?) Would love to see a list of those, if one has been made.

Also, in the name of eventual portability (e.g. asyncio?) how do people feel about dropping dependencies on twisted for the web/downloader part? I recall there was a gsoc idea for this? Would be interesting to see if a downloader using pycurl bindings might work with twisted here. (Though pycurl has no cffi bindings at this time, so no pypy support.)

curita commented 8 years ago

There's a comprehensive status of the twisted dependencies in Berker's proposal. @berkerpeksag, would you mind if we put it up on our wiki for reference?

berkerpeksag commented 8 years ago

Sure, but that list is a bit outdated. For example, twisted.web.static has already been ported to Python 3. You may want to check twisted/python/dist3.py first.

curita commented 8 years ago

Will do, thanks!

curita commented 8 years ago

Here's the updated list: https://github.com/scrapy/scrapy/wiki/PY3%3A-Twisted-Dependencies.

nyov commented 8 years ago

Thank you both!

tonal commented 8 years ago

https://twistedmatrix.com/trac/ticket/7540 closed

curita commented 8 years ago

That's great news, thanks for reporting! I just updated the wiki.

ianozsvald commented 8 years ago

Hello all. I've got a Lightning Talk on Python3.5 at my next PyDataLondon meet (200+ data scientists in the room). Someone is bound to ask about scrapy/twisted on Python 3.4+, could someone comment on the current state? It isn't clear to me from the links above if enough of twisted has been ported for scrapy to run on Python 3 (or will soon)?

curita commented 8 years ago

Hi @ianozsvald, glad to hear about the interest in python3 support!

Currently Scrapy doesn't run on Python 3, not even a meaningful subset of it, but Twisted support isn't the only issue holding us back. Most of the Twisted modules used in Scrapy are already ported, and in some cases the features that use the unported ones could be deactivated, like telnet or mail (extensions that use mail could be deactivated or changed not to use mail on Python 3, for instance). twisted.web.client.Agent is still a problem, but this can be patched on our side.
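As a sketch of what "deactivating a feature" could look like from the project side: `TELNETCONSOLE_ENABLED` is an existing Scrapy setting, while the `PY3` flag is a hypothetical helper for this example, and whether a given Twisted module is actually unported is an assumption here. Mail-based extensions are left out.

```python
# settings.py (sketch): switch off a feature that relies on Twisted modules
# assumed to be unavailable on Python 3.
import sys

PY3 = sys.version_info[0] >= 3  # hypothetical helper flag, not a Scrapy setting

# Keep the telnet console on Python 2, disable it on Python 3.
TELNETCONSOLE_ENABLED = not PY3
```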

We stopped the python3 integration for some time because we couldn't agree on the type we should use to represent urls but thankfully that matter was resolved, though it hasn't been coded yet.

So, there aren't any big stoppers (not that I know of), just the time to get around it.

We haven't defined a deadline yet but it's something we want to see before the end of the year. On top of that, this weekend some scrapinghubbers will hold a sprint to accelerate the support, so maybe there'll be news sooner than expected :wink:

ianozsvald commented 8 years ago

Hi @curita, thanks for the note. For my data science audience I think scrapy is the only non-python-3.4 package that matters, everything else that they (and I) use is already running with Python 3.4. I wish you all luck in the conversion, knowing the data science stack is almost fully 3.4 compliant really helps when planning larger-scale projects.

rmax commented 8 years ago

Is there any plan to replace mitmproxy requirement for tests?

kmike commented 8 years ago

@darkrho I don't know; I was thinking about porting it, not replacing. Are there alternatives?

kmike commented 8 years ago

@ianozsvald I know your pain; Scrapy is the only reason I'm using Python 2 now :) At EuroPython @dangra and I tried to unblock further porting - the bottleneck was in the Request and Response objects, and they are ported now in https://github.com/scrapy/scrapy/pull/1384. It is still a long road to full Python 3 support, but we're in much better shape now - 507 tests are passing in Python 3, compared to 248 before the sprint. Working Request and Response objects open the gate for others' contributions, so I expect Python 3 Scrapy support to get more love soon.

ianozsvald commented 8 years ago

@kmike hey, that's lovely to hear (and Graham Markall [of Continuum] told me about the sprint), we'll certainly note this when we talk next week. Cheers!

kmike commented 8 years ago

For anyone interested in contributing I've created a wiki page (https://github.com/scrapy/scrapy/wiki/Python-3-Porting) with some information & guidelines.

stonebig commented 8 years ago

Apparently, Twisted for Python 3 is out: https://twitter.com/hawkieowl/status/670885245328166912

stonebig commented 8 years ago

I don't understand; I suppose the examples have not been rewritten for Python 3. This script:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

fails with this error:

>>> ================================ RESTART ================================
>>> 
Traceback (most recent call last):
  File "D:/WinPython/basedir34/buildBarebone/winpython-3.4.3/notebooks/scrapy.py", line 1, in <module>
    import scrapy
  File "D:/WinPython/basedir34/buildBarebone/winpython-3.4.3/notebooks\scrapy.py", line 4, in <module>
    class MySpider(scrapy.Spider):
AttributeError: 'module' object has no attribute 'Spider'
>>> 

Any idea how it should be written in Python 3?

kmike commented 8 years ago

Hey @stonebig,

There are Scrapy parts which work in Python 3, but Scrapy as a framework is not usable for end users in Python 3 yet. Please wait or help us :)

kmike commented 8 years ago

Apart from that, Spyder has nothing to do with Scrapy, and you're importing from your own scrapy.py module, not from the scrapy package. There are other channels for support: we use http://stackoverflow.com (ask a question with the scrapy tag), and there is also the scrapy-users Google group.
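The shadowing problem can be reproduced without Scrapy installed. The sketch below uses the standard library's `importlib.machinery.PathFinder` to ask what `import scrapy` would resolve to when the script's own directory is searched first; `find_module_origin` and the temporary directory layout are hypothetical, for illustration only.

```python
import importlib.machinery
import os
import tempfile


def find_module_origin(name, search_path):
    """Return the file that `import name` would load from search_path, or None."""
    spec = importlib.machinery.PathFinder.find_spec(name, search_path)
    return spec.origin if spec else None


# Hypothetical layout: a script directory that contains a file named scrapy.py,
# as in the traceback above.
with tempfile.TemporaryDirectory() as script_dir:
    shadow = os.path.join(script_dir, "scrapy.py")
    with open(shadow, "w") as f:
        f.write("# user script, not the Scrapy package\n")

    # The script's directory comes first on sys.path, so the local file wins.
    origin = find_module_origin("scrapy", [script_dir])
    shadowed = os.path.samefile(origin, shadow)

print("import scrapy resolves to the local script:", shadowed)
```

Renaming the script to something other than scrapy.py (and removing any stale compiled copy next to it) avoids the clash.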

stonebig commented 8 years ago

ok. I'll go to the user group. Sorry for the noise.

tonal commented 8 years ago

https://twistedmatrix.com/trac/ticket/7407 fixed
https://twistedmatrix.com/trac/ticket/6197 fixed

nyov commented 8 years ago

and updated in the wiki

redapple commented 8 years ago

Basic support is planned for v1.1, and we plan to make it more robust for v1.2.

vmarkovtsev commented 8 years ago

:+1:

stonebig commented 8 years ago

Great! Is there a document that estimates a rough timeline for these two milestones? Spring 2016 and summer 2016?

redapple commented 8 years ago

@stonebig, we plan on releasing Scrapy 1.1 officially by the end of February 2016 (with at least a release candidate in the next few days). Scrapy 1.2 would follow a couple of months after that (we hope).

stonebig commented 8 years ago

Thanks a lot for this information, @redapple!

KeremTubluk commented 8 years ago

Six days have come and gone, @redapple!

:)

redapple commented 8 years ago

@KeremTubluk, we're not quite there yet: https://github.com/scrapy/scrapy/milestones/v1.1

manugarri commented 8 years ago

Aaand it's official now: http://doc.scrapy.org/en/stable/news.html#id1

ghost commented 8 years ago

it seems that the twisted already supports py3.3+

d0ugal commented 8 years ago

> it seems that the twisted already supports py3.3+

@ABSmiLT Yeah, AFAICT Twisted only recently supported 3 well enough for Scrapy. Hence all the discussion above and in the docs.

kmike commented 8 years ago

@ABSmiLT we've released scrapy 1.1rc1 with alpha-level Python 3 support about a month ago. 1.1rc2 will be released soon; it fixes several Python 3 compatibility issues we've found while testing 1.1rc1.

ghost commented 8 years ago

Thanks for the information, @kmike @d0ugal. Looking forward to the new stable version compatible with py3.