scalingexcellence / scrapybook

Scrapy Book Code
http://scrapybook.com/

Chapter 9 pipeline to mySQL not working/hangs/etc #28

Closed johnscrapetest87 closed 7 years ago

johnscrapetest87 commented 7 years ago

Hello, I made a few attempts at Chapter 9 and nothing seems to work. I was running the examples on pages 159-162, where a pipeline is set up to insert into a MySQL database. I was able to run MySQL in the VM dev environment, and setting up tables is fine, no problem.

When running scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=1000, the spider kicks off but hangs forever. This forces me to remove the VM connection and start all over from scratch. I recall you altered a port for MySQL to help with the Windows 10 bug / VM, but I'm not sure that is the problem. Please confirm that running the code from the folder ch09 writes successfully to MySQL. I am running the same file from the same folder and my spider hangs forever.

thanks -John

PS: Many of your book examples, in both Chapter 3 and Chapter 4, have different source HTML on :web:9312 from what's in the paperback textbook. Even the Appery.io website is different from the pictures in the textbook: there is no startscreen tab and no data tab after setting up your account. Not sure what the issue is with that one.

lookfwd commented 7 years ago

recall you altered a port for mysql to help with the Windows10 bug / VM

I just disabled proxying to your host because you were already running MySQL there and there was a conflict. So this shouldn't be the issue regarding ch09.

please confirm running the code from the folder ch09 writes successfully to mysql

I can confirm that ch09 works, as seen on this new video here, at 2:30.


I think it likely has to do with Elasticsearch and garbage collection on your laptop, which has only 2 CPUs. Try disabling the ES pipeline by commenting out the relevant line in settings.py.
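As a rough illustration of that suggestion, a settings.py fragment with the Elasticsearch pipeline commented out might look like the sketch below. ITEM_PIPELINES is Scrapy's standard setting for enabling pipelines, but the module paths and priority numbers here are hypothetical placeholders and may not match the names in the ch09 project; look for the Elasticsearch entry in your own settings.py.

```python
# settings.py (sketch only)
# NOTE: the dotted paths and priorities below are assumed for illustration;
# they are not taken from the book's ch09 code.
ITEM_PIPELINES = {
    "properties.pipelines.tidyup.TidyUp": 100,
    # Elasticsearch pipeline disabled by commenting out its entry:
    # "properties.pipelines.es.EsWriter": 800,
    "properties.pipelines.mysql.MysqlWriter": 700,
}
```

With the entry commented out, Scrapy should simply not load that pipeline, so items would flow only through the remaining ones (here, the MySQL writer).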

many of your book examples, in both chapter 3 and chapter 4 have different source html code on the :web:9312 as to what's in the paperback textbook

Do you mean that you expected to see an array with a price number but you don't get one, or that you expected a title and you don't get one? Exact values likely differ (they should be exactly the ones seen on the video above) but I wouldn't expect this to cause problems from the perspective of Learning Scrapy.

Even the Appery.io website is different than the pictures in the textbook

I'm not surprised. From page 64:

image

Appery.io is completely beyond my control and they changed their layout in the meantime. I wanted this chapter in the book because I feel it is motivating to startup audiences, who can see that with very little coding, some Scrapy and an online service they can have a demo-able Minimum Viable Product very quickly. It was inevitable that at some point they would change the layout, and this is why I kept this chapter small: just 13 pages. Here is another video I just made, with step-by-step instructions for appery.io's new layout. It's fast, but it's ok to pause and resume.


johnscrapetest87 commented 7 years ago

Wow, that was super helpful. Your book has wonderful examples and is set up well for real-world testing. But for all the beginner Scrapy peeps out there, following the chapters is frustrating because there are so many changes with Python versions, websites, HTML source code, and page content updates.
You have done an excellent job addressing all these differences, but for the regular person working through chapter by chapter, the confusion gets very frustrating. Just look at the very large differences in chapter 4 that the video covers. I gave up on that chapter after 15 minutes. That said, the book is an excellent work, but you should add links to your scrapybook homepage for "chapter 2 changes", "chapter 3 changes", "chapter 7 changes", etc. that summarize the old info (from the textbook) and the new info, for example the new "SCOPE tag" on appery.io. That would greatly help with reassurance and confidence as I plug through each set of code chapter by chapter. I guess it would be like "live links to book errata"... but it's not really errata, because the content is dynamically changing over time; it's not necessarily wrong text in the textbook or typos.

Again, thanks for replying so quickly and so comprehensively.

I've decided to download the Vagrant VM and VirtualBox software to my desktop Win8 system, which will run with fewer problems, surprises, etc. -J

lookfwd commented 7 years ago

I gave up on that chapter after 15 minutes.

That's ok. One chapter... not a big deal. I don't cry over it. I still believe it contributes as an inspirational chapter despite the fact that, without the video above, it might not be that usable anymore, unless someone is somewhat experienced with the web.

you should add links to your scrapybook homepage for "chapter 2 changes", "chapter 3 changes", "chapter 7 changes."

I've put the new videos on the home page today. It's not that bad. I don't know where the expectation comes from that when I type something I should get exactly the value that is in the printed book... For me it's enough that when one crawls for a price, (s)he gets a price and not a title. That's the goal. Do I say that this book is about "the price on the 2nd page being $392.33"? I don't think so. Go to the URL for that page, see the price and the title. If the price and the title you crawl with the Scrapy code I give aren't the same, then that is a problem.

that would greatly help with re-assurance and confidence as I plug thru each set of code chapter by chapter.

I agree... I am glad I got feedback similar to yours early while writing the book, so I solved to a great extent the problem of which you just got a tiny taste, and got frustrated.

I would like to ask you: on a scale of 1 to 10, how frustrated are you with this book, and on a scale of 1 to 10, how frustrated are you with the average book out there? Because if you got frustrated with this book, I guess you are outraged with the average book out there and you never actually attempt to run anything because you are discouraged by the first few pages after the introduction, isn't that so? As soon as they instruct you to e.g. install ElasticSearch v4.1, which you have to spend 20 minutes finding the link to because it's not available anymore, just to find out that all the code examples assume Mac while you're running Windows, and that you now also have to install Cygwin and guess how to convert all the examples because you're on your own... how does this typical book experience make you feel? Maybe this book then has a forum, where you go and find similarly outraged people... and the guy who posted something that "worked for me", which, desperate, you try... but it doesn't "work for you", since it's 6 months later and the "worked for me" guy was using a system patched like this and that... so even more frustration and wait time in forums, just to get the first example to run. That is the typical experience, isn't it?

Just to be fair to this book, you would have to compare it with everything else out there. Because apart from a tiny few books that work on a very abstract domain... every book out there is broken. I mean the entire code is broken... not just single pieces here and there! Maybe books on programming languages are a bit more stable, because languages change more slowly and try to be somewhat backwards compatible, partly because the authors of books on them sit on committees all the time, and shout and push back on (potentially useful) features that "they don't believe in", which coincidentally would also break their books. I think it's unfair to compare this book with books on programming languages. The majority of books that describe real-world, evolving, community frameworks are worthless by the time they go to print. Impossible to reproduce. In this case, I tried to provide a book where examples will work, now and in the future, because I care for the reader and the community, and to a great extent I delivered. Writing a book for a web scraping framework is, of course, even more difficult compared to a random framework, because the web changes all the time... and Scrapy aims to crawl the web... but still, it works! And keep in mind that for every reader, in order for them to download the virtual machines reliably and relatively fast, no matter if they are in the US or China, I pay for the best hosting available, which means that I pay ~$2 per reader, both for the few legal ones and the majority of illegal ones... and I won't talk about the non-existent royalties. It doesn't make any financial sense... but still I hope that the Scrapy community benefits and people learn how to use this great tool.

johnscrapetest87 commented 7 years ago

Yes, I agree with all of your comments. Because the code and the platforms are changing so quickly, it is difficult to capture everything. There are almost no good books out there for Scrapy except this one; you have done a very good job with this. The other web tutorials out there on Scrapy are not good at all. As for frustration, I feel the general Python books like "Introduction to Python" or "Automate the Boring Stuff with Python" are easier to read and would rate maybe a 2 or 3 for frustration... but their universe is only to run Python 2.7 or 3.6: it either works or it does not. Your setup is much more complex with the VM, but you have done an excellent job with it. I guess all in all the frustration level of your book is around a 4-5, and the crappy Scrapy web tutorials out there by others (even the one on scrapy.com!) are like a 7 or 8.

Again, it's a difficult subject matter to teach... you have done an excellent job. I'm sure this will get better in the next edition or ebook update.
I will surely recommend the text to others, but in the future, if there's a way to just run all the book examples from a simple Python 2.7.13 install (or Python 3.6 in the future), hitting a simple HTML page or pages on one of your practice web host servers, that would be preferable. I want to spend my time and use my brain cells on reading your excellent content and learning the details of Scrapy, not learning all the details of VM hardware and Win8 vs Win10 problems etc. It's a bit of a distraction, but the learning curve is well worth it. Thanks again.

lookfwd commented 7 years ago

Thank you very much!! :) What I would really love to see is the image size going down to 1/3 of its current size with some custom distribution or other optimizations. I think the long download time and the inherently slower VM start/stop times are very annoying. If everything were faster, I think most of the friction would be gone. P.S. There's a way to make the results of the first chapters exactly like the book. I will do it when I find a few hours.