scrapinghub / portia

Visual scraping for Scrapy
BSD 3-Clause "New" or "Revised" License

No Items scraped - Portia - Download as Scrapy project #738

Closed by stephenrt42 7 years ago

stephenrt42 commented 7 years ago

Hello,

I recently got Portia up and running on my Windows 10 box using Docker. I can create a Portia project in the web interface and run it from the command line; everything works and I can export my scraped items to a JSON file.

The problem is that when I download the Portia project using the web interface feature "spider -> download as scrapy", the Scrapy version doesn't return any items when I run it. All I get is "Debug Crawled (200)...", but no items are actually scraped.

I was thinking maybe I forgot to click something in the web interface or change a setting in my Scrapy settings file. I'm at a loss as to why this isn't working, especially since it works in Portia but not in Scrapy.

Also, I tried deploying the Portia project files to my scrapyd server and got the same thing: "Debug Crawled (200)..." but nothing scraped.

Any ideas what might be missing?

S.

ruairif commented 7 years ago

Did you try running the other download version? You can run that one using the command:

docker run -i -t --rm -v <PROJECTS_FOLDER>:/app/data/projects:rw -v <OUTPUT_FOLDER>:/mnt:rw -p 9001:9001 scrapinghub/portia \
portiacrawl /app/data/projects/PROJECT_NAME SPIDER_NAME -o /mnt/SPIDER_NAME.jl
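
For anyone scripting this, here is a minimal Python sketch that assembles the same command as an argument list, with one -v flag per mounted folder. The function name and its parameters are made up for illustration; the flags, image name and container paths are the ones from the command above.

```python
import subprocess


def build_portiacrawl_command(projects_folder, output_folder, project, spider):
    """Build the docker argument list for portiacrawl (hypothetical helper)."""
    return [
        "docker", "run", "-i", "-t", "--rm",
        # Each host folder needs its own -v flag.
        "-v", f"{projects_folder}:/app/data/projects:rw",
        "-v", f"{output_folder}:/mnt:rw",
        "-p", "9001:9001",
        "scrapinghub/portia",
        "portiacrawl", f"/app/data/projects/{project}", spider,
        "-o", f"/mnt/{spider}.jl",
    ]


def run_portiacrawl(projects_folder, output_folder, project, spider):
    """Run the crawl; requires Docker to be installed and running."""
    command = build_portiacrawl_command(projects_folder, output_folder, project, spider)
    return subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.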

If you upload your project here (the one in the folder) I'll try to see why it isn't generating a scrapy project correctly.

stephenrt42 commented 7 years ago

Hello ruairif,

Thanks for responding. Yes, I can run the above code and export. However, since I'm using Windows 10 and not an Ubuntu environment, some things are slightly different. Actually, I'm working on setting up a dual boot on my laptop because Windows 10 isn't proving to be a good environment for this type of work.

Anyway, here's my code for scraping in Portia on Windows 10:

docker run -i -t --rm -v /c/Users/asus1/documents/portiaprojects/projects:/app/data/projects:rw -v /c/Users/asus1/documents/portiaprojects/exportfeed:/mnt:rw -p 9001:9001 scrapinghub/portia /app/slybot/bin/portiacrawl /app/data/projects/proxySite1 spys.ru -o /mnt/spys.ru5.json

You'll notice that each folder gets its own -v argument, since the command line didn't accept a single -v covering both. Either way, the outcome is the same: I can run Portia from the command line and scrape a site, and my project files and output JSON (scraped items) are stored locally on my hard drive.

To give you some background, I'm attempting to scrape a website that contains a list of free proxies. My intention is to use Portia to quickly create Scrapy projects and run them through a scrapyd server, so that I can schedule a batch process to do this on a daily basis. I also want to use Portia as a tool to create quick Scrapy projects.
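
As a rough sketch of the scheduling step, the following Python builds a request for scrapyd's schedule.json endpoint using only the standard library. The base URL (scrapyd's default port 6800 on localhost) and the helper name are assumptions for illustration; the project and spider names would be whatever was deployed to scrapyd.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(base_url, project, spider):
    """Build a POST request for scrapyd's schedule.json (hypothetical helper)."""
    data = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(base_url.rstrip("/") + "/schedule.json", data=data)


def schedule_spider(base_url, project, spider):
    """Send the request; needs a reachable scrapyd server."""
    with urlopen(build_schedule_request(base_url, project, spider)) as response:
        return response.read().decode("utf-8")


# Example call, assuming scrapyd runs locally on its default port:
# schedule_spider("http://localhost:6800", "proxyCrawl", "spys.ru")
```

A daily run could then be a scheduled task that calls schedule_spider once a day.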

You'll have to excuse me, I just got Docker installed last night and Portia all set up, so I'm new to this Docker stuff. Also, I haven't used git that much, except to clone projects and test them out.

How would you like me to upload my project?

Thanks again,

Stephen

stephenrt42 commented 7 years ago

Hello ruairif,

OK, I did some reading and testing. It appears that there's something odd going on in how Portia generates the "Download as scrapy" project files.

Here's what I know so far.

  1. When I create a project in Portia and run it from the command line, it extracts the items I selected and works:

     docker run -i -t --rm -v /c/Users/asus1/documents/portiaprojects/projects:/app/data/projects:rw -v /c/Users/asus1/documents/portiaprojects/exportfeed:/mnt:rw -p 9001:9001 scrapinghub/portia /app/slybot/bin/portiacrawl /app/data/projects/proxyCrawl spys.ru -o /mnt/spys.json

  2. When I "download as scrapy" and look at the main spider script in the project, I noticed the following:

     items = [[Item(PortiaItem, None,
         u'table:nth-child(3) > tbody > tr:nth-child(4) > td > table > tbody > tr:nth-child(n+4):nth-child(-n+33), table:nth-child(3) > tbody > tr:nth-child(4) > td > table > tr:nth-child(n+4):nth-child(-n+33), table:nth-child(3) > tr:nth-child(4) > td > table > tbody > tr:nth-child(n+4):nth-child(-n+33), table:nth-child(3) > tr:nth-child(4) > td > table > tr:nth-child(n+4):nth-child(-n+33)',
         [Field(u'ipAddress', '', []),
          Field(u'type', '', [])])]]

     As you can see, the fields ipAddress and type have no annotations set, so I think the annotations are getting lost somewhere down the line.
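
A quick way to spot this symptom in a generated spider is to scan its source for Field(...) calls whose second argument is an empty string. The sketch below does that with the standard ast module. It assumes the second positional argument is where the field's selector would go, which is a guess based on the snippet above and not confirmed in this thread; the function name is made up for illustration.

```python
import ast


def find_empty_fields(source):
    """Return names of Field(...) calls whose second argument is ''."""
    empty = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both Field(...) and module.Field(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "Field" or len(node.args) < 2:
            continue
        field_name, selector = node.args[0], node.args[1]
        if (isinstance(field_name, ast.Constant)
                and isinstance(selector, ast.Constant)
                and selector.value == ""):
            empty.append(field_name.value)
    return empty


# Shortened stand-in for the generated spider code (selector abbreviated).
SAMPLE = (
    "items = [[Item(PortiaItem, None, u'table > tr', "
    "[Field(u'ipAddress', '', []), Field(u'type', '', [])])]]"
)
print(find_empty_fields(SAMPLE))
```

The source is only parsed, never executed, so Item, Field and PortiaItem do not need to be importable.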

Ok, I've attached a copy of my project, please take a look at it when you can.

Thanks,

Stephen

Attachments: proxyCrawl.zip, proxyCrawl_download_as_scrapy.zip

ruairif commented 7 years ago

It looks like, due to the page structure (nested tables and strange offsets), portia2code is having trouble creating the correct selectors for that page. I don't have time to make it handle them correctly at the moment, so if you could make a PR so it can handle the structure, that would be great.

stephenrt42 commented 7 years ago

Hey ruairif,

Although I'd love to help, I'm really new to the Python language (2 months in). I'm afraid that this issue will have to be postponed to a later time.

Either way, thank you for your help. It's appreciated and the work you guys are doing here is totally awesome. I hope in the near future my Python/JSON skills are better, so I can contribute to the great work that's being done here.

Cheers,

Stephen

Landry009 commented 6 years ago

Hello, I'm new to Scrapinghub Portia and I use it for the company where I work. Previously I did my web scraping with the online version of Portia without any problem, but now when I create my spider and select my items, it shows other information that I didn't select, and when I run my spider it returns or scrapes nothing. Thanks in advance.