scrapinghub / portia

Visual scraping for Scrapy
BSD 3-Clause "New" or "Revised" License

Urgent quick change on UX #690

Closed glemiere closed 7 years ago

glemiere commented 7 years ago

Hi! I really love Portia, this software has very good potential! However, one small detail messes everything up.

When you add some URLs, each URL generates one button. But creating 200k buttons at once exhausts the video memory and crashes the browser, and as you can guess, being able to crawl more than 10 pages is clearly vital. This change would be pretty quick to make: just don't append any buttons.

It would also be great if you could add a quick way to parameterize URLs, like being able to say "id from 1 to 9999999" instead of giving a huge list of almost identical links. But fix that browser-crashing problem first ;)

Thank you!

ruairif commented 7 years ago

If all of your ids are sequential then we already have you covered. You can create what we call a generated url.

To start creating a generated url, first press the down arrow beside the START URLS section.

[screenshot: generation_url]

When the dropdown opens, click on "Add generation url". This will bring you to the create generation url window.

[screenshot: generation_view]

This window allows you to build up a url from different pieces. Here is an example of a generated url that will start with all listing pages on the site using a range.

[screenshot: generation_options]
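To illustrate the idea behind a range-based generated url, here is a minimal Python sketch. It is not Portia's implementation; the function name `generate_urls`, the `{id}` placeholder, and the example.com address are all assumptions made for the example. The point it shows is that a template plus a range can stand in for a huge list of near-identical links, and that the urls can be produced one at a time instead of all at once.

```python
from itertools import islice
from typing import Iterator


def generate_urls(template: str, start: int, stop: int) -> Iterator[str]:
    """Lazily yield one url per id in the inclusive range [start, stop].

    `template` is a hypothetical url pattern containing an `{id}` placeholder.
    Because this is a generator, no full list of urls is ever held in memory.
    """
    for item_id in range(start, stop + 1):
        yield template.format(id=item_id)


# Hypothetical site and range, matching the "id from 1 to 9999999" request.
urls = generate_urls("http://example.com/item/{id}", 1, 9999999)

# Only the first three urls are actually built here.
first_three = list(islice(urls, 3))
print(first_three)
```

Since the urls are generated on demand, a range of millions of ids costs no more memory than a range of ten.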

If a generated url doesn't work for you, why not try a feed url? You can start creating a feed url in the same way that you would make a generated url: press the down arrow beside the START URLS section and select "Add feed url". This will open the create feed url window.

[screenshot from 2017-01-20 09 18 06]

All you need to do is provide a link to a file containing urls structured like this, and whenever your spider starts it will download the file and start with the urls that it contains. This means that you can update the start urls for your spider without modifying it.
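As a rough sketch of how a feed could be consumed, the Python below assumes the feed is a plain text file with one url per line. That format is an assumption, since the linked example is not reproduced in this thread. The helper names `parse_feed` and `fetch_feed` are hypothetical, and this is not Portia's own code.

```python
from typing import List
from urllib.request import urlopen


def parse_feed(text: str) -> List[str]:
    """Return the urls found in `text`, assuming one url per line.

    Surrounding whitespace is stripped and blank lines are skipped.
    """
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            urls.append(line)
    return urls


def fetch_feed(feed_url: str) -> List[str]:
    """Download the feed file and parse it (assumes UTF-8 text)."""
    with urlopen(feed_url) as response:
        return parse_feed(response.read().decode("utf-8"))


# Offline example with hypothetical urls; fetch_feed would do the same
# after downloading the file from a real address.
sample = "http://example.com/a\n\n  http://example.com/b  \n"
print(parse_feed(sample))
```

Because the file is read each time the spider starts, editing the file is enough to change the start urls.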

Hopefully this helps. As a rule, if you need more than 10 start urls you should use generated and feed urls instead.

glemiere commented 7 years ago

Interesting, my interface doesn't look like that. I probably didn't install it correctly.

[screen shot 2017-01-20 at 19 47 41]

sagelliv commented 7 years ago

@glemiere That is the previous Portia version. For a more up-to-date version, check out the develop branch.

glemiere commented 7 years ago

That was my mistake! Awesome, thank you ;-)