nasa-jpl-memex / sce

Sparkler Crawl Environment - a packaged, dockerized version of http://github.com/USCDataScience/sparkler.git
http://irds.usc.edu/sparkler/
Apache License 2.0

Add/remove urls from the deep crawl list #19

Open wmburke opened 7 years ago

wmburke commented 7 years ago

How do we add/remove URLs from the deep crawl list?

wmburke commented 7 years ago

How do you want this to work? I'm asking so we can make it as easy as possible on you guys. There are several capabilities we need to implement here:

  1. add initial seeds - this could be a long list of seeds or just one
  2. add additional seeds
  3. mark/unmark seeds for deep crawling
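
To make the three capabilities above concrete, here is a rough sketch of what a seed list could look like. Everything in it is an assumption for discussion: `SeedList` and its method names are hypothetical and are not part of Sparkler or sce.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SeedList:
    """Hypothetical in-memory seed list; not an existing Sparkler/sce API."""

    # Maps seed URL -> True if the seed is marked for deep crawling.
    seeds: Dict[str, bool] = field(default_factory=dict)

    def add(self, urls, deep=False):
        """Add one URL or a list of URLs; existing seeds keep their deep flag."""
        if isinstance(urls, str):
            urls = [urls]
        for url in urls:
            url = url.strip()
            if url:
                self.seeds.setdefault(url, deep)

    def remove(self, url):
        """Drop a seed; removing an unknown URL is a no-op."""
        self.seeds.pop(url, None)

    def set_deep(self, url, deep=True):
        """Mark (deep=True) or unmark (deep=False) an existing seed."""
        if url not in self.seeds:
            raise KeyError(f"unknown seed: {url}")
        self.seeds[url] = deep

    def deep_urls(self):
        """Return the URLs currently marked for deep crawling, sorted."""
        return sorted(url for url, is_deep in self.seeds.items() if is_deep)
```

The same `add` call covers both the initial seeds (a long list or a single URL) and any later additions, so the interface would only need one entry point for injecting seeds.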

Where are we going to store this data? Should it go directly in the database, or in a text file? Either way, when we add a new seed we need an API call that injects it.
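
If we went the text-file route, one possible layout is one URL per line, with an optional tab-separated `deep` flag. This is only a sketch of a format I'm assuming for discussion; the `parse_seeds` and `format_seeds` helpers are hypothetical and nothing in Sparkler reads this format today.

```python
def parse_seeds(text):
    """Parse 'URL[<TAB>deep]' lines into {url: is_deep}.

    Blank lines and lines starting with '#' are ignored.
    """
    seeds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        url, _, flag = line.partition("\t")
        seeds[url.strip()] = flag.strip().lower() == "deep"
    return seeds


def format_seeds(seeds):
    """Serialize {url: is_deep} back to the same line format, sorted by URL."""
    lines = []
    for url in sorted(seeds):
        lines.append(f"{url}\tdeep" if seeds[url] else url)
    return "\n".join(lines) + "\n"
```

A file like this is easy to edit by hand and to diff, but marking or unmarking a seed means rewriting the file, which is part of why the database option is worth weighing.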

Let me know what is required to do these things and which approach you think is best, and I'll work on the interface.