nasa-jpl-memex / memex-explorer

Viewers for statistics and dashboarding of Domain Search Engine data
BSD 2-Clause "Simplified" License

Allow user to specify how many rounds nutch will crawl. #439

Closed brittainhard closed 9 years ago

brittainhard commented 9 years ago

Right now the crawl runs indefinitely. This is not necessarily good, and the user has no way to see the progress of the crawl. We should allow the user to specify how many rounds they want, and/or let them see how many rounds have been run and manually restart a round.

lewismc commented 9 years ago

In Nutch this is not true. Have you investigated the features of Nutch which limit

Maybe the functionality is just not implemented within memex-explorer.

asitang commented 9 years ago

I agree with Lewis. These features are already there in Nutch.

brittainhard commented 9 years ago

Let me clarify, @lewismc @asitang: I was not implying that Nutch itself keeps running indefinitely, but rather that we currently have the crawler running indefinitely.

If you look here https://github.com/memex-explorer/memex-explorer/blob/master/source/apps/crawl_space/crawl_runners.py#L210-236 , you can see that the crawler currently runs in a loop, starting a new round each time.

I am currently in the process of refactoring this code in order to provide a much better interface to the crawler.
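To make the difference concrete, here is a minimal sketch of a round loop that can be bounded and that reports progress. It is not the code in `crawl_runners.py`; `run_rounds`, `run_one_round`, `max_rounds` and `on_progress` are hypothetical names chosen for illustration, and the sketch assumes that a single round can be invoked as a callable returning success or failure.

```python
from typing import Callable, Optional


def run_rounds(run_one_round: Callable[[int], bool],
               max_rounds: Optional[int] = None,
               on_progress: Optional[Callable[[int], None]] = None) -> int:
    """Run crawl rounds until max_rounds is reached or a round fails.

    run_one_round receives the 1-based round number and returns True on
    success. max_rounds=None keeps the open-ended behaviour described
    above. Returns the number of rounds that completed successfully.
    """
    if max_rounds is not None and max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    completed = 0
    while max_rounds is None or completed < max_rounds:
        if not run_one_round(completed + 1):
            break  # stop at the first failed round
        completed += 1
        if on_progress is not None:
            on_progress(completed)  # e.g. update a status field for the UI
    return completed
```

With `max_rounds=3` the loop stops after three successful rounds, and the `on_progress` hook is one possible place to record how many rounds have run so far.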

lewismc commented 9 years ago

Sounds good @brittainhard, if you need any assistance let me know. Having looked at some of the code, it appears that some additional arguments are required in order to express the number of rounds of fetching, rather than having the crawler enter a continuous crawl cycle.

brittainhard commented 9 years ago

That's the idea. So if you look here: https://github.com/memex-explorer/memex-explorer/blob/master/source/task_manager/crawl_tasks.py#L46-61

This is the new crawler code that is going to be integrated. It takes rounds as an argument and passes that argument to the subprocess. I plan on having a UI element that will allow a person to enter the number of rounds they want before starting the crawl. The default is one round (which reminds me, I need to change the argument to be an integer that is converted to a string, rather than a string).
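As a rough illustration of the integer-to-string point, the sketch below takes `rounds` as an integer, validates it, and converts it to a string only when building the argument list for the subprocess. This is not the code in `crawl_tasks.py`: `build_crawl_command` and its parameters are hypothetical, and the argument order shown for the crawl script is an assumption that differs between Nutch versions, so check it against the script actually in use.

```python
from typing import List


def build_crawl_command(seed_dir: str, crawl_dir: str, rounds: int = 1,
                        crawl_script: str = "bin/crawl") -> List[str]:
    """Build the argument list for one bounded crawl.

    rounds stays an integer in Python code and is converted to a string
    only here, because subprocess arguments must be strings.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(rounds, bool) or not isinstance(rounds, int):
        raise TypeError("rounds must be an integer")
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    # Assumed layout: <script> <seed dir> <crawl dir> <number of rounds>.
    return [crawl_script, seed_dir, crawl_dir, str(rounds)]
```

The resulting list could then be handed to `subprocess.Popen` or `subprocess.run`. Keeping the value an integer until this point means a number entered in the UI can be validated once, before the crawl starts.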

lewismc commented 9 years ago

+1

(Sent by email in reply to @brittainhard's comment of Thursday, May 7, 2015.)

chrismattmann commented 9 years ago

This new interface should match the REST API. @asitang, please work with @brittainhard on this.