nasa-jpl-memex / nutch-python

Python port of Nutch that allows controlling Apache Nutch via its REST API.
http://nutch.apache.org/
Apache License 2.0
5 stars 2 forks source link

Refactor and update to new 1.x API #3

Closed ahmadia closed 9 years ago

ahmadia commented 9 years ago

Verified functionality:

Unverified functionality:

ahmadia commented 9 years ago

cc @chrismattmann. I'll make better progress now that I've got a better handle on this code and its interactions with the server.

chrismattmann commented 9 years ago

Overall great work. I think you can commit this, but I'd appreciate making crawl not interactive. Can you also issue the same PR to my repo, or I can do that too I'm trying to keep them in sync, but maintain a delineation for my class.

ahmadia commented 9 years ago

If you don't mind, I'll get a few more commits in so that it's more or less full-featured before I merge in.

ahmadia commented 9 years ago

Probably by the middle of next week.

chrismattmann commented 9 years ago

I'm going to go ahead and commit what you have here to my chrismattmann:master. Makes it easier to play with it and test.

chrismattmann commented 9 years ago

I made some fixes in https://github.com/memex-explorer/nutch-python/commit/0bc86d2f9c422abf112a398d25263871b376ccb5

ahmadia commented 9 years ago

Thanks!

ahmadia commented 9 years ago

Another refactor, this time I'm adding a low-level Server class to replace the callServer function. I'm also trying to make the interface look more like elasticsearch-py, where possible.

chrismattmann commented 9 years ago

Quick thoughts - why keep refactoring? I mean it was functional right?

ahmadia commented 9 years ago

It was really hard to work with from Python, and there were a number of things that weren't working due to the changes in the REST API.

chrismattmann commented 9 years ago

Hmm, well I think these are two orthogonal things. If there were bugs we fix those bugs. But refactoring it to look like ESpy is aesthetic and I'm not getting it and unfortunately not supportive of doing that. I'd rather it look like tika-python and thus callServer I think is something we should keep around, etc. We can talk about this more at the hackathon.

ahmadia commented 9 years ago

A simple case is the output from listing /jobs. In the current version, that returns a bunch of JSON and a status code. In the refactored version, a non-200 status raises an exception and the response is converted into a list of Job objects, which in turn have methods associated with them.

If you want to keep callServer around, I can provide a similar interface that wraps the Server.call() function (or alternatively, pull callServer back out and have Server.call() wrap it). My main motivations for providing a Server object is that I want to be able to pass information about how to call Nutch around between the various sub-objects, and I wanted to include bits of context like "raise errors on non-200 status codes" that could be easily set in the top-level Nutch object then forgotten about.

Some of this is aesthetic, and some of it has to do with encapsulating the JSON into usable code. Let's chat more about this at the hackathon.

ahmadia commented 9 years ago

@chrismattmann - Are you still feeling unhappy with this? I can pull the callServer code back out if it's the only barrier to merging. Otherwise I can keep working on this branch and we can discuss at the hackathon. I'm fine with merging in whatever functionality you like and discarding the rest.

ahmadia commented 9 years ago

@chrismattmann - this is ready for review/merge.

chrismattmann commented 9 years ago

I love this. I am +1 to merge, and thus am merging it. You are awesome @ahmadia thank you.