nasa-jpl-memex / nutch-python

Python port of Nutch that allows controlling Apache Nutch via its REST API.
http://nutch.apache.org/
Apache License 2.0

Update example to enable two simul-crawls #15

Closed ahmadia closed 8 years ago

ahmadia commented 8 years ago

This is based on work done in https://issues.apache.org/jira/browse/NUTCH-2132

In particular, this relies on the second patch implementing configurable routing keys.

ahmadia commented 8 years ago

[Image: simultaneous_nutch_crawls]

chrismattmann commented 8 years ago

@ahmadia I'm trying to figure out whether or not I should pull this example into https://github.com/chrismattmann/nutch-python.git. My feeling here is no, since this requires an AMQ server, consumer/producer code, and packages that you set up with conda, and we are trying to keep nutch-python minimal. I have cherry-picked:

[chipotle:~/git/nutch-python] mattmann% history | grep cherry-pick
    10  21:36   git cherry-pick e1c2c9ae0bf402574a237faa33c7c2a528e39cf1
    13  21:37   git cherry-pick 5abef32432c5fadaa553be48f9ef0c69f2bd9054
    14  21:37   git cherry-pick b0cdd1150915002671b652695a657ac0753fc6e5

And am pushing upstream to master on my origin repo. Thoughts?

ahmadia commented 8 years ago

I'm strongly opposed to cherry-picking commits. The problem is that a cherry-pick preserves the content of a change but not the commit itself: the result is a new commit with a different hash and a different parent. Once you've started cherry-picking you can only cherry-pick, so you're basically forking the two lines of development.
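To illustrate the point about commit identity, here is a minimal Python sketch. It assumes git's default SHA-1 object format, in which a commit id is the hash of a `commit <size>\0` header followed by the commit's fields. The helper name `commit_id`, the tree and parent hashes, and the author line are all made up for illustration. In a real cherry-pick the tree and committer timestamp usually differ too; the sketch holds them fixed to show that a different parent alone is enough to change the id.

```python
import hashlib


def commit_id(tree, parent, author, committer, message):
    """Hash commit fields the way git does for a SHA-1 commit object."""
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # Fields, a blank line, then the message terminated by a newline.
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    header = f"commit {len(body)}\0".encode("utf-8")
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical values, not taken from any real repository.
TREE = "a" * 40
PARENT_IN_REPO_A = "b" * 40
PARENT_IN_REPO_B = "c" * 40
WHO = "Example Dev <dev@example.com> 1445000000 +0000"

# Same content and message, but applied on top of different parents.
original = commit_id(TREE, PARENT_IN_REPO_A, WHO, WHO, "Fix bug")
picked = commit_id(TREE, PARENT_IN_REPO_B, WHO, WHO, "Fix bug")

print(original == picked)  # False: git sees two unrelated commits
```

Because the two ids differ, git has no record that the cherry-picked commit and the original are the same change, which is why later merges between the two lines cannot rely on shared commits past that point.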

I put the example into a subdirectory with its own environment.yml, it doesn't touch the top-level requirements of nutch-python at all, and the example itself is fairly small. The Python side of the code is also agnostic to the type of AMQP server running, so as soon as we have an ActiveMQ interface we can switch the conda requirements to that.

ahmadia commented 8 years ago

@chrismattmann - If you're really uncomfortable with the example as is, do you have suggestions for how we can keep it in the repository? It seems like overkill to create a new repository/package for it, so maybe a Gist?

chrismattmann commented 8 years ago

Oh yeah, I'm not uncomfortable at all, to be honest. I was just trying to figure out whether or not to include it in nutch-python - and it sounds like we should keep it here in memex-explorer downstream of that, but I don't think it needs to be upstream in my repo. Thanks.

As for cherry-picking, it's all I could do to figure out the right sequence of commits without pulling in the example, which I don't think needs to be in the nutch-python core library (I get it since we're using it in memex-explorer, but it doesn't need to be in the minimal Python lib). So I don't know of a different way to apply patches from what you did here (aka bug fixes and needed stuff). I'm also not opposed to cherry-picking, so to each their own :) :+1:

ahmadia commented 8 years ago

> and it sounds like we should keep it here in memex-explorer downstream of that, but I don't think it needs to be upstream in my repo. Thanks.

There is no downstream and upstream once you've started cherry-picking. The two repositories no longer share history, and the only way to move content between them is cherry-picking.

ahmadia commented 8 years ago

They effectively fork and become two different projects.

ahmadia commented 8 years ago

I'll just pull this example out of this repository. I'd rather have our lines of development synchronized than this example in here.

chrismattmann commented 8 years ago

@ahmadia with history rewrite and all the other features of git, making absolutist statements like "they are two different projects" and "they no longer share history", when they've shared history to this point, is unhelpful.

Regardless, yeah I think the thing is you are thinking about nutch-python from a conda perspective which can include all sorts of other whiz-bang packages and dependencies which is great. For me I see nutch-python as a small pip library/thin REST client to Nutch. Modeling after Brian Wilson's mantra, "zero install, baby!"

So anyways, yeah, we'll keep it in sync. This is a social thing rather than a technical thing. I'm not looking for the technical answer here - it sounds like we will commit socially to working together and making sure the stuff is in sync. That's all I was looking for. You've been doing amazing work! I really love the bokeh viz and streaming stuff. To be honest, I'm more upstream in nutch-python internals and haven't had a chance to test it at that level. I need to find some time for that. Hopefully by the QPR. Cheers.

ahmadia commented 8 years ago

Sorry @chrismattmann - I meant they technically no longer share history, not socially :) Easier to show with a diagram when we're in person. Let's table the discussion until then.