locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io

Use of internal accumulo API #1026

Closed jdries closed 9 years ago

jdries commented 9 years ago

In BatchAccumuloInputFormat, these lines use internal Accumulo API (CredentialHelper and TCredentials), which is no longer available in Accumulo 1.6:

```scala
val token = CredentialHelper.extractToken(tokenClass, tokenBytes)
val credentials = new TCredentials(principal, tokenClass, ByteBuffer.wrap(tokenBytes), instance.getInstanceID)
```

A similar issue seems to occur in MultiRangeInputSplit.
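
For illustration, a minimal sketch of what a 1.6-compatible version of those two lines might look like; this assumes Accumulo 1.6's public AuthenticationTokenSerializer plus the (still impl-package) Credentials class, and reuses the variable names from the snippet above:

```scala
import org.apache.accumulo.core.client.Instance
import org.apache.accumulo.core.client.impl.Credentials
import org.apache.accumulo.core.client.security.tokens.AuthenticationToken.AuthenticationTokenSerializer
import org.apache.accumulo.core.security.thrift.TCredentials

// Sketch only: deserialize the token with the public 1.6 API instead of
// CredentialHelper, then let Credentials build the thrift credentials.
def toThriftCredentials(
    principal: String,
    tokenClass: String,
    tokenBytes: Array[Byte],
    instance: Instance): TCredentials = {
  val token = AuthenticationTokenSerializer.deserialize(tokenClass, tokenBytes)
  new Credentials(principal, token).toThrift(instance)
}
```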

echeipesh commented 9 years ago

Thank you for pointing that out. What we are doing here is using internal implementation details to work around an issue. Even the TabletLocator, the core piece, is internal as of 1.6. It looks like tabletLocator.binRanges has changed signatures between 1.5 and 1.6, so we have to pick a version to support.

I am currently working on a PR to try to get this functionality into Accumulo 1.6.2 as an optional optimization; it remains to be seen whether it will make it in: https://issues.apache.org/jira/browse/ACCUMULO-3602

Until that gets done we have to pick a version to support. Originally I picked 1.5.2 in preparation to collaborate with GeoMesa folks. @jnh5y, what is your plan and timeline on upgrading to 1.6?

@jdries what are your requirements? I assume you already have 1.6 deployed?

jnh5y commented 9 years ago

@echeipesh I was actually meaning to mention we tried out this code with Great Success on Accumulo 1.5.1.

For GeoMesa, I don't think we can dictate that anyone use Accumulo 1.6.2, so I'm wondering if I could ask you to commit the code to GeoMesa in an Accumulo utilities sub-project at some point.

lossyrob commented 9 years ago

@jnh5y @echeipesh I think that's a great solution; we could upgrade to 1.6.2, and then if we need to support 1.5.1 because of GeoMesa collab, then the Accumulo code would already be in GeoMesa ready for use.

jnh5y commented 9 years ago

Thanks guys! I'm glad my ask wasn't too unreasonable. Generally, I'd love to see our improvements in the upstream projects. That said, while under development, I think they may need to live in our codebases.

echeipesh commented 9 years ago

@jnh5y sure thing, what exactly do you need? It sounds like you already adapted it for your use? What are your improvements?

I think the reasonable thing here is to upgrade our code to 1.6.2. When/if the PR makes it in, we will remove it altogether.

When we need to collaborate, we'll have to break out the Accumulo code into yet another subproject, geotrellis-spark-accumulo-{1.5.2, 1.6.2}, and version it so it can be pulled in with the appropriate version of Accumulo.
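
To make that concrete, a rough build sketch (module names and the system-property switch are hypothetical, just one way to do it) of how the subproject could be published against either Accumulo series:

```scala
// build.sbt sketch: choose the Accumulo release at build time and encode it in
// the artifact name, so downstream projects can pull in the matching version.
lazy val accumuloVersion = sys.props.getOrElse("accumulo.version", "1.6.2")

lazy val sparkAccumulo = (project in file("spark-accumulo"))
  .settings(
    name := s"geotrellis-spark-accumulo-$accumuloVersion",
    libraryDependencies ++= Seq(
      "org.apache.accumulo" % "accumulo-core" % accumuloVersion,
      "org.apache.spark"    %% "spark-core"   % "1.2.1" % "provided"
    )
  )
```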

jnh5y commented 9 years ago

@echeipesh Sadly, I don't think I have any improvements to offer quite yet. So far, I just copied the code into a branch and passed it off to a co-worker for his project.

For IP reasons, I'm supposed to get you to commit the code. I'm scrambling on a bunch of other things, and I have no pressing need. It can be whenever; if the code is still in progress, I'm happy to wait.

jdries commented 9 years ago

Hi, I was simply playing around with geotrellis on our existing development cluster. I installed Accumulo from Cloudera parcels, and 1.6 seemed to be the only available version.

echeipesh commented 9 years ago

That brings up an excellent point. Our plan is to target Cloudera and later HDP platforms. If they're going to support Accumulo 1.6, we need to support it as well. I'll put up a PR to address this issue, and I'll submit a PR to GeoMesa for 1.5.

mparker4 commented 9 years ago

This module (spark/accumulo connector code) is exactly what I've been looking for for a while, but I need it to work on 1.6. I ended up writing my own, but this looks more full-featured. Any thoughts on totally disconnecting the spark/accumulo connector code from geotrellis and making it its own project? Then anyone could include it and it could be versioned separately. Also, it could be posted at http://spark-packages.org/ for visibility (hbase has multiple connectors there).

echeipesh commented 9 years ago

@mparker4 What aspects are you interested in? We talked about it originally, so separating the relevant code out could be on the table.

I assume you are talking about the general problem of pushing an RDD[T] to Accumulo and getting it back again. There would be some trickery, as the query tools are pretty low level. We have coded an interface that makes sense for the problem of storing tiles; that would need to be generalized.
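
As a sketch of what "generalized" might mean here (the trait and method names are made up for illustration, not GeoTrellis's actual interface): the caller supplies the mapping between records and Accumulo rows, and the connector handles the Spark I/O.

```scala
import org.apache.accumulo.core.data.{Key, Mutation, Value, Range => AccumuloRange}
import org.apache.spark.rdd.RDD

// How a record type T is laid out in an Accumulo table.
trait AccumuloCodec[T] {
  def encode(record: T): Mutation
  def decode(key: Key, value: Value): T
}

// A tile-agnostic version of the store/read interface: any T with a codec.
trait AccumuloConnector {
  def save[T: AccumuloCodec](table: String, records: RDD[T]): Unit
  def load[T: AccumuloCodec](table: String, ranges: Seq[AccumuloRange]): RDD[T]
}
```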

mparker4 commented 9 years ago

@echeipesh basically making a general-purpose connector from Accumulo to RDD. Obviously we can all just use Hadoop input formats, but it's messy and not in the style of scala/spark. In some cases it's also very inefficient (which is why you have done some of the extra work you have). Basically, have a one-stop shop for all things Accumulo and Spark connection related, so they are captured in one place. Then wrap it all up with a nice api. Versioning is also easier because you can have different versions available and just call the appropriate one as a dependency. Individual projects can extend as needed in their project code.
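
For context, the "messy" route being referred to looks roughly like this (instance name, credentials, and table are placeholders; it wires Accumulo's AccumuloInputFormat through Spark's newAPIHadoopRDD):

```scala
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

// Wiring the Hadoop input format by hand: verbose, stringly-typed configuration,
// and raw Key/Value pairs instead of domain objects.
def rawLoad(sc: SparkContext) = {
  val job = Job.getInstance(sc.hadoopConfiguration)
  AccumuloInputFormat.setConnectorInfo(job, "root", new PasswordToken("secret"))
  AccumuloInputFormat.setZooKeeperInstance(job, "accumulo", "zookeeper:2181") // deprecated in 1.6 in favor of ClientConfiguration
  AccumuloInputFormat.setInputTableName(job, "tiles")
  sc.newAPIHadoopRDD(
    job.getConfiguration,
    classOf[AccumuloInputFormat],
    classOf[Key],
    classOf[Value])
}
```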

I'm just envisioning something like other formats have here: http://spark-packages.org/, and hopefully (eventually) even more full featured and as efficient as possible for different use cases.

mparker4 commented 9 years ago

@echeipesh Just FYI, I'll be out of town/offline for a while. I'll check back when I get back.

lossyrob commented 9 years ago

@echeipesh what's the state of this issue? Can it be closed?

echeipesh commented 9 years ago

We must use the internal API for what we are doing. We upgraded the dependency to Accumulo 1.6, which addresses this issue. In the future the io packages will need to be broken out into sub-projects, so supporting multiple versions of Accumulo could be on the table.