hortonworks-spark / spark-atlas-connector

A Spark Atlas connector to track data lineage in Apache Atlas
Apache License 2.0

Integration with external Hive Metastore (HMS) and collaboration #260

Open mwiewior opened 5 years ago

mwiewior commented 5 years ago

Hi All, we at ING are working on integrating SAC with our internal data analytics platform, where Atlas is one of our main components for storing various kinds of metadata, metrics, quality information, etc. Right now we are struggling with the proper integration of HMS and SAC to get column-level lineage. After some of the most recent PRs were merged, we faced 3 main issues that we would like to discuss with you and ask for assistance in setting up/implementing correctly:

  1. How to properly integrate the HMS hook with SAC. Is HMS 3.x supported, and do you maybe have some kind of instructions on how to do it?
  2. Is it safe to assume that all dependent Hive entities are created before spark_process, and that we won't run into any race conditions?
  3. How about column entities: are there any plans for supporting them? Maybe we can contribute here. Apart from that:
  4. Storing dataset/column-level metrics: maybe you have something on your roadmap here as well, so that we can help with it or work on a solution together?

Since we would really like to get involved in SAC development, could we perhaps schedule a call with you, guys, so that we can elaborate more on what our plans and needs are and how we can join efforts on extending SAC?

Regards, Marek

HeartSaVioR commented 5 years ago

Hi, overall please refer to #262, which explains the Spark models vs. the Hive models.

  1. How to properly integrate the HMS hook with SAC. Is HMS 3.x supported, and do you maybe have some kind of instructions on how to do it?

Please refer to the doc below (the official Atlas doc) to set up the Hive hook: https://atlas.apache.org/Hook-Hive.html

If things are not working as expected, you may also want to add the configuration below to hive-site.xml. (Unfortunately, the HMS hook is only available as of Atlas 2.0, so you may need to consider upgrading, or try building Atlas manually and getting the artifacts from there.)

Name:  hive.metastore.event.listeners
Value: org.apache.atlas.hive.hook.HiveMetastoreHook
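For reference, a minimal sketch of how that property would look in hive-site.xml (name and value taken verbatim from above; the exact hook class may differ across Atlas versions, so double-check against the doc linked earlier):

```xml
<!-- hive-site.xml: register the Atlas HMS hook as a metastore event listener -->
<property>
  <name>hive.metastore.event.listeners</name>
  <value>org.apache.atlas.hive.hook.HiveMetastoreHook</value>
</property>
```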
  2. Is it safe to assume that all dependent Hive entities are created before spark_process, and that we won't run into any race conditions?

The query listener gets the event when the query is finished, so HMS always gets the chance to put entities into Atlas first. That said, HMS and SAC both write to Atlas asynchronously, so we might still wonder about race conditions. We haven't seen such a case yet, and if I'm not missing anything, the Atlas team is planning to deal with the "out of order" case. Once we put events into the ATLAS_HOOK topic in Kafka, we can always replay them, so it's not that critical. Let's defer this until we encounter an actual case.
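For context, here is a minimal sketch of Spark's QueryExecutionListener API (not SAC's actual implementation; the class name LineageListener is hypothetical), illustrating why SAC only sees a query after it has completed, by which point the underlying metastore operations have already gone through HMS:

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Hypothetical listener: Spark invokes these callbacks only after a query
// has finished (successfully or not), never while it is still running.
class LineageListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"query '$funcName' finished in ${durationNs / 1000000} ms")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"query '$funcName' failed: ${exception.getMessage}")
}

// Registration on an existing SparkSession:
// spark.listenerManager.register(new LineageListener)
```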

  3. How about column entities: are there any plans for supporting them?

If you meant tracking column entities for Spark models, we have no plan to restore them as of now. (They were dropped for technical reasons.)

If you meant supporting column "lineage" on "spark_process", we're planning to address it.

  4. Storing dataset/column-level metrics

We don't have any plan for this. For now, SAC is tightly coupled to our own ongoing milestone, so I'm not sure we can spend time reviewing further plans.

To be honest, I think we messed up the license-related side as well, given that I can't find any CLA requests on previous PRs from outside of CLDR (CLDR includes legacy HWX, for sure). Until we finally sort this out (which should take time), we may not be able to receive contributions from outside. (I think we should change the header to the CLDR one as well.)

bolkedebruin commented 5 years ago

Hi @HeartSaVioR

Thanks for your answers. I'm a bit surprised by your last remark, though. We'd like to cooperate, we have provided patches already, and I dare say we have the competency to do so, but what you say could trigger us to 'just fork' the project. I don't think that is in our mutual interest?

A CLA in Apache style is fine with us, so if you require one we are happy to sign it.

Cheers Bolke

HeartSaVioR commented 5 years ago

I'm sure all of you at ING have made great contributions. Thanks! I'm not talking about quality or anything like that. IANAL, but you can easily find other projects requiring a CLA, and it's all about legal issues.

I thought we had received a CLA when I merged your patches, since other folks had merged patches from you before. That's completely my bad. Sorry about that. I believe Cloudera has its own CLA. I could find one from old Livy, but I need to confirm it is still valid as of now. It will take some time to confirm and get back to you. Sorry for any inconvenience.

Btw, end users (and contributors) may need to know that SAC may not represent general usage, and a proposal may not be accepted if its direction is not in line with Cloudera's direction for SAC.

HeartSaVioR commented 5 years ago

UPDATE: we are working with the legal team to formalize a contribution guide with an ICLA/CCLA as well. Once it is ready, I'll post an update here. Thanks again for your patience.

HeartSaVioR commented 4 years ago

@mwiewior @bolkedebruin Sorry for taking so long!

We've worked with the legal team and now have official ICLA/CCLA forms. Given that you're contributing on behalf of ING, I believe we'd ask you to fill in the CCLA. Please make sure to fill in the project name as "Spark Atlas Connector".

Cloudera CCLA_25APR2018.pdf
Cloudera ICLA_25APR2018.pdf

Some guidance from the legal team: Schedule A must be filled in, and Schedule B should not apply.

As the document explains, please fill out the form and send it to cla@cloudera.com. It would be great if you could notify me after sending it. Thanks again!