mdrakiburrahman / rakirahman.me

💻Personal blog powered by Gatsby
https://www.rakirahman.me
MIT License

Hello. Idea for a sample pipeline #2

Closed evogelpohl closed 3 years ago

evogelpohl commented 3 years ago

I enjoyed reading your post on the Databricks Auto Loader feature. I'm tinkering with a similar pipeline to process Avro files coming from Azure Event Hub (Capture -> ADLS Gen2).

I'm not finding a lot of sample pipelines that follow the path I'm attempting.

  1. Use Databricks Auto Loader to create a DataFrame with readStream
  2. Isolate just the [Body] column from the Avro files, which is in binary format
  3. Use the from_avro function from pyspark.sql.avro.functions with an AVSC file (a plain schema file, not registered in a schema registry) to decode the Body payload
  4. Fully flatten the Body schema (it contains one nested struct)
  5. writeStream into Delta with Trigger=Once as a fully flattened table (-> then CREATE TABLE [...], optional)
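The five steps above could be sketched roughly as follows. This is a hedged, illustrative sketch, not a working demo: the Avro schema (`BODY_AVSC` with `deviceId`/`temperature`/`location` fields), the paths, and the `run_pipeline` helper are all hypothetical placeholders, and the job would still need a real Spark cluster, an actual Capture landing path, and a schema matching the real events.

```python
import json

# Hypothetical Avro schema for the Event Hub Capture "Body" payload.
# The field names are illustrative only; a real pipeline would load
# the actual .avsc file that describes the events.
BODY_AVSC = json.dumps({
    "type": "record", "name": "Telemetry",
    "fields": [
        {"name": "deviceId", "type": "string"},
        {"name": "temperature", "type": "double"},
        {"name": "location", "type": {            # one nested struct (step 4)
            "type": "record", "name": "Location",
            "fields": [
                {"name": "lat", "type": "double"},
                {"name": "lon", "type": "double"},
            ]}},
    ],
})

def flatten_fields(avro_schema, prefix=""):
    """Walk an Avro record schema and return the dotted column paths
    needed to fully flatten any nested structs."""
    cols = []
    for field in avro_schema["fields"]:
        ftype = field["type"]
        path = f"{prefix}{field['name']}"
        if isinstance(ftype, dict) and ftype.get("type") == "record":
            cols.extend(flatten_fields(ftype, prefix=path + "."))
        else:
            cols.append(path)
    return cols

def run_pipeline(spark, source_path, checkpoint_path, target_path):
    """Steps 1-5 as one Auto Loader -> Delta job. All paths are placeholders."""
    from pyspark.sql.functions import col
    from pyspark.sql.avro.functions import from_avro

    df = (spark.readStream
          .format("cloudFiles")                        # 1. Auto Loader
          .option("cloudFiles.format", "avro")
          .load(source_path))

    # 2 & 3. Keep only the binary Body column and decode it with the AVSC schema
    decoded = df.select(from_avro(col("Body"), BODY_AVSC).alias("Body"))

    # 4. Fully flatten: one aliased column per leaf field
    flat = decoded.select(
        *[col(f"Body.{p}").alias(p.replace(".", "_"))
          for p in flatten_fields(json.loads(BODY_AVSC))])

    return (flat.writeStream                           # 5. Delta, Trigger=Once
            .format("delta")
            .option("checkpointLocation", checkpoint_path)
            .trigger(once=True)
            .start(target_path))
```

With the sample schema above, `flatten_fields` would yield `["deviceId", "temperature", "location.lat", "location.lon"]`, which becomes the flattened select list.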

I suspect many would benefit from a working demo, as Azure Event Hub with Capture is a common pattern. If you're so inclined to make one, then thanks in advance. -EV

mdrakiburrahman commented 3 years ago

@Evogelpohl - thanks for the fantastic idea. I've had this on my to-do list as well, but haven't had a chance to implement it yet. Pulling this up the list - I'll tag you here with the post once I have something working.

mdrakiburrahman commented 3 years ago

@Evogelpohl - here's the article that covers this topic.

Thanks for the idea once again, and feel free to reopen this issue if you have any questions!