pinot-contrib / pinot-docs

Apache Pinot Documentation

Add info on specifying avro schema location #227

Closed matthewhelmke closed 11 months ago

matthewhelmke commented 11 months ago

This fixes https://github.com/apache/pinot/issues/9990

This is not an ideal fix, but it is adequate as an MVP en route to planned future work on cleaning up this section of the documentation. I'm pushing a kludge rather than leaving a gap, but I consider this a temporary fix.

ksnijjer commented 6 months ago

@matthewhelmke I just checked the docs at https://docs.pinot.apache.org/basics/data-import/pinot-stream-ingestion/import-from-apache-kafka#tell-pinot-where-to-find-an-avro-schema and we need to reword that section a little more clearly. A few suggestions for clarity:
- A more appropriate section name could be "How to configure Avro Schema manually if not using Schema Registry" instead of the current one.
- The steps would be: use the standalone tool to generate the config, or manually create the JSON blurb (escaping quotes etc.).
- In the config, let's provide a real example (we have it in QuickStart): "stream.kafka.decoder.prop.schema": "{\"type\":\"record\",\"name\":\"Flight\",\"namespace\":\"pinot\",\"fields\":[{\"name\":\"DaysSinceEpoch\",\"type\":[\"int\"]},{\"name\":\"Year\",\"type\":[\"int\"]},{\"name\":\"Quarter\",\"type\":[\"int\"]},{\"name\":\"Month\",\"type\":[\"int\"]},{\"name\":\"DayofMonth\",\"type\":[\"int\"]},{\"name\":\"DayOfWeek\",\"type\":[\"int\"]},{\"name\":\"FlightDate\",\"type\":[\"string\"]},{\"name\":\"UniqueCarrier\",\"type\":[\"string\"]},{\"name\":\"AirlineID\",\"type\":[\"int\"]},{\"name\":\"Carrier\",\"type\":[\"string\"]},{\"name\":\"TailNum\",\"type\":[\"string\",\"null\"]},{\"name\":\"FlightNum\",\"type\":[\"int\"]},{\"name\":\"OriginAirportID\",\"type\":[\"int\"]},{\"name\":\"OriginAirportSeqID\",\"type\":[\"int\"]},{\"name\":\"OriginCityMarketID\",\"type\":[\"int\"]},{\"name\":\"Origin\",\"type\":[\"string\"]},{\"name\":\"OriginCityName\",\"type\":[\"string\"]},{\"name\":\"OriginState\",\"type\":[\"string\"]},{\"name\":\"OriginStateFips\",\"type\":[\"int\"]},{\"name\":\"OriginStateName\",\"type\":[\"string\"]},{\"name\":\"OriginWac\",\"type\":[\"int\"]},{\"name\":\"DestAirportID\",\"type\":[\"int\"]},{\"name\":\"DestAirportSeqID\",\"type\":[\"int\"]},{\"name\":\"DestCityMarketID\",\"type\":[\"int\"]},{\"name\":\"Dest\",\"type\":[\"string\"]},{\"name\":\"DestCityName\",\"type\":[\"string\"]},{\"name\":\"DestState\",\"type\":[\"string\"]},{\"name\":\"DestStateFips\",\"type\":[\"int\"]},{\"name\":\"DestStateName\",\"type\":[\"string\"]},{\"name\":\"DestWac\",\"type\":[\"int\"]},{\"name\":\"CRSDepTime\",\"type\":[\"int\"]},{\"name\":\"DepTime\",\"type\":[\"int\",\"null\"]},{\"name\":\"DepDelay\",\"type\":[\"int\",\"null\"]},{\"name\":\"DepDelayMinutes\",\"type\":[\"int\",\"null\"]},{\"name\":\"DepDel15\",\"type\":[\"int\",\"null\"]},{\"name\":\"DepartureDelayGroups\",\"type\":[\"int\",\"null\"]},{\"name\":\"DepTimeBlk\",\"type\":[\"string\"]},{\"name\":\"TaxiOut\",\"type\":[\"int\",\"null\"]},{\"name\":\"WheelsOff\",\"type\":[\"int\",\"null\"]},{\"name\":\"WheelsOn\",\"type\":[\"int\",\"null\"]},{\"name\":\"TaxiIn\",\"type\":[\"int\",\"null\"]},{\"name\":\"CRSArrTime\",\"type\":[\"int\"]},{\"name\":\"ArrTime\",\"type\":[\"int\",\"null\"]},{\"name\":\"ArrDelay\",\"type\":[\"int\",\"null\"]},{\"name\":\"ArrDelayMinutes\",\"type\":[\"int\",\"null\"]},{\"name\":\"ArrDel15\",\"type\":[\"int\",\"null\"]},{\"name\":\"ArrivalDelayGroups\",\"type\":[\"int\",\"null\"]},{\"name\":\"ArrTimeBlk\",\"type\":[\"string\"]},{\"name\":\"Cancelled\",\"type\":[\"int\"]},{\"name\":\"CancellationCode\",\"type\":[\"string\",\"null\"]},{\"name\":\"Diverted\",\"type\":[\"int\"]},{\"name\":\"CRSElapsedTime\",\"type\":[\"int\",\"null\"]},{\"name\":\"ActualElapsedTime\",\"type\":[\"int\",\"null\"]},{\"name\":\"AirTime\",\"type\":[\"int\",\"null\"]},{\"name\":\"Flights\",\"type\":[\"int\"]},{\"name\":\"Distance\",\"type\":[\"int\"]},{\"name\":\"DistanceGroup\",\"type\":[\"int\"]},{\"name\":\"CarrierDelay\",\"type\":[\"int\",\"null\"]},{\"name\":\"WeatherDelay\",\"type\":[\"int\",\"null\"]},{\"name\":\"NASDelay\",\"type\":[\"int\",\"null\"]},{\"name\":\"SecurityDelay\",\"type\":[\"int\",\"null\"]},{\"name\":\"LateAircraftDelay\",\"type\":[\"int\",\"null\"]},{\"name\":\"FirstDepTime\",\"type\":[\"int\",\"null\"]},{\"name\":\"TotalAddGTime\",\"type\":[\"int\",\"null\"]},{\"name\":\"LongestAddGTime\",\"type\":[\"int\",\"null\"]},{\"name\":\"DivAirportLandings\",\"type\":[\"int\"]},{\"name\":\"DivReachedDest\",\"type\":[\"int\",\"null\"]},{\"name\":\"DivActualElapsedTime\",\"type\":[\"int\",\"null\"]},{\"name\":\"DivArrDelay\",\"type\":[\"int\",\"null\"]},{\"name\":\"DivDistance\",\"type\":[\"int\",\"null\"]},{\"name\":\"DivAirports\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},{\"name\":\"DivAirportIDs\",\"type\":{\"type\":\"array\",\"items\":\"int\"}},{\"name\":\"DivAirportSeqIDs\",\"type\":{\"type\":\"array\",\"items\":\"int\"}},{\"name\":\"DivWheelsOns\",\"type\":{\"type\":\"array\",\"items\":\"int\"}},{\"name\":\"DivTotalGTimes\",\"type\":{\"type\":\"array\",\"items\":\"int\"}},{\"name\":\"DivLongestGTimes\",\"type\":{\"type\":\"array\",\"items\":\"int\"}},{\"name\":\"DivWheelsOffs\",\"type\":{\"type\":\"array\",\"items\":\"int\"}},{\"name\":\"DivTailNums\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},{\"name\":\"RandomAirports\",\"type\":{\"type\":\"array\",\"items\":\"string\"}}]}",

- Remove this wording: "Then add this key: "stream.kafka.decoder.prop.schema" followed by a value that denotes the location of your schema." The value of this config is expected to be the actual schema, not a location.
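
To illustrate the manual route mentioned above (creating the JSON blurb by hand, with quotes escaped), here is a minimal Python sketch. It assumes a shortened, hypothetical subset of the Flight schema, and the helper name `build_decoder_schema_entry` is made up for illustration; only the key `stream.kafka.decoder.prop.schema` comes from the discussion. It is not the standalone tool referred to above, just one possible way to produce the escaped value.

```python
import json

# Hypothetical, shortened Avro schema used only for illustration;
# the real QuickStart schema has many more fields.
avro_schema = {
    "type": "record",
    "name": "Flight",
    "namespace": "pinot",
    "fields": [
        {"name": "DaysSinceEpoch", "type": ["int"]},
        {"name": "Carrier", "type": ["string"]},
        {"name": "TailNum", "type": ["string", "null"]},
    ],
}


def build_decoder_schema_entry(schema: dict) -> dict:
    """Return a config entry whose value is the schema itself, as a string.

    The helper name is an assumption for this sketch, not a Pinot API.
    """
    # Compact separators keep the embedded schema on one line.
    schema_str = json.dumps(schema, separators=(",", ":"))
    return {"stream.kafka.decoder.prop.schema": schema_str}


entry = build_decoder_schema_entry(avro_schema)

# Serializing the outer object escapes the inner double quotes (\"),
# which is the form the config value needs when pasted into JSON.
print(json.dumps(entry, indent=2))
```

The value printed for the key is the escaped schema text itself rather than a path or URL, which matches the point that this config expects the actual schema.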

Can we please make these changes? cc @snleee