rmetzger / stratosphere-sql

My private playground to develop SQL support on Stratosphere
Apache License 2.0

Add Schema for AVRO data sources. #11

Closed · zerolevel closed this 10 years ago

zerolevel commented 10 years ago

Hi folks,

I have added the pull request for AVRO data sources.

Please review and comment.

Best, Mohit

rmetzger commented 10 years ago

Cool! Would you mind adding a little unit test for that? Add a very simple Avro file to the resources directory and access the table in a SQL query.

rmetzger commented 10 years ago

For the whole integration with the rest of the framework: ideally, the user should be able to configure the Stratosphere SQL interface using a directory with JSON files. Camelia added support for JSON parsing for CSV tables. It would be cool if users could define the Avro parser in a JSON file, e.g.:

```json
{
  "type": "avro",
  "name": "customers_simple",
  "filePath": "/home/robert/Projekte/ozone/stratosphere-sql/src/main/resources/sampleTables/simple.avro"
}
```

I think we should also add a "type": "csv" for the CSV reader.
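
By analogy, a CSV table definition might then look something like the sketch below. The "type", "name", and "filePath" fields mirror the Avro example above; the "columns" field is purely hypothetical, added here because a CSV file carries no embedded schema, so the column names and types would have to live in the JSON:

```json
{
  "type": "csv",
  "name": "customers_csv",
  "filePath": "src/main/resources/sampleTables/customers.csv",
  "columns": [
    { "name": "id", "type": "INTEGER" },
    { "name": "name", "type": "VARCHAR" }
  ]
}
```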

zerolevel commented 10 years ago

Hi Robert. I will add the unit test; it might take a while because I will be commuting home in the meantime. Regarding your suggestion above, would you mind looking at http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/

Using the approach described there, I have created a few Avro files. Note that an Avro file contains both the schema and the data in the same file (it is a serialization format used by several big-data tools).

Given that, I am still trying to understand your suggestion about the JSON parsing you mentioned.

fhueske commented 10 years ago

Hi,

the idea is to have a single directory where the system can find all its table metadata, i.e., the information about all registered tables. Camelia added support to define a table in JSON. A JSON file contains the schema of the table (the attributes (columns), their names, and data types) and the location of the data. At start-up, the system goes to the metadata directory and parses all JSON files in there. The table metadata contained in the JSON files is then registered in the system such that queries can be run against the registered tables.

To make your work fit into the system, we need a JSON file (maybe with the schema that Robert suggested) that tells the system where to look for the data, i.e., the path to the Avro file. At start-up, the system will read the JSON file, find out that it refers to an Avro file, and your schema parser will then be able to access the file, read the schema, and register it in the system.
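
A minimal sketch of that start-up scan, assuming hypothetical class and method names (only the JSON fields "type", "name", and "filePath" come from Robert's example above; Jackson is used here just for illustration):

```java
// Hedged sketch of the start-up metadata scan described above.
// TableRegistry and the register* methods are hypothetical names;
// only the JSON fields ("type", "name", "filePath") come from the
// example earlier in this thread.
import java.io.File;
import java.io.FilenameFilter;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TableRegistry {
    private final ObjectMapper mapper = new ObjectMapper();

    public void registerTables(File metadataDir) throws Exception {
        File[] jsonFiles = metadataDir.listFiles(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                return name.endsWith(".json");
            }
        });
        for (File jsonFile : jsonFiles) {
            JsonNode table = mapper.readTree(jsonFile);
            String type = table.get("type").asText();
            String name = table.get("name").asText();
            String path = table.get("filePath").asText();
            if ("avro".equals(type)) {
                registerAvroTable(name, path);        // schema is read from the Avro file itself
            } else if ("csv".equals(type)) {
                registerCsvTable(name, path, table);  // schema must come from the JSON definition
            }
        }
    }

    private void registerAvroTable(String name, String path) { /* ... */ }

    private void registerCsvTable(String name, String path, JsonNode def) { /* ... */ }
}
```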

zerolevel commented 10 years ago

Hi, thanks for your comment. The pull request I have added gets the schema from the Avro file.

In contrast, the CsvSchema class reads the JSON file in the resources/jsonSchemas folder and then associates it with a CSV file from resources/sampleTables.

In Avro, the data and the schema are present in the same (.avro) file, and the current AvroSchema code only reads the schema (that is what I intended to write).
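
For reference, here is a minimal sketch of how both the embedded schema and the records can be read from a .avro file with the plain Avro Java API (the path is just the sample file path from earlier in this thread):

```java
// Minimal sketch: read the embedded schema (and the records) from a
// .avro file with the standard Avro Java API.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroSchemaDemo {
    public static void main(String[] args) throws Exception {
        File avroFile = new File("src/main/resources/sampleTables/simple.avro");
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
            // The schema is stored in the file header, next to the data.
            Schema schema = reader.getSchema();
            System.out.println(schema.toString(true));
            // The same reader iterates over the serialized records.
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```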

I could also add support for reading the data from the Avro source, but that might require some time. Given the deadline (GSoC '14), what do you suggest I do?

zerolevel commented 10 years ago

Hi, sorry! I closed the issue instead of pressing "Comment", so I have reopened it. :)

rmetzger commented 10 years ago

No probs ;)

You are right, the Avro files already contain a schema. The reason why we want a dedicated JSON file describing the location of the Avro file is the following: usually, in "big data" processing systems, the Avro files (or other files) are not located on the machine where you actually submit the SQL query. The files are usually in HDFS. Therefore, we need a JSON file that says a) it is of type "avro" and b) it is located at "hdfs://namenode-host:8031/data-warehouse/data/customers.avro".
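
Combined with the format from the earlier example, such a definition might look like this (the table name is made up):

```json
{
  "type": "avro",
  "name": "customers",
  "filePath": "hdfs://namenode-host:8031/data-warehouse/data/customers.avro"
}
```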

But that might require some time. Given the deadline (GSoC '14), what do you suggest I do?

Try to do as much as possible. There is/was time from 24 February until 21 March to convince us that you are the right candidate. On 21 March, we'll look at what the students did and evaluate it. It is totally fine to submit work in progress. It's all about seeing your coding skills, your will to contribute, etc. (I know all this is a bit hard, and believe me, the choice is not easy for us. But we need some kind of measurement to rank and evaluate the students. And I think our rules are very clear and equal for all applying students.)

zerolevel commented 10 years ago

Hi Robert,

Thanks for the reply.

I have also created the first draft of my project proposal; it can be found at https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Mohit-Daga

Please suggest changes. This is in line with the discussion we had on the mailing list.

rmetzger commented 10 years ago

Hi,

some comments:

[Required] Write an Optiq adapter for HCatalog. [Required] Follow the steps given in the Optiq tutorial at [5]

You don't really write an adapter against Optiq in the way it is described in the Optiq tutorial. The adapter tutorials from Optiq assume a different execution engine. Therefore, many of the things there do not apply to our project, since we have our own execution engine.

The information you've extracted from the Avro schema is the same information you need to provide to our framework, just from HCatalog. It is probably pretty straightforward to add HCatalog support to Stratosphere SQL.

Writing good ORC and Parquet support is probably more demanding, since you have to go into the depths of data serialization etc.

[Required] Support RC files and integrate them with the Stratosphere system.

Replace RC by ORC.

Can we do the goals as follows:

Required for mid-terms:

Final requirements: Required:

@fhueske: Can you also comment on this?

zerolevel commented 10 years ago

Hi Robert. Thanks for the quick comment.

I got your point. In fact, we are doing things along the same lines as Optiq with Stratosphere, but the execution engine and the Stratosphere classes have changed. I believe your directions and those of the other developers will be required all the way through the coding period and even until the end (if I get selected ;) ).

rmetzger commented 10 years ago

I believe your directions and those of the other developers will be required all the way through the coding period and even until the end (if I get selected ;) ).

That's true. As mentioned on the idea page, we try to move fast with the SQL interface. Also, we cannot really plan how things will go in the future, since we are at a very early stage. So good communication is certainly required for this GSoC task.

In fact, we are doing things along the same lines as Optiq with Stratosphere, but the execution engine and the Stratosphere classes have changed.

Optiq ships with an execution engine called linq4j. We are using our own execution engine. We use Optiq for query parsing, validation, rewriting, and optimization. Optiq gives us a tree of physical operators that Stratosphere can execute. We are using custom rules to translate Optiq operators into Stratosphere operators.
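
Schematically, such a translation rule follows Optiq's ConverterRule pattern. The sketch below is only illustrative: the Stratosphere-side names (StratosphereRel.CONVENTION, StratosphereFilter) are hypothetical, and the exact Optiq signatures of that era may differ slightly.

```java
// Illustrative sketch of a rule that converts an Optiq filter operator
// into a hypothetical Stratosphere operator. StratosphereRel.CONVENTION
// and StratosphereFilter are invented names for this example.
import org.eigenbase.rel.FilterRel;
import org.eigenbase.rel.RelNode;
import org.eigenbase.rel.convert.ConverterRule;
import org.eigenbase.relopt.Convention;

public class StratosphereFilterRule extends ConverterRule {

    public StratosphereFilterRule() {
        // Match logical filters with no particular calling convention and
        // convert them to the Stratosphere calling convention.
        super(FilterRel.class, Convention.NONE, StratosphereRel.CONVENTION,
              "StratosphereFilterRule");
    }

    @Override
    public RelNode convert(RelNode rel) {
        FilterRel filter = (FilterRel) rel;
        // Re-create the operator on the Stratosphere side; the optimizer
        // recursively converts the filter's input as well.
        return new StratosphereFilter(
            filter.getCluster(), filter.getChild(), filter.getCondition());
    }
}
```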

fhueske commented 10 years ago

Hi,

I would draft the schedule as follows:

  1. Integrate HCatalog with Optiq. The goal is that Optiq gets the metadata from HCatalog (see the sketch after this list). @rmetzger correct me if I am wrong, but this would not touch Stratosphere at all. It might even be that this code is already around somewhere (the new Hive (Stinger) optimizer uses Optiq, so there might be code for HCatalog integration). If this code is not available, we might even consider contributing it to Optiq. This whole task should not be too hard, as Robert said.
  2. Integration of file formats (ORC, Parquet, RC, ...) with Stratosphere-SQL (and Stratosphere). This can be done one format at a time, so you do not need to understand all these formats before you can start working on HCatalog.
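
As a hedged sketch of step 1, reading table metadata from HCatalog might look roughly like this (the package names are those of HCatalog around 2014 and may differ between versions; "default" and "customers" are made-up database and table names):

```java
// Hedged sketch: fetch table metadata from HCatalog so it can be
// registered with Optiq.
import org.apache.hadoop.conf.Configuration;
import org.apache.hive.hcatalog.api.HCatClient;
import org.apache.hive.hcatalog.api.HCatTable;
import org.apache.hive.hcatalog.data.schema.HCatFieldSchema;

public class HCatMetadataDemo {
    public static void main(String[] args) throws Exception {
        HCatClient client = HCatClient.create(new Configuration());
        HCatTable table = client.getTable("default", "customers");
        // Column names and types are exactly the metadata Optiq needs
        // to register a table for SQL queries.
        for (HCatFieldSchema col : table.getCols()) {
            System.out.println(col.getName() + " : " + col.getTypeString());
        }
        client.close();
    }
}
```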

I did not really get what you meant by "(Optional) Support Stratosphere-SQL". I thought your project is about HCatalog and InputFormat support for Stratosphere-SQL.

Regarding the different Optiq adapters, they are on very different levels. CSV is an input adapter to get metadata and ingest data; Cascading uses Optiq the way we plan to use it, as an optimizer to generate execution plans that are executed on Cascading; MongoDB is another storage adapter like CSV.

On a side note, it is always good to run a spell checker for application documents. Don't worry, I am not pedantic about that, but typos are just unnecessary and distracting. :smirk:

zerolevel commented 10 years ago

Hi,

I did not really get what you meant by "(Optional) Support Stratosphere-SQL". I thought your project is about HCatalog and InputFormat support for Stratosphere-SQL.

I meant supporting the issues of the stratosphere-sql repository. I have modified that entry to read "Help on minor tasks on the SQL interface."

I believe this clarifies it.

fhueske commented 10 years ago

Yes it does. Thank you! :smile:

rmetzger commented 10 years ago

Exactly. As far as I understand the idea of GSoC, students have their own project that they work on. But we should also try to engage students to participate in the development community. And I think fixing minor issues, helping with documentation, user support, etc. is a great way to do that.

rmetzger commented 10 years ago

@fhueske: thanks for reviewing. Exactly, you don't have to touch Stratosphere at all for the metadata.

zerolevel commented 10 years ago

see #22