implement a VERY simple way to create table definitions

camelia-c commented 10 years ago

A JSON-based table definition, only for the CSVInputFormat. The user should be able to define a SQL queryable table in a JSON file by defining the file-path, record and field delimiter and the field types.

rmetzger commented 10 years ago

Sounds great. It would be cool if you could share the JSON file structure once you have it. I just want to have a look.

camelia-c commented 10 years ago

{ "fields": [ { "name": "DEPTNO", "type": "int" }, { "name": "NAME", "type": "string" } ], "primaryKey": "DEPTNO", "columnDelimiter": ",", "filePath": "/home/camelia2/stratosphere_sql/stratosphere-sql-1/sales" }

Inspired by http://dataprotocols.org/json-table-schema/

This would be the corresponding description for /stratosphere-sql-1/sales/tbl.csv

fhueske commented 10 years ago

Hi Camelia, looks like a good start. This gives the schema definition of a table as in a CREATE TABLE statement. In addition, information to read the data is required, i.e., file path, line and field delimiters.

rmetzger commented 10 years ago

Hey, if you use three ticks ``` GitHub will render the JSON nicer ;)

Do you think it's better to have one JSON file per table or all tables in one file? I don't think we need a primaryKey as we do not index the data.

fhueske commented 10 years ago

I agree that a PK might not be the most important thing for a first draft, but in general PKs are very valuable information even if you do not have indexes. They make cardinality estimation, especially for joins, much more accurate.

So I would keep it there as it does not hurt.
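
As an illustration of the cardinality point above, here is a minimal sketch of the textbook equi-join size estimate. The class and method names are hypothetical and not part of stratosphere-sql, and the row counts are made up. The formula assumes uniformly distributed values; a declared primary key lets an optimizer replace a guessed distinct-value count with the exact row count.

```java
public class JoinEstimate {
    /**
     * Textbook equi-join size estimate under the uniformity assumption:
     * |R join S| ~ |R| * |S| / max(distinct(R.a), distinct(S.b)).
     */
    public static long estimate(long rowsR, long rowsS, long distinctR, long distinctS) {
        return rowsR * rowsS / Math.max(distinctR, distinctS);
    }

    /** If R.a is a primary key, every value is distinct, so distinct(R.a) = |R|. */
    public static long estimateWithPrimaryKey(long rowsR, long rowsS, long distinctS) {
        return estimate(rowsR, rowsS, rowsR, distinctS);
    }

    public static void main(String[] args) {
        long depts = 100;     // hypothetical size of the table keyed by DEPTNO
        long emps = 10_000;   // hypothetical size of the table referencing it

        // Without key information, the distinct counts are a guess (here: 10 per side).
        System.out.println(estimate(depts, emps, 10, 10));

        // With DEPTNO declared as primary key, the distinct count on that side is exact.
        System.out.println(estimateWithPrimaryKey(depts, emps, 100));
    }
}
```

With the key declared, the estimate equals the size of the referencing table, which is what a key/foreign-key join actually produces when every reference matches.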

camelia-c commented 10 years ago

This is more like a question for clarification and I hope it makes sense. I have the following:

/home/camelia2/stratosphere_sql/stratosphere-sql-1 the repository synchronized to rmetzger/stratosphere-sql
/home/camelia2/stratosphere_main/stratosphere the repository synchronized to stratosphere/stratosphere

I successfully tested the word count example.

Now, how can I test the source code in stratosphere-sql?

Please give me an example like the one you gave: ./bin/stratosphere run --jarfile ./examples/java-record-api-examples-0.4-SNAPSHOT-WordCount.jar --arguments 1 file://pwd/hamlet.txt file://pwd/wordcount-result.txt

Robert also mentioned https://github.com/rmetzger/stratosphere/tree/sql_mainline_changes. Can you please show me how to set this up?

For example, now when I do [camelia2@localhost stratosphere-sql-1]$ mvn -DskipTests -e clean package

I get a "cannot find symbol" compilation error because the compiler cannot find the package eu.stratosphere.types.

Thank you very much!

rmetzger commented 10 years ago

Hey, sorry for the inconvenience. Obviously, the stratosphere-sql project is not set up for usability (currently ;)

# checkout sql_mainline_changes
cd /home/camelia2/stratosphere_main/stratosphere
git remote add robert https://github.com/rmetzger/stratosphere
git fetch robert
git checkout sql_mainline_changes
mvn clean install -DskipTests

Now you have the correct Stratosphere version to use the SQL interface. Next, I would highly recommend importing the contents of /home/camelia2/stratosphere_sql/stratosphere-sql-1 as a Maven Eclipse project. (I also recommend importing /home/camelia2/stratosphere_main/stratosphere as a Maven Eclipse project.) Then you can launch the SQL on Stratosphere interface by running the main method in the Launcher class. It should work out of the box.

camelia-c commented 10 years ago

I finished the implementation of this task and I committed the code as:

https://github.com/camelia-c/stratosphere-sql-1/commit/e23df24303c12ba2616a8c77e4bd06f530fe87a3

Tomorrow I plan to check some special cases that might occur, such as an empty fields array in the JSON file. I'll also make the changes to use the column delimiter and file path specified in the JSON file instead of the constant ones. I look forward to your feedback. Thank you!

If you consider it OK, can you please show me how to make a pull request to merge the changes into this branch?

rmetzger commented 10 years ago

Thank you. I'll have a look this evening.

camelia-c commented 10 years ago

Can you please check and confirm if this is an acceptable license: http://mvnrepository.com/artifact/org.codehaus.jackson/jackson-core-asl/1.9.13

It says The Apache Software License, Version 2.0, but I want to make sure. Thank you!

rmetzger commented 10 years ago

Yes, Jackson is fine. Stratosphere already has (transitive) dependencies on Jackson. The -asl suffix stands for "Apache Software License", so we are fine ;)

camelia-c commented 10 years ago

I committed the new version, Jackson parser based, at: https://github.com/camelia-c/stratosphere-sql-1/commit/eafbe72268f2ff0505c75b5f2ad5ff04066c5c87

TODOs for me for tomorrow: it remains to use the delimiters and the file path in the right place. Furthermore, the folder containing the JSON files needs to be specified somehow, and I'll look into this as well.

camelia-c commented 10 years ago

I suggest the following approach and I would like to receive your opinion prior to implementing it:

- to store all json files in a folder
- to pass this folder's name to the CsvSchema constructor (now we pass the folder with data files)
- in getTableMap, to retrieve all files with the .json extension and let their file names (without the .json extension) be the tables' names
- for each json file retrieved, to call the StratosphereTable constructor, passing it the current json file to be parsed.

This way we can link everything. What do you think?

I proposed the json filenames as table names because the csv data files might have the same name and differ in filePath instead. E.g.

/home/camelia2/stratosphere_sql/stratosphere-sql-1/CUSTOMER/data.csv
/home/camelia2/stratosphere_sql/stratosphere-sql-1/PRODUCT/data.csv

but we have:

/home/camelia2/stratosphere_sql/stratosphere-sql-1/jsonSchemas/customer.json
/home/camelia2/stratosphere_sql/stratosphere-sql-1/jsonSchemas/product.json
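
The lookup proposed above (every .json file in one folder, file name without extension as table name) could be sketched roughly as follows. This is only an illustration using the JDK's java.nio.file API; the class JsonTableDiscovery and its methods are hypothetical names, not code from the linked commits, and the real getTableMap would build StratosphereTable objects rather than a map of paths.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

public class JsonTableDiscovery {
    private static final String EXTENSION = ".json";

    /** Strips the ".json" suffix, e.g. "customer.json" becomes "customer". */
    public static String tableNameOf(String fileName) {
        return fileName.substring(0, fileName.length() - EXTENSION.length());
    }

    /** Maps table name to its JSON definition file for every *.json file in dir. */
    public static Map<String, Path> discover(Path dir) throws IOException {
        Map<String, Path> tables = new TreeMap<>();
        // The glob is matched against the file name of each directory entry.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*" + EXTENSION)) {
            for (Path file : files) {
                tables.put(tableNameOf(file.getFileName().toString()), file);
            }
        }
        return tables;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway folder that mimics the jsonSchemas directory.
        Path dir = Files.createTempDirectory("jsonSchemas");
        Files.createFile(dir.resolve("customer.json"));
        Files.createFile(dir.resolve("product.json"));
        Files.createFile(dir.resolve("notes.txt")); // skipped: not a .json file
        System.out.println(discover(dir).keySet());
    }
}
```

Because the table name comes from the definition file and not from the data file, two tables can both point at a file called data.csv in different directories without clashing.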

fhueske commented 10 years ago

Sounds very good to me :smile:

rmetzger commented 10 years ago

Hi,

to store all json files in a folder: agree! ... give me some more time to have a look into the code.

rmetzger commented 10 years ago

Can you pass the directory of the JSON files to the StratosphereSchemaFactory (and then pass it on to the CSVSchema thing)? Then I would create a StratosphereTable for each file, containing the details. Currently, your code is basically in the getter (getRowType()), which is not a nice architecture (I know, I did it this way ;) ) because the method might get called quite often.

getTableMap() seems to be the right place to do this. I'm a bit confused by this code (which I wrote):

Table tbl = new StratosphereTable();
incoming.add("tbl", tbl);
return incoming.add(schema);

We need to understand that and find a nicer solution.

I think we can rename the CSVSchema class to something more meaningful, such as "JSONSchemaParser" or so.
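
The concern about doing the parsing inside getRowType() can be illustrated with a small sketch: parse once when the table object is created, and let the getter return the stored result. The class ParseOnceTable, its counter, and the stand-in parse method are hypothetical and only serve the illustration; they are not the StratosphereTable code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ParseOnceTable {
    /** Counts how often the stand-in parser runs, to make the effect visible. */
    public static int parseCount = 0;

    private final List<String> rowType;

    /** The expensive work happens exactly once, at construction time. */
    public ParseOnceTable(String jsonFile) {
        this.rowType = parse(jsonFile);
    }

    /** Stand-in for reading and parsing the JSON definition file. */
    private static List<String> parse(String jsonFile) {
        parseCount++;
        return Collections.unmodifiableList(Arrays.asList("DEPTNO:int", "NAME:string"));
    }

    /** Cheap getter: returns the already parsed row type, however often it is called. */
    public List<String> getRowType() {
        return rowType;
    }

    public static void main(String[] args) {
        ParseOnceTable table = new ParseOnceTable("customer.json");
        table.getRowType();
        table.getRowType();
        System.out.println("parser runs: " + parseCount);
    }
}
```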

camelia-c commented 10 years ago

Hello,

I made the refactoring which is represented in my commit at: https://github.com/camelia-c/stratosphere-sql-1/commit/13abf4d72730d8ffba03e7bf51e83bbb14b66f4d I admit that by mistake I deleted the CsvTable.java in a commit and then I had to bring it back to the working branch.

Here are some details: First, in SchemaFactory I made some cleaning at: https://github.com/camelia-c/stratosphere-sql-1/blob/13abf4d72730d8ffba03e7bf51e83bbb14b66f4d/src/main/java/eu/stratosphere/sql/StratosphereSchemaFactory.java

Secondly, in CsvSchema I added support for parsing all json files in a specified directory and build a schema based on them, by creating a new StratosphereTable() for each such file: https://github.com/camelia-c/stratosphere-sql-1/blob/13abf4d72730d8ffba03e7bf51e83bbb14b66f4d/src/main/java/eu/stratosphere/sql/CsvSchema.java

Thirdly, in StratosphereTable I made some cleaning after refactoring the code : https://github.com/camelia-c/stratosphere-sql-1/blob/13abf4d72730d8ffba03e7bf51e83bbb14b66f4d/src/main/java/eu/stratosphere/sql/StratosphereTable.java

The query on the customers table is successfully executed in this new setting.

I look forward to your feedback so far.

Next, it remains to find where the delimiters come into play. Thank you!

rmetzger commented 10 years ago

Wow. Very cool. The delimiters come into play in the StratosphereDataSource class. You can pass them to the CsvInputFormat constructor.

But: in our case, I would recommend setting the configuration of the CsvInputFormat. You can use it this way:

FileDataSource lineitems = new FileDataSource(new CsvInputFormat(), lineitemsPath, "LineItems");
CsvInputFormat.configureRecordFormat(lineitems)
    .recordDelimiter('\n')
    .fieldDelimiter('|')
    .field(LongValue.class, 0)      // order id
    .field(DoubleValue.class, 5);   // extended price

rmetzger commented 10 years ago

The only thing I see is that you are probably making your life harder than necessary by using the "streaming" JSON parser. You are probably better off using the Jackson mapper (http://wiki.fasterxml.com/JacksonSampleSimplePojoMapper). It will convert the JSON structure into a Java structure of lists, sets and pairs. But you don't have to change it now (only if you agree with me).

rmetzger commented 10 years ago

Oh wait... There is probably an easier solution...

rmetzger commented 10 years ago

The website of Jackson is not very well maintained :(

rmetzger commented 10 years ago

I think this is better http://wiki.fasterxml.com/JacksonTreeModel

camelia-c commented 10 years ago

Thanks for the feedback! I'll look into the delimiters code tomorrow and also start writing the project proposal (that we need to discuss in the following days).

Afterwards, when I'm finished with the proposal, I'll come back and switch to the Jackson mapper if you consider it better or easier. Personally I enjoyed the streaming parser :) , but I'm ready to change it when necessary.

rmetzger commented 10 years ago

No, let's keep the parser the way it is right now.

Yeah, figuring out a good proposal is difficult since the project is a) at an early stage and b) moving fast. But honestly, your cooperation over the last few days is probably more important for our decision than the proposal ;)

camelia-c commented 10 years ago

Hello,

I'm quite close to finishing the task but I need a clarification about adding fields in CsvInputFormat.

If I omit the fields from CsvInputFormat.configureRecordFormat(src) then I'll get a runtime error :) eu.stratosphere.configuration.IllegalConfigurationException: No fields configured in the CsvInputFormat

So they're mandatory. Now, if I put

FileDataSource src = new FileDataSource(new CsvInputFormat(), filePath, tableName);
CsvInputFormat.configureRecordFormat(src)
    .recordDelimiter(rowDelimiter)
    .fieldDelimiter(columnDelimiter.charAt(0))
    .field(LongValue.class, 0)
    .field(DoubleValue.class, 5);

with example fields just to play, I receive the compilation error:

The method field(Class<? extends Value>, int) in the type CsvInputFormat.AbstractConfigBuilder is not applicable for the arguments (Class, int) StratosphereDataSource.java /stratosphere-sql/src/main/java/eu/stratosphere/sql/relOpt line 63 Java Problem

Please tell me which classes of values are accepted for fields in the StratosphereDataSource.getStratosphereOperator() method and what is the meaning of 0 and 5 in your example. Thank you!

rmetzger commented 10 years ago

The field definitions tell the system how to convert a csv file (or any other delimited input) into a Tuple (called Record here). So

.field(LongValue.class, 0)
.field(DoubleValue.class, 5);

means that the first element (in a line) is of type long and the sixth element is of type double. The elements are placed one after another in the record. In your case, just assign increasing ids (starting from 0). I think the classes you specified are correct.
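
Assigning increasing positions to the fields listed in the JSON file could look roughly like the sketch below. It only builds the text of the .field(...) calls so that it runs without Stratosphere on the classpath. The class FieldPositions and the type-name mapping are assumptions made for illustration: LongValue and DoubleValue appear in this thread, while IntValue and StringValue as targets for "int" and "string" are a guess at what the real mapping would use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FieldPositions {
    /** Hypothetical mapping from JSON "type" strings to value class names. */
    private static final Map<String, String> TYPE_TO_VALUE_CLASS = new HashMap<>();
    static {
        TYPE_TO_VALUE_CLASS.put("int", "IntValue");       // assumed class name
        TYPE_TO_VALUE_CLASS.put("long", "LongValue");
        TYPE_TO_VALUE_CLASS.put("double", "DoubleValue");
        TYPE_TO_VALUE_CLASS.put("string", "StringValue"); // assumed class name
    }

    /** Produces one ".field(X.class, i)" call per field, with i counting up from 0. */
    public static List<String> fieldCalls(List<String> jsonTypes) {
        List<String> calls = new ArrayList<>();
        for (int position = 0; position < jsonTypes.size(); position++) {
            String type = jsonTypes.get(position);
            String valueClass = TYPE_TO_VALUE_CLASS.get(type);
            if (valueClass == null) {
                throw new IllegalArgumentException("Unsupported field type: " + type);
            }
            calls.add(".field(" + valueClass + ".class, " + position + ")");
        }
        return calls;
    }

    public static void main(String[] args) {
        // Field types of the DEPTNO/NAME example definition from earlier in the thread.
        for (String call : fieldCalls(Arrays.asList("int", "string"))) {
            System.out.println(call);
        }
    }
}
```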

camelia-c commented 10 years ago

Ready. My newest commit is at: https://github.com/camelia-c/stratosphere-sql-1/commit/a17bd6d51a62e4c90c7e48a08aed7acab9f33d99

How can I submit changes as a small patch so that you can use them as well? You said these changes are also of help in implementing the join operator and I saw that you're currently working on that.

Thank you very much and have a great week-end! Talk to you soon.

rmetzger commented 10 years ago

You can open a pull request: https://help.github.com/articles/creating-a-pull-request

That means you're offering your code for me to merge into the main repository.

rmetzger / stratosphere-sql

implement a VERY simple way to create table definitions #1