Open camelia-c opened 10 years ago
Sounds great. Would be cool if you could share the JSON-file structure once you have it .. I just want to have a look
{ "fields": [ { "name": "DEPTNO", "type": "int" }, { "name": "NAME", "type": "string" } ], "primaryKey": "DEPTNO", "columnDelimiter": ",", "filePath": "/home/camelia2/stratosphere_sql/stratosphere-sql-1/sales" }
Inspired by http://dataprotocols.org/json-table-schema/. This would be the corresponding description for /stratosphere-sql-1/sales/tbl.csv
Hi Camelia, looks like a good start. This gives the schema definition of a table as in a CREATE TABLE statement. In addition, information to read the data is required, i.e., file path, line and field delimiters.
Hey, if you use three ticks ``` GitHub will render the Json nicer ;)
Do you think it's better to have one JSON file per table or all tables in one file? I don't think we need a primaryKey as we do not index the data.
I agree that the PK might not be the most important thing for a first draft, but in general PKs are very valuable information even if you do not have indexes. They make cardinality estimation, especially for joins, much more accurate.
So I would keep it there as it does not hurt.
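To make the cardinality argument concrete, here is a minimal sketch of the textbook equi-join size estimate. The class and method names are hypothetical and not part of stratosphere-sql, and the numbers in `main` are made up for illustration; the only point is that knowing a column is a primary key pins down its distinct-value count.

```java
/** Hypothetical helper, not part of stratosphere-sql: textbook equi-join size estimates. */
final class JoinCardinalitySketch {

    /**
     * Classic estimate |R| * |S| / max(d_R, d_S), where d_X is the number of
     * distinct join-key values in X.
     */
    static long estimate(long rowsR, long distinctR, long rowsS, long distinctS) {
        long maxDistinct = Math.max(1, Math.max(distinctR, distinctS));
        return rowsR * rowsS / maxDistinct;
    }

    /**
     * If the join key is the primary key of R, every key value is unique, so
     * d_R equals |R| and no guess about distinct values in R is needed.
     */
    static long estimateWithPrimaryKeyOnR(long rowsR, long rowsS, long distinctS) {
        return estimate(rowsR, rowsR, rowsS, distinctS);
    }

    public static void main(String[] args) {
        // 100 departments (DEPTNO is the primary key) joined with 10,000 rows.
        System.out.println(estimateWithPrimaryKeyOnR(100, 10_000, 100)); // prints 10000
        // Without key knowledge the optimizer must guess distinct counts, e.g. 10.
        System.out.println(estimate(100, 10, 10_000, 10));               // prints 100000
    }
}
```

With the key declared, the estimate matches the fact that each row on the other side joins with at most one department; with a guessed distinct count it is off by a factor of ten in this example.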
This is more like a question for clarification and I hope it makes sense. I have the following:
I successfully tested the word count example.
Now, how can I test the source code in stratosphere-sql?
Please give me an example as you gave in :
./bin/stratosphere run --jarfile ./examples/java-record-api-examples-0.4-SNAPSHOT-WordCount.jar --arguments 1 file://`pwd`/hamlet.txt file://`pwd`/wordcount-result.txt
Robert also mentioned https://github.com/rmetzger/stratosphere/tree/sql_mainline_changes. Can you please show me how to set this up?
For example, now when I do [camelia2@localhost stratosphere-sql-1]$ mvn -DskipTests -e clean package
I get "cannot find symbol" compilation error because it doesn't know how to find package eu.stratosphere.types.
Thank you very much!
Hey, sorry for the inconvenience. Obviously, the stratosphere-sql project is not set up for usability (currently ;)
# checkout sql_mainline_changes
cd /home/camelia2/stratosphere_main/stratosphere
git remote add robert https://github.com/rmetzger/stratosphere
git fetch robert
git checkout sql_mainline_changes
mvn clean install -DskipTests
Now you have the correct Stratosphere version to use the SQL interface.
Next, I would highly recommend importing the contents of /home/camelia2/stratosphere_sql/stratosphere-sql-1 as a Maven Eclipse project.
(I also recommend importing /home/camelia2/stratosphere_main/stratosphere as a Maven Eclipse project.)
Then, you can launch the SQL on Stratosphere interface by running the main method in the Launcher class.
It should work out of the box.
I finished the implementation of this task and I committed the code as:
https://github.com/camelia-c/stratosphere-sql-1/commit/e23df24303c12ba2616a8c77e4bd06f530fe87a3
Tomorrow I plan to check some special cases that might occur, such as an empty fields array in the JSON file. I'll also make the changes to use the column delimiter and file path specified in the JSON file instead of the constant ones. I look forward to your feedback. Thank you!
If you consider it OK, can you please tell me how to make a pull request to merge the changes into this branch?
Thank you. I'll have a look this evening.
Can you please check and confirm if this is an acceptable license: http://mvnrepository.com/artifact/org.codehaus.jackson/jackson-core-asl/1.9.13
It says The Apache Software License, Version 2.0, but I want to make sure. Thank you!
Yes, Jackson is possible. Stratosphere has already (transitive) dependencies to Jackson.
The -asl suffix is for "apache software license" so we are fine ;)
I committed the new version, Jackson parser based, at: https://github.com/camelia-c/stratosphere-sql-1/commit/eafbe72268f2ff0505c75b5f2ad5ff04066c5c87
TODOs for me for tomorrow: the delimiters and the file path still need to be used in the right place. Furthermore, the folder containing the JSON files has to be specified in some way, and I'll look into this as well.
I suggest the following approach and I would like to receive your opinion prior to implementing it:
This way we can link everything. What do you think?
I proposed the JSON filenames as table names because the CSV data files might share the same name and differ only in filePath. E.g.
/home/camelia2/stratosphere_sql/stratosphere-sql-1/CUSTOMER/data.csv
/home/camelia2/stratosphere_sql/stratosphere-sql-1/PRODUCT/data.csv
but we have:
/home/camelia2/stratosphere_sql/stratosphere-sql-1/jsonSchemas/customer.json
/home/camelia2/stratosphere_sql/stratosphere-sql-1/jsonSchemas/product.json
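The naming proposal could be sketched roughly as follows, using only the JDK. The class and method names are hypothetical, and keeping the table name exactly as it appears in the filename (no upper-casing) is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: derive table names from the JSON schema filenames. */
final class SchemaDirectorySketch {

    /** "customer.json" becomes "customer"; other names are returned unchanged. */
    static String tableNameFor(Path jsonFile) {
        String fileName = jsonFile.getFileName().toString();
        return fileName.endsWith(".json")
                ? fileName.substring(0, fileName.length() - ".json".length())
                : fileName;
    }

    /** Maps each table name to its schema file; only *.json files are considered. */
    static Map<String, Path> discoverTables(Path schemaDir) {
        Map<String, Path> tables = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(schemaDir, "*.json")) {
            for (Path file : files) {
                tables.put(tableNameFor(file), file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tables;
    }
}
```

Because the table name comes from the schema file and not from the data file, two tables can both point at a file called data.csv without clashing.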
Sounds very good to me :smile:
Hi,
to store all json files in a folder: agree!
... give me some more time to have a look into the code.
Can you pass the directory containing the JSON files to the StratosphereSchemaFactory (and then pass it on to the CSVSchema thing)?
Then, I would create a StratosphereTable for each file, containing the details.
Currently, your code is basically in the getter (getRowType()), which is not a nice architecture (I know, I did it this way ;) ) because the method might get called quite often.
getTableMap() seems to be the right place to do this.
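One way to keep the parsing out of a frequently called getter is to build the table map once and cache it. The sketch below is a generic illustration with hypothetical names; it does not use the real StratosphereTable or schema classes, and the `loader` stands in for whatever code reads the JSON files.

```java
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch: load all table definitions once and reuse the result. */
final class LazySchemaSketch<T> {

    private final Supplier<Map<String, T>> loader; // e.g. parses every JSON file in a directory
    private Map<String, T> tableMap;               // stays null until the first request

    LazySchemaSketch(Supplier<Map<String, T>> loader) {
        this.loader = loader;
    }

    /** Runs the (possibly expensive) loader on the first call only. */
    synchronized Map<String, T> getTableMap() {
        if (tableMap == null) {
            tableMap = loader.get();
        }
        return tableMap;
    }

    /** Cheap lookup that can be called as often as needed. */
    T getTable(String name) {
        return getTableMap().get(name);
    }
}
```

Each table object then only carries the already parsed details, so a getter on the table has nothing left to read from disk.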
I'm a bit confused by this code (which I wrote):
Table tbl = new StratosphereTable();
incoming.add("tbl", tbl);
return incoming.add(schema);
We need to understand that and find a nicer solution.
I think we can rename the CSVSchema class to something more meaningful, such as "JSONSchemaParser" or so.
Hello,
I made the refactoring which is represented in my commit at: https://github.com/camelia-c/stratosphere-sql-1/commit/13abf4d72730d8ffba03e7bf51e83bbb14b66f4d I admit that by mistake I deleted the CsvTable.java in a commit and then I had to bring it back to the working branch.
Here are some details: First, in SchemaFactory I made some cleaning at: https://github.com/camelia-c/stratosphere-sql-1/blob/13abf4d72730d8ffba03e7bf51e83bbb14b66f4d/src/main/java/eu/stratosphere/sql/StratosphereSchemaFactory.java
Secondly, in CsvSchema I added support for parsing all json files in a specified directory and build a schema based on them, by creating a new StratosphereTable() for each such file: https://github.com/camelia-c/stratosphere-sql-1/blob/13abf4d72730d8ffba03e7bf51e83bbb14b66f4d/src/main/java/eu/stratosphere/sql/CsvSchema.java
Thirdly, in StratosphereTable I made some cleaning after refactoring the code : https://github.com/camelia-c/stratosphere-sql-1/blob/13abf4d72730d8ffba03e7bf51e83bbb14b66f4d/src/main/java/eu/stratosphere/sql/StratosphereTable.java
The query on the customers table is successfully executed in this new setting.
I look forward to your feedback so far.
Next, I still have to find out where the delimiters come into play. Thank you!
Wow. Very cool.
The delimiters come into play in the StratosphereDataSource class.
You can pass them to the CsvInputFormat constructor.
But: in our case, I would recommend setting the configuration of the CsvInputFormat. You can use it this way:
FileDataSource lineitems = new FileDataSource(new CsvInputFormat(), lineitemsPath, "LineItems");
CsvInputFormat.configureRecordFormat(lineitems)
.recordDelimiter('\n')
.fieldDelimiter('|')
.field(LongValue.class, 0) // order id
.field(DoubleValue.class, 5); // extended price
The only thing I see is that you are probably making your life harder than necessary by using the "streaming" JSON parser. You are probably better off using the Jackson-Mapper (http://wiki.fasterxml.com/JacksonSampleSimplePojoMapper). It will convert the JSON structure into a Java structure of lists, sets and pairs. But you don't have to change it now (only if you agree with me).
Oh wait... There is probably an easier solution...
The website of Jackson is not very well maintained :(
I think this is better http://wiki.fasterxml.com/JacksonTreeModel
Thanks for the feedback! I'll look into the delimiters code tomorrow and also start writing the project proposal (that we need to discuss in the following days).
Afterwards, when I'm finished with the proposal, I'll come back and switch to the Jackson-Mapper if you consider it better or easier. Personally I enjoyed the streaming parser :) , but I'm ready to change it when necessary.
No, let's keep the parser the way it is right now.
Yeah, figuring out a good proposal is difficult since the project is a) at an early stage and b) moving fast. But honestly, your cooperation over the last few days is probably more important for our decision than the proposal ;)
Hello,
I'm quite close to finishing the task but I need a clarification about adding fields in CsvInputFormat.
If I omit the fields from CsvInputFormat.configureRecordFormat(src) then I'll get a runtime error :) eu.stratosphere.configuration.IllegalConfigurationException: No fields configured in the CsvInputFormat
So they're mandatory. Now, if I put
FileDataSource src = new FileDataSource(new CsvInputFormat(), filePath, tableName);
CsvInputFormat.configureRecordFormat(src)
.recordDelimiter(rowDelimiter)
.fieldDelimiter(columnDelimiter.charAt(0))
.field(LongValue.class, 0)
.field(DoubleValue.class, 5);
with example fields just to play, I receive the compilation error:
The method field(Class<? extends Value>, int) in the type CsvInputFormat.AbstractConfigBuilder
Please tell me which classes of values are accepted for fields in the StratosphereDataSource.getStratosphereOperator() method and what is the meaning of 0 and 5 in your example. Thank you!
The field definitions tell the system how to convert a csv file (or any other delimited input) into a Tuple (called Record here). So
.field(LongValue.class, 0)
.field(DoubleValue.class, 5);
means that the first element (in a line) is of type long and the sixth element is of type double. The elements are placed one after another in the record. In your case, just assign increasing ids (starting from 0). I think the classes you specified are correct.
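The effect of the position arguments can be imitated in plain Java. This is only an illustration with hypothetical names: it keeps the selected columns as strings and does not reproduce the type conversion that the real CsvInputFormat performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of what the field positions select from one input line. */
final class FieldSelectionSketch {

    /** Picks the columns at the given zero-based positions, in the order listed. */
    static List<String> select(String line, char fieldDelimiter, int... positions) {
        // Quote the delimiter so that characters such as '|' are not treated as regex syntax.
        String[] columns = line.split(Pattern.quote(String.valueOf(fieldDelimiter)), -1);
        List<String> record = new ArrayList<>();
        for (int position : positions) {
            record.add(columns[position]);
        }
        return record;
    }

    /** Positions 0, 1, ..., fieldCount - 1, for a schema that lists every column in order. */
    static int[] increasingPositions(int fieldCount) {
        int[] positions = new int[fieldCount];
        for (int i = 0; i < fieldCount; i++) {
            positions[i] = i;
        }
        return positions;
    }
}
```

For the line `1|a|b|c|d|9.5|x`, selecting positions 0 and 5 yields a two-element record holding `1` and `9.5`; for a JSON schema with two fields such as DEPTNO and NAME, the positions would simply be 0 and 1.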
Ready. My newest commit is at: https://github.com/camelia-c/stratosphere-sql-1/commit/a17bd6d51a62e4c90c7e48a08aed7acab9f33d99
How can I submit changes as a small patch so that you can use them as well? You said these changes are also of help in implementing the join operator and I saw that you're currently working on that.
Thank you very much and have a great week-end! Talk to you soon.
You can open a pull request: https://help.github.com/articles/creating-a-pull-request. That means you're offering your code for me to merge into the main repository.
A JSON-based table definition, only for the CSVInputFormat. The user should be able to define a SQL queryable table in a JSON file by defining the file-path, record and field delimiter and the field types.
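As a rough sketch of what such a definition has to carry once it is parsed, the class below holds the file path, both delimiters and the typed fields, and rejects the special cases mentioned in the thread (an empty fields array, a column delimiter that is not a single character). All names are hypothetical, and treating the row delimiter as a separate setting is an assumption, since the JSON example in this thread only shows columnDelimiter and filePath.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical in-memory form of one JSON table definition. */
final class TableDefinitionSketch {

    /** One column: its name and the type name used in the JSON file, e.g. "int". */
    static final class Field {
        final String name;
        final String type;

        Field(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    final String filePath;
    final String rowDelimiter;
    final char columnDelimiter;
    final List<Field> fields;

    TableDefinitionSketch(String filePath, String rowDelimiter,
                          String columnDelimiter, List<Field> fields) {
        if (filePath == null || filePath.isEmpty()) {
            throw new IllegalArgumentException("filePath is required");
        }
        if (rowDelimiter == null || rowDelimiter.isEmpty()) {
            throw new IllegalArgumentException("rowDelimiter is required");
        }
        if (columnDelimiter == null || columnDelimiter.length() != 1) {
            throw new IllegalArgumentException("columnDelimiter must be exactly one character");
        }
        if (fields == null || fields.isEmpty()) {
            throw new IllegalArgumentException("at least one field is required");
        }
        this.filePath = filePath;
        this.rowDelimiter = rowDelimiter;
        this.columnDelimiter = columnDelimiter.charAt(0);
        // Defensive copy so later changes to the caller's list do not affect the definition.
        this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
    }
}
```

Validating in the constructor means a malformed JSON file fails when the schema is loaded, not later as a runtime error inside the input format.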