omkardash / jaql

Automatically exported from code.google.com/p/jaql
0 stars 0 forks source link

Code review request #58

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Could you review my implementation of NON-JSON streaming out? If 
it is OK, I will merge it into trunk.

For changes in this branch, refer to Revision 357.

Original issue reported on code.google.com by yaojingguo@gmail.com on 28 Sep 2009 at 1:33

GoogleCodeExporter commented 9 years ago
Sorry, Revision 358.

1 Add jsonToCsv and jsonToXml functions
2 Add UnquotedJsonString for unquoted printing of XML and CSV in Jaql Shell
3 Totally disable console print in Jaql Shell batch mode

Original comment by yaojingguo@gmail.com on 28 Sep 2009 at 1:35

GoogleCodeExporter commented 9 years ago
The attachment is a zip file which contains several jaql query scripts. You can 
try
them by issuing "jaqlshell -b <file-name>".

Original comment by yaojingguo@gmail.com on 28 Sep 2009 at 1:59

Attachments:

GoogleCodeExporter commented 9 years ago
You need not to review the following files. They are not changed.

/branches/non-json/src/java/com/ibm/jaql/json/parser/JsonParser.java
/branches/non-json/src/java/com/ibm/jaql/json/parser/JsonParserConstants.java
/branches/non-json/src/java/com/ibm/jaql/json/parser/JsonParserTokenManager.java

Original comment by yaojingguo@gmail.com on 28 Sep 2009 at 2:09

GoogleCodeExporter commented 9 years ago
Initial comments:

1. How is com.ibm.jaql.io.hadoop.ToDelConverter different from the JsonToCsvFn? 
The "guts" of both of these (e.g., to convert JSON to CSV) should be the same 
so I'd 
prefer that we converge here instead of having two pieces of near duplicate 
code. 

2. The requirement for an unquoted string to be a new type is unneccesary. By 
using 
streams and formats/converters/serializers, there is no need for such format 
specific issues bubbling up into the type system.

3. Why is JsonParser different if the jj file has not changed?

4. We should consolidate the various JaqlUtil classes that are currently 
present 
instead of adding a new one.

5. Lets remove the Assert class until we agree on using it. Needs more 
discussion.

6. I need a clarification on the "batch" flag and non-json output with regards 
to 
the shell. If the user requests interactive mode, i.e., "non-batch" mode, the 
format 
for stdout will be JSON? Otherwise, in batch mode, the user can choose from one 
of 
several formats, including JSON. However, the user cannot print some other 
format, 
e.g., CSV when in interactive mode, correct?

7. The new class to disable output of the shell seems to do so globally at 
java's 
System level. Is this related to the issue that you had where "write" outputs 
its 
return value (a descriptor) yet for the case of non-JSON output, you also want 
to 
print to stdout? If this is the case, why not simply suppress output in the 
jaql 
shell for this case?

Original comment by vuk.erce...@gmail.com on 28 Sep 2009 at 11:01

GoogleCodeExporter commented 9 years ago
Hi, Vuk

Here are my responses to your comments except the 2nd comment. I am still 
thinking
about the 2nd comment.

1 There are 2 differences. ToDelConverter only handles the table-like JSON 
array (A
JSON array is table-like if all of its elements are JSON array or JSON record. 
Refer
to Javadoc for JsonToCsvFn.table). For example, del function can't convert a 
single
JSON string to CSV. ToDelConverter converts JOSN values to Hadoop text. 
JsonToCsvFn
converts JSON values to JSON string. I will do the converge.

3 In Eclipse, select "Project->Clean" and select jaql project to do a clean. 
After
the clean is finished, JsonParserTokenManager.java, JsonParserConstants.java and
JsonParser.java will appear modified. After checking .project file, I find that 
the
following text.

    <buildCommand>
            <name>sf.eclipse.javacc.javaccbuilder</name>
            <arguments>
            </arguments>
    </buildCommand>

Since I have installed JavaCC Eclipse plugin, the above 3 Java files has been
automatically re-created. I use 3.4 version of Eclipse IDE for Java Developers.
"Build Automatically" is selected.

4 This new JaqlUtil is unreferenced. I have removed it. If you think that this 
method
is of value, I can add the following method to Jaql.
  /**
   * Evaluates queries in the script until a JSON value is evaluated.
   * 
   * @return The evaluated JSON value
   * @throws Exception
   */
  public JsonValue evalValue(String jaqlStr) throws Exception {
    Expr e;
    do {
      e = prepareNext();
    } while (e == null);
    JsonValue jv = e.eval(context);
    return jv;
  }

5 Assert class is removed.

6 For the current implementation, jsonToCsv and jsonToXml work for both batach 
mode
and interactive mode.
6.1 If the user requests interactive mode, i.e., "non-batch" mode, the format 
for
stdout will be JSON. The JSON format is default. But the user can still use 
jsonToCsv
or jsonToXml to get the output in CSV format or XML format.
6.2 In batch mode, the user can use any format for output. They can make the 
mixed
use of output formats. For example, they can first use JSON format to print 
some JSON
values. Then they can use CSV format to print some other JSON values.
6.3 In interactive mode, the user can get print in any available output format. 
For
now, they are JSON, CSV and XML.

7 Even in batch mode, the following messages are printed to console when 
starting up
Jaql Shell.

Starting DataNode 0 with dfs.data.dir:
/tmp/jaql/dfs/dfs/data/data1,/tmp/jaql/dfs/dfs/data/data2
Waiting for the Mini HDFS Cluster to start...
Generating rack names for tasktrackers
Generating host names for tasktrackers

And the following messages are printed to console when shutting down Jaql Shell.

Shutting down the Mini HDFS Cluster
Shutting down DataNode 0

All these information will mess up with the printed JSON vale in batch mode. 
MiniHBaseCluster, MiniHBaseCluster and MiniDFSCluster produce these messages. 
For
details, refer to Java for ConsolePrintEnabler and
JaqlShellArguments.enableConsolePrint. Since they are out of our control, I 
need to
disable console print at a global level. Maybe there is a better way to do it.

Original comment by yaojingguo@gmail.com on 29 Sep 2009 at 5:22

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hi, Vuk

I have finished the implementation based on our discussion. Please give your
comments. I will merge my changes to svn trunk if you think that my 
implementation is
OK. At the same time, I will continue to do some refactoring and test during my 
vacation.

1 I don't add an option for "no echo". In batch mode, "no echo" mode is used 
for Jaql
Shell. I have made some investigation on Ruby and Python. When fed with a 
script,
they don't echo.

2 AbstractWriteExpr is changed to allow NON-array JSON values. XML's tree 
structure
does not fit with array mode.

3 Add the following functions: stdout, csvFormat and xmlFormat. Try the 
following
queries to see how they works. 
- [1,2,3]->write(stdout());
- [[100,200], [300,400]]->write(stdout(csvFormat()));
- {root: {content: [1, 2, 3]}}->write(stdout(xmlFormat()));
- [[1,2],[3,4]]->write(stdout({format:
'com.ibm.jaql.io.stream.converter.LinesJsonTextOutputStream', converter:
'com.ibm.jaql.io.stream.converter.ToDelConverter'})); 

Original comment by yaojingguo@gmail.com on 30 Sep 2009 at 4:01

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
I have committed some updates to this branch. Some new comments:
1. If no location is provided, STDOUT is used in StreamOutputAdapter's 
openStream method.
2. JsonToStream's close method is renamed to cleanUp. If the output stream is 
STDOUT,
it will not be closed in cleanUp method.
3. My chagnes to AbstractWriteExpr's eval method break some test cases. I have
removed some of the test case failures. But TestModule still fails. I am
investigating on it.

Original comment by yaojingguo@gmail.com on 1 Oct 2009 at 2:19

GoogleCodeExporter commented 9 years ago
TestModule only fails on Cygwin. It succeeds on Linux. I have just run "ant 
test" on
Ubuntu Linux. It passed.

Original comment by yaojingguo@gmail.com on 1 Oct 2009 at 3:37

GoogleCodeExporter commented 9 years ago
Comments (not in any particular order) ...

1. JsonToStream: why did you change the method name from close to cleanUp?

2. StreamOutputAdapter: why is there a need for a "converter"? For the stream 
stuff, 
we have JsonValue -> stream. With the proposed change, we have JsonValue -> 
JsonValue -> stream. Why the extra step in the I/O package? The language should 
already handle all JsonValue -> JsonValue conversions. Can this conversion you 
propose be handled via a function?

3. StreamOutputAdapter: when a "location" is not given, the default is assumed 
to be 
stdout. Lets make the default be configurable-- we'll need methods to set/get 
the 
default and the method that creates a stream should use this default.

4. Why do I see a JsonEncoding.java.orig? Similarly, SystemNamespace.java.orig?

5. What is the "om" directory under src/test?

6. What is the intention behind the various "OutputFormatFn" classes? It seems 
that 
there is an assumption regarding "streams" there. Why is "format" necessarily 
related to "stream". We also support csv or xml with the record-oriented output 
adapters (e.g., hadoop files). 

7. Lets not change "write" to accept non-arrays. If you want to use write, you 
wrap 
its input in an array for now. What advantages do you see for changing write's 
functionality to accept non-arrays?

8. I see that JsonTextOutputStream has been removed. Did you rename it or 
remove its 
functionality. I think we want to retain its functionality and if one wants to 
just 
print one line per json value, then they can pass in a parameter to this class 
(is 
this the purpose of LinesJsonTextOutputStream?)

9. Let use consistent naming-- what is it, CSV or Del? It should be Del.

10. It would be good to re-use ToDelConverter as much as possible-- I see some 
redundancy in code. The level of re-use in the current design forces an 
extra "converter" to streams. It would be a better idea to see how we can 
re-use the 
code without requiring this extra parameter. Consider making an abstract for 
ToDelConverter and adding the necessary support (should be minimal change) for 
streams. Also, while the streaming support is important for flexibility, it is 
meant 
more for "small" data that is used for sequential processing instead of large 
data 
that know how to be partitioned (e.g., InputFormat).

Original comment by vuk.erce...@gmail.com on 1 Oct 2009 at 6:02

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hi, Vuk

I have committed the new implementation based on your comments. Please review.

1 The reason is that the output stream will not be closed if it is STDOUT. If 
you
think that close is better, I will change it back.
3 Done. Refer to StreamOutputAdapter.
4 Removed.
5 Removed.
8 It is renamed to ArrayJsonTextOutputStream. LinesJsonTextOutputStream and
ArrayJsonTextOutputStream extends AbstractJsonTextOutputStream.
ArrayJsonTextOutputStream is the old JsonTextOutputStream. Yes,
LinesJsonTextOutputStream is for printing one line per JSON value.
9 Renamed to Del.

2, 6, 7, 10
Comment 2, 6, 7 and 10 are related to each other. It seems that I have 
misunderstood
your meaning. I have rewritten the implementation based on your new comments.
ToDelConverter and JsonToDelFn have wrapped JsonToDel since JsonToDelFn must 
extend
Expr. The following text shows the new functions with some examples: 

stdout(): By default, it uses LinesJsonTextOutputStream.
- stdout();
- [1,2,3]->write(stdout());
- [1,2,3]->write(stdout({format:
'com.ibm.jaql.io.stream.converter.ArrayJsonTextOutputStream'}));

jsonToDel(): convert a JSON value array to an array of JSON string.
- [[1,2],[3,4]]->jsonToDel();
- [[1,2],[3,4]]->jsonToDel()->write(stdout());
- [['one','two'],['three','four']]->jsonToDel();
- [['one','two'],['three','four']]->jsonToDel()->write(stdout());

jsonToXml(): convert a JSON value array to an array of JSON string.
- {root: {content: [1,2,3]}}->jsonToXml();
- {root: {content: [1, 2, 3]}}->jsonToXml()->write(stdout());

Original comment by yaojingguo@gmail.com on 3 Oct 2009 at 5:57

GoogleCodeExporter commented 9 years ago
1. change "cleanUp" back to "close"

2. there are several inconsistencies when it comes to "csv" vs. "del"

   - package name is "csv" (com.ibm.jaql.expr.csv) but classes refer to "del"
   - comments refer to "csv" in "del" files (e.g., JsonToDel)
   - similar issues down test

3. I'm unclear on how JsonToDel is being tested. I see that for certain 
testcases 
that are expected to fail, a fail() is used. For those that are expected to 
pass, I 
expect to see some sort of validation against expected output. Perhaps I'm 
missing 
something... In either case, I think its a good idea to show tests via java but 
I'd 
also like to see several tests that use the current scheme (query + gold file).

4. There was a modification down hadoop specific changes for version 0.20.1. 
Please 
remember to make these changes to versions of this file that pertain to all 
supported hadoop versions.

5. Where are we when it comes to the shell? What are the options for echoing? 
Are 
stdout/stderr still being redirected in certain cases? Going back to the 
original 
feature description, we want to make it easy for a user to output del so that 
its 
easy to combine w/ other programs that are hooked together through a script. So 
far, 
I see that the user will need to add a "write(stdout())" call. So if I just 
want to 
read some json, I'd do 'read(hdfs("foo")) -> write(stdout())'. What about 
simply 'read(hdfs("foo"))' but on the shell speciy that the output should be 
del, 
not json via a flag such as "--format del"?

Original comment by vuk.erce...@gmail.com on 5 Oct 2009 at 6:33

GoogleCodeExporter commented 9 years ago
1 Done
2 Done
3 Fail is used to verify that a specific exception be thrown. If 
  the wanted exception is not thrown, the test case will fail.  
  For test cases expected to pass, only the result is logged.  
  Yes, it is better to do some verification. I will rewrite test 
  cases to use the existing query+gold scheme.
4 Done
5.1 For interactive mode, Jaql Shell is in echoing mode. For 
  batch mode, Jaql Shell is in non-echoing mode. In batch mode, 
  some outputs to stdout/stderror are disabled. I don't provide 
  an option for echoing. This decision is based on the 
  investigation on Python shell and Ruby shell. In interactive 
  mode, Python and Ruby will echo expressions. But when fed with 
  a script, they will print nothing to stdout unless puts/print 
  statements are used.
5.2 If we want to use 'read(hdfs("foo"))' to print JSON, we need  
  to use echoing mode. In echoing mode, a lot of jaql queries 
  will print text to stdout. This text will mess up with the 
  output of the desired JSON values. And if 'del' option is used, 
  jaql queries which don't produce an JSON array will cause error 
  since only JSON arrays are valid for del format.  The same is 
  for xml format. So I think it is better to let the user 
  explicitly use 'write(stdout())' whenever he want to print 
  something in a Jaql script. This is also similar to the 
  approach taken by Ruby and Python.

Original comment by yaojingguo@gmail.com on 9 Oct 2009 at 3:32

GoogleCodeExporter commented 9 years ago
3. Finish the rewriting of JsonToDelTest.java and StdoutFnTest.java with the 
use of
query+gold scheme.

Original comment by yaojingguo@gmail.com on 10 Oct 2009 at 11:11

GoogleCodeExporter commented 9 years ago
1. JsonToDelFn: if an array is expected, then it is preferable to access it via 
an 
iterator instead of using "eval", which materializes the input expr's value. 
Using 
eval allocates memory up-front... imagine if the input expression produces 20 
GB of 
data, it would be preferable to stream through it via iteration.

2. What does the JsonToDel tests provide in addition to the gold/query test 
cases? 
In addition, "eval" is not validated. I propose that we go with the gold/query 
test 
cases.

3. In general, lets try to group related tests. The file and test cases for 
stdout 
is too trivial to warrant its own test files. Furthermore, the del tests should 
be 
moved to the storage tests-- there you will already see handling for multiple 
formats. stdout should go there as well since all other stream tests are there.

4. A common use case is a single, simple pipeline. For this case, lets make it 
easy 
on the user to output del if they want to. It is understood that this will not 
always be meaningful (e.g., multiple statements and arbitrarily nested data). 
This 
does not seem to be a common operation in Python/Ruby so whether they support 
this 
feature or not is not an argument for our case. The changes that you made 
should 
remain and in addition, given a cmd-line switch, change the output format to be 
del.

Original comment by vuk.erce...@gmail.com on 12 Oct 2009 at 6:37

GoogleCodeExporter commented 9 years ago
This code review is finished.

Original comment by yaojingguo@gmail.com on 14 Oct 2009 at 2:39

GoogleCodeExporter commented 9 years ago
This code review is finished.

Original comment by yaojingguo@gmail.com on 14 Oct 2009 at 2:39