Closed GoogleCodeExporter closed 9 years ago
Sorry, Revision 358.
1 Add jsonToCsv and jsonToXml functions
2 Add UnquotedJsonString for unquoted printing of XML and CSV in Jaql Shell
3 Totally disable console print in Jaql Shell batch mode
Original comment by yaojingguo@gmail.com
on 28 Sep 2009 at 1:35
The attachment is a zip file which contains several jaql query scripts. You can
try
them by issuing "jaqlshell -b <file-name>".
Original comment by yaojingguo@gmail.com
on 28 Sep 2009 at 1:59
Attachments:
You need not to review the following files. They are not changed.
/branches/non-json/src/java/com/ibm/jaql/json/parser/JsonParser.java
/branches/non-json/src/java/com/ibm/jaql/json/parser/JsonParserConstants.java
/branches/non-json/src/java/com/ibm/jaql/json/parser/JsonParserTokenManager.java
Original comment by yaojingguo@gmail.com
on 28 Sep 2009 at 2:09
Initial comments:
1. How is com.ibm.jaql.io.hadoop.ToDelConverter different from the JsonToCsvFn?
The "guts" of both of these (e.g., to convert JSON to CSV) should be the same
so I'd
prefer that we converge here instead of having two pieces of near duplicate
code.
2. The requirement for an unquoted string to be a new type is unneccesary. By
using
streams and formats/converters/serializers, there is no need for such format
specific issues bubbling up into the type system.
3. Why is JsonParser different if the jj file has not changed?
4. We should consolidate the various JaqlUtil classes that are currently
present
instead of adding a new one.
5. Lets remove the Assert class until we agree on using it. Needs more
discussion.
6. I need a clarification on the "batch" flag and non-json output with regards
to
the shell. If the user requests interactive mode, i.e., "non-batch" mode, the
format
for stdout will be JSON? Otherwise, in batch mode, the user can choose from one
of
several formats, including JSON. However, the user cannot print some other
format,
e.g., CSV when in interactive mode, correct?
7. The new class to disable output of the shell seems to do so globally at
java's
System level. Is this related to the issue that you had where "write" outputs
its
return value (a descriptor) yet for the case of non-JSON output, you also want
to
print to stdout? If this is the case, why not simply suppress output in the
jaql
shell for this case?
Original comment by vuk.erce...@gmail.com
on 28 Sep 2009 at 11:01
Hi, Vuk
Here are my responses to your comments except the 2nd comment. I am still
thinking
about the 2nd comment.
1 There are 2 differences. ToDelConverter only handles the table-like JSON
array (A
JSON array is table-like if all of its elements are JSON array or JSON record.
Refer
to Javadoc for JsonToCsvFn.table). For example, del function can't convert a
single
JSON string to CSV. ToDelConverter converts JOSN values to Hadoop text.
JsonToCsvFn
converts JSON values to JSON string. I will do the converge.
3 In Eclipse, select "Project->Clean" and select jaql project to do a clean.
After
the clean is finished, JsonParserTokenManager.java, JsonParserConstants.java and
JsonParser.java will appear modified. After checking .project file, I find that
the
following text.
<buildCommand>
<name>sf.eclipse.javacc.javaccbuilder</name>
<arguments>
</arguments>
</buildCommand>
Since I have installed JavaCC Eclipse plugin, the above 3 Java files has been
automatically re-created. I use 3.4 version of Eclipse IDE for Java Developers.
"Build Automatically" is selected.
4 This new JaqlUtil is unreferenced. I have removed it. If you think that this
method
is of value, I can add the following method to Jaql.
/**
* Evaluates queries in the script until a JSON value is evaluated.
*
* @return The evaluated JSON value
* @throws Exception
*/
public JsonValue evalValue(String jaqlStr) throws Exception {
Expr e;
do {
e = prepareNext();
} while (e == null);
JsonValue jv = e.eval(context);
return jv;
}
5 Assert class is removed.
6 For the current implementation, jsonToCsv and jsonToXml work for both batach
mode
and interactive mode.
6.1 If the user requests interactive mode, i.e., "non-batch" mode, the format
for
stdout will be JSON. The JSON format is default. But the user can still use
jsonToCsv
or jsonToXml to get the output in CSV format or XML format.
6.2 In batch mode, the user can use any format for output. They can make the
mixed
use of output formats. For example, they can first use JSON format to print
some JSON
values. Then they can use CSV format to print some other JSON values.
6.3 In interactive mode, the user can get print in any available output format.
For
now, they are JSON, CSV and XML.
7 Even in batch mode, the following messages are printed to console when
starting up
Jaql Shell.
Starting DataNode 0 with dfs.data.dir:
/tmp/jaql/dfs/dfs/data/data1,/tmp/jaql/dfs/dfs/data/data2
Waiting for the Mini HDFS Cluster to start...
Generating rack names for tasktrackers
Generating host names for tasktrackers
And the following messages are printed to console when shutting down Jaql Shell.
Shutting down the Mini HDFS Cluster
Shutting down DataNode 0
All these information will mess up with the printed JSON vale in batch mode.
MiniHBaseCluster, MiniHBaseCluster and MiniDFSCluster produce these messages.
For
details, refer to Java for ConsolePrintEnabler and
JaqlShellArguments.enableConsolePrint. Since they are out of our control, I
need to
disable console print at a global level. Maybe there is a better way to do it.
Original comment by yaojingguo@gmail.com
on 29 Sep 2009 at 5:22
[deleted comment]
Hi, Vuk
I have finished the implementation based on our discussion. Please give your
comments. I will merge my changes to svn trunk if you think that my
implementation is
OK. At the same time, I will continue to do some refactoring and test during my
vacation.
1 I don't add an option for "no echo". In batch mode, "no echo" mode is used
for Jaql
Shell. I have made some investigation on Ruby and Python. When fed with a
script,
they don't echo.
2 AbstractWriteExpr is changed to allow NON-array JSON values. XML's tree
structure
does not fit with array mode.
3 Add the following functions: stdout, csvFormat and xmlFormat. Try the
following
queries to see how they works.
- [1,2,3]->write(stdout());
- [[100,200], [300,400]]->write(stdout(csvFormat()));
- {root: {content: [1, 2, 3]}}->write(stdout(xmlFormat()));
- [[1,2],[3,4]]->write(stdout({format:
'com.ibm.jaql.io.stream.converter.LinesJsonTextOutputStream', converter:
'com.ibm.jaql.io.stream.converter.ToDelConverter'}));
Original comment by yaojingguo@gmail.com
on 30 Sep 2009 at 4:01
[deleted comment]
[deleted comment]
I have committed some updates to this branch. Some new comments:
1. If no location is provided, STDOUT is used in StreamOutputAdapter's
openStream method.
2. JsonToStream's close method is renamed to cleanUp. If the output stream is
STDOUT,
it will not be closed in cleanUp method.
3. My chagnes to AbstractWriteExpr's eval method break some test cases. I have
removed some of the test case failures. But TestModule still fails. I am
investigating on it.
Original comment by yaojingguo@gmail.com
on 1 Oct 2009 at 2:19
TestModule only fails on Cygwin. It succeeds on Linux. I have just run "ant
test" on
Ubuntu Linux. It passed.
Original comment by yaojingguo@gmail.com
on 1 Oct 2009 at 3:37
Comments (not in any particular order) ...
1. JsonToStream: why did you change the method name from close to cleanUp?
2. StreamOutputAdapter: why is there a need for a "converter"? For the stream
stuff,
we have JsonValue -> stream. With the proposed change, we have JsonValue ->
JsonValue -> stream. Why the extra step in the I/O package? The language should
already handle all JsonValue -> JsonValue conversions. Can this conversion you
propose be handled via a function?
3. StreamOutputAdapter: when a "location" is not given, the default is assumed
to be
stdout. Lets make the default be configurable-- we'll need methods to set/get
the
default and the method that creates a stream should use this default.
4. Why do I see a JsonEncoding.java.orig? Similarly, SystemNamespace.java.orig?
5. What is the "om" directory under src/test?
6. What is the intention behind the various "OutputFormatFn" classes? It seems
that
there is an assumption regarding "streams" there. Why is "format" necessarily
related to "stream". We also support csv or xml with the record-oriented output
adapters (e.g., hadoop files).
7. Lets not change "write" to accept non-arrays. If you want to use write, you
wrap
its input in an array for now. What advantages do you see for changing write's
functionality to accept non-arrays?
8. I see that JsonTextOutputStream has been removed. Did you rename it or
remove its
functionality. I think we want to retain its functionality and if one wants to
just
print one line per json value, then they can pass in a parameter to this class
(is
this the purpose of LinesJsonTextOutputStream?)
9. Let use consistent naming-- what is it, CSV or Del? It should be Del.
10. It would be good to re-use ToDelConverter as much as possible-- I see some
redundancy in code. The level of re-use in the current design forces an
extra "converter" to streams. It would be a better idea to see how we can
re-use the
code without requiring this extra parameter. Consider making an abstract for
ToDelConverter and adding the necessary support (should be minimal change) for
streams. Also, while the streaming support is important for flexibility, it is
meant
more for "small" data that is used for sequential processing instead of large
data
that know how to be partitioned (e.g., InputFormat).
Original comment by vuk.erce...@gmail.com
on 1 Oct 2009 at 6:02
[deleted comment]
Hi, Vuk
I have committed the new implementation based on your comments. Please review.
1 The reason is that the output stream will not be closed if it is STDOUT. If
you
think that close is better, I will change it back.
3 Done. Refer to StreamOutputAdapter.
4 Removed.
5 Removed.
8 It is renamed to ArrayJsonTextOutputStream. LinesJsonTextOutputStream and
ArrayJsonTextOutputStream extends AbstractJsonTextOutputStream.
ArrayJsonTextOutputStream is the old JsonTextOutputStream. Yes,
LinesJsonTextOutputStream is for printing one line per JSON value.
9 Renamed to Del.
2, 6, 7, 10
Comment 2, 6, 7 and 10 are related to each other. It seems that I have
misunderstood
your meaning. I have rewritten the implementation based on your new comments.
ToDelConverter and JsonToDelFn have wrapped JsonToDel since JsonToDelFn must
extend
Expr. The following text shows the new functions with some examples:
stdout(): By default, it uses LinesJsonTextOutputStream.
- stdout();
- [1,2,3]->write(stdout());
- [1,2,3]->write(stdout({format:
'com.ibm.jaql.io.stream.converter.ArrayJsonTextOutputStream'}));
jsonToDel(): convert a JSON value array to an array of JSON string.
- [[1,2],[3,4]]->jsonToDel();
- [[1,2],[3,4]]->jsonToDel()->write(stdout());
- [['one','two'],['three','four']]->jsonToDel();
- [['one','two'],['three','four']]->jsonToDel()->write(stdout());
jsonToXml(): convert a JSON value array to an array of JSON string.
- {root: {content: [1,2,3]}}->jsonToXml();
- {root: {content: [1, 2, 3]}}->jsonToXml()->write(stdout());
Original comment by yaojingguo@gmail.com
on 3 Oct 2009 at 5:57
1. change "cleanUp" back to "close"
2. there are several inconsistencies when it comes to "csv" vs. "del"
- package name is "csv" (com.ibm.jaql.expr.csv) but classes refer to "del"
- comments refer to "csv" in "del" files (e.g., JsonToDel)
- similar issues down test
3. I'm unclear on how JsonToDel is being tested. I see that for certain
testcases
that are expected to fail, a fail() is used. For those that are expected to
pass, I
expect to see some sort of validation against expected output. Perhaps I'm
missing
something... In either case, I think its a good idea to show tests via java but
I'd
also like to see several tests that use the current scheme (query + gold file).
4. There was a modification down hadoop specific changes for version 0.20.1.
Please
remember to make these changes to versions of this file that pertain to all
supported hadoop versions.
5. Where are we when it comes to the shell? What are the options for echoing?
Are
stdout/stderr still being redirected in certain cases? Going back to the
original
feature description, we want to make it easy for a user to output del so that
its
easy to combine w/ other programs that are hooked together through a script. So
far,
I see that the user will need to add a "write(stdout())" call. So if I just
want to
read some json, I'd do 'read(hdfs("foo")) -> write(stdout())'. What about
simply 'read(hdfs("foo"))' but on the shell speciy that the output should be
del,
not json via a flag such as "--format del"?
Original comment by vuk.erce...@gmail.com
on 5 Oct 2009 at 6:33
1 Done
2 Done
3 Fail is used to verify that a specific exception be thrown. If
the wanted exception is not thrown, the test case will fail.
For test cases expected to pass, only the result is logged.
Yes, it is better to do some verification. I will rewrite test
cases to use the existing query+gold scheme.
4 Done
5.1 For interactive mode, Jaql Shell is in echoing mode. For
batch mode, Jaql Shell is in non-echoing mode. In batch mode,
some outputs to stdout/stderror are disabled. I don't provide
an option for echoing. This decision is based on the
investigation on Python shell and Ruby shell. In interactive
mode, Python and Ruby will echo expressions. But when fed with
a script, they will print nothing to stdout unless puts/print
statements are used.
5.2 If we want to use 'read(hdfs("foo"))' to print JSON, we need
to use echoing mode. In echoing mode, a lot of jaql queries
will print text to stdout. This text will mess up with the
output of the desired JSON values. And if 'del' option is used,
jaql queries which don't produce an JSON array will cause error
since only JSON arrays are valid for del format. The same is
for xml format. So I think it is better to let the user
explicitly use 'write(stdout())' whenever he want to print
something in a Jaql script. This is also similar to the
approach taken by Ruby and Python.
Original comment by yaojingguo@gmail.com
on 9 Oct 2009 at 3:32
3. Finish the rewriting of JsonToDelTest.java and StdoutFnTest.java with the
use of
query+gold scheme.
Original comment by yaojingguo@gmail.com
on 10 Oct 2009 at 11:11
1. JsonToDelFn: if an array is expected, then it is preferable to access it via
an
iterator instead of using "eval", which materializes the input expr's value.
Using
eval allocates memory up-front... imagine if the input expression produces 20
GB of
data, it would be preferable to stream through it via iteration.
2. What does the JsonToDel tests provide in addition to the gold/query test
cases?
In addition, "eval" is not validated. I propose that we go with the gold/query
test
cases.
3. In general, lets try to group related tests. The file and test cases for
stdout
is too trivial to warrant its own test files. Furthermore, the del tests should
be
moved to the storage tests-- there you will already see handling for multiple
formats. stdout should go there as well since all other stream tests are there.
4. A common use case is a single, simple pipeline. For this case, lets make it
easy
on the user to output del if they want to. It is understood that this will not
always be meaningful (e.g., multiple statements and arbitrarily nested data).
This
does not seem to be a common operation in Python/Ruby so whether they support
this
feature or not is not an argument for our case. The changes that you made
should
remain and in addition, given a cmd-line switch, change the output format to be
del.
Original comment by vuk.erce...@gmail.com
on 12 Oct 2009 at 6:37
This code review is finished.
Original comment by yaojingguo@gmail.com
on 14 Oct 2009 at 2:39
This code review is finished.
Original comment by yaojingguo@gmail.com
on 14 Oct 2009 at 2:39
Original issue reported on code.google.com by
yaojingguo@gmail.com
on 28 Sep 2009 at 1:33