omkardash / jaql

Automatically exported from code.google.com/p/jaql

Support NON-JSON streaming out in Jaql Shell #45

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Vuk's Original Design

I think this would be a useful feature, in particular when jaql is used in
batch mode to feed other programs with the output produced by jaql
scripts. I don't think it makes much sense for interactive mode, but
let me know what you think about this.

If one calls the shell in batch mode, all outputs should be written in
the given format, say CSV, XML, etc. My guess is that this will be
particularly useful for the case where there is only one query and the user
wants the output piped to another program.

Now, how to specify the format? One option is, as you suggest, to add an
option to the shell (e.g., --format csv) that forces top-level writes to be
formatted accordingly. As you've seen in Jaql, I/O is specified through
descriptors (e.g., {type: 'local', location: 'foo.json', options: {adapter:
"...", format: "..."}}); you can see examples in
conf/storage-default.jql or src/test/com/ibm/jaql/storageQueries.txt. One
option is to have the argument to --format map to a corresponding I/O
descriptor. For example, the argument to --format (e.g., 'csv') could be
the key in the storage registry (e.g., storage-default.jql).
Instead of the FileStreamOutputAdapter, I'd use a StreamOutputAdapter
that you bind to System.out; then everything should work as if writing to
any stream. In summary, the things to do are:

   1. Add an option to the jaql shell for --format.

   2. In JaqlShell, if --format is provided, and so long as we're in batch
      mode, the format is valid, and the format derives from
      StreamOutputAdapter, set up a StreamOutputAdapter that is bound to
      System.out, initialized properly, etc.

   3. Of course, to get this to work with CSV, you'll have to add a CSV
      entry to storage-default.jql first (I'd play with this first in
      interactive mode, then move to the other tasks).
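
The three steps above can be sketched roughly as follows. This is only an illustration, not Jaql code: FormatOptionSketch, StreamWriter, register and resolve are hypothetical names, and the map stands in for the storage registry. It shows the intended rule that a --format key only takes effect in batch mode and must name a known entry, after which the writer is bound to System.out.

```java
import java.io.PrintStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch; none of these names are real Jaql classes. */
class FormatOptionSketch {
    /** Stand-in for a stream output adapter: formats one value onto a stream. */
    interface StreamWriter {
        void write(Object value, PrintStream out);
    }

    /** Stand-in for the storage registry (the entries in storage-default.jql). */
    private final Map<String, StreamWriter> registry = new LinkedHashMap<>();

    void register(String key, StreamWriter writer) {
        registry.put(key, writer);
    }

    /**
     * Returns the writer for top-level output, or empty when the shell should
     * keep its default printing (interactive mode, or no --format given).
     */
    Optional<StreamWriter> resolve(boolean batchMode, String format) {
        if (!batchMode || format == null) {
            return Optional.empty();
        }
        StreamWriter writer = registry.get(format);
        if (writer == null) {
            throw new IllegalArgumentException("unknown format: " + format);
        }
        return Optional.of(writer);
    }

    public static void main(String[] args) {
        FormatOptionSketch shell = new FormatOptionSketch();
        // Toy "csv" entry; a real adapter would serialize Jaql values.
        shell.register("csv", (value, out) -> out.println(value));
        // Batch mode with --format csv: the writer is bound to System.out.
        shell.resolve(true, "csv").ifPresent(w -> w.write("a,b,c", System.out));
    }
}
```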

Next, do we only want to support options that have an entry in
storage-default.jql? What if I have a new format I want to use? Given
jaql's architecture, this shouldn't be a big deal. Just like when an I/O
descriptor is given for read/write, one could conceive of passing such a
descriptor to "--format". This will make argument parsing a bit trickier
(we will need to parse JSON: if it's a string, then it's a key; if it's a
record, it's a descriptor), but it's doable. What do you think of this extra
generality?
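
A minimal sketch of that key-versus-descriptor decision, under a loud simplification: instead of really parsing JSON, it only looks at the first non-blank character, treating a leading "{" as a record and anything else as a registry key. FormatArgSketch, Kind, classify and keyOf are hypothetical names used for illustration only.

```java
/** Hypothetical sketch; a real implementation would use Jaql's JSON parser. */
class FormatArgSketch {
    enum Kind { REGISTRY_KEY, DESCRIPTOR }

    /** A record (starts with '{') is a descriptor; anything else is a key. */
    static Kind classify(String arg) {
        String s = arg.trim();
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty --format argument");
        }
        return s.startsWith("{") ? Kind.DESCRIPTOR : Kind.REGISTRY_KEY;
    }

    /** Strips optional surrounding quotes so csv, 'csv' and "csv" name the same key. */
    static String keyOf(String arg) {
        String s = arg.trim();
        if (s.length() >= 2) {
            char first = s.charAt(0);
            char last = s.charAt(s.length() - 1);
            if ((first == '\'' || first == '"') && first == last) {
                return s.substring(1, s.length() - 1);
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(classify("csv"));
        System.out.println(classify("{type: 'local', location: 'foo.json'}"));
        System.out.println(keyOf("'csv'"));
    }
}
```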

Let me know if some of the above doesn't make sense. 

Original issue reported on code.google.com by yaojingguo@gmail.com on 22 Sep 2009 at 2:37

GoogleCodeExporter commented 9 years ago
Design Summary

1 NON-JSON streaming out is only supported in batch mode.
2 Option "--format" is added to Jaql Shell. Three output formats 
  (json, csv and xml) are pre-defined. But general Jaql IO 
  descriptors are also supported.

Original comment by yaojingguo@gmail.com on 22 Sep 2009 at 2:43

GoogleCodeExporter commented 9 years ago
Another way to get this functionality is to support writing to stdout (e.g., a
print function). To me, this approach sounds more attractive because it can
reuse everything we did for serialization. Also, when printing the output of a
script as CSV, you probably don't want the output of all the statements, but
only some of them. Using print would help here.

Original comment by Rainer.G...@gmx.de on 22 Sep 2009 at 8:32

GoogleCodeExporter commented 9 years ago
In either case, this feature should be implemented using the serialization
framework. It is basically a global switch for the shell's output format. I
think the intention here was to make it convenient for small, ad-hoc scripts
(probably one expression) to dump their output as CSV (or some other
format).

Implementing this feature so that it is easily reusable (e.g.,
write(stdout())) is a good idea.

Speaking of the serialization framework, we should unify it with the
StreamAdapters (think of these as types of stream factories plus formatting
functionality). This is a topic for another item, however:
http://code.google.com/p/jaql/issues/detail?id=47

Original comment by vuk.erce...@gmail.com on 23 Sep 2009 at 2:01

GoogleCodeExporter commented 9 years ago
Two Approaches
================

I describe the new approach in the following text.
- Add a function instead of providing an output format option when
  launching Jaql Shell.
  - This gives us more flexibility to control the output format.
    But there is a small problem. In the current implementation,
    Jaql Shell prints the evaluated value of a JSON expression.
    If write(stdout("xml")) is called, two formats of data (JSON
    and XML) will be sent to STDOUT. This also applies to
    situations where stdout("xml") and stdout("csv") are used in
    the same script. Do we really want this kind of flexibility?
  - And since the write function can take an IO descriptor as a
    parameter, Jaql users can still pass an IO descriptor for a
    new format beyond the pre-defined formats.
- Add the stdout function with minArgs = 0 and maxArgs = 1. The
  only parameter specifies the output format (json, csv or xml).
  The default is json. write(stdout()) will write the content to
  STDOUT. We may need to tweak the write function since it
  currently prints the IO descriptor to STDOUT.
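
To make the proposed argument handling concrete, here is a rough Java sketch of a stdout function that accepts zero or one argument and defaults to json. StdoutFnSketch and the shape of the returned descriptor (the "type" and "format" keys) are assumptions made for illustration; the actual descriptor layout is not specified in this thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of stdout(format?) with minArgs = 0 and maxArgs = 1. */
class StdoutFnSketch {
    static final List<String> FORMATS = Arrays.asList("json", "csv", "xml");

    /** Builds an illustrative descriptor; the key names are assumptions. */
    static Map<String, String> stdout(String... args) {
        if (args.length > 1) {
            throw new IllegalArgumentException("stdout takes at most one argument");
        }
        String format = args.length == 0 ? "json" : args[0]; // default is json
        if (!FORMATS.contains(format)) {
            throw new IllegalArgumentException("unsupported format: " + format);
        }
        Map<String, String> descriptor = new LinkedHashMap<>();
        descriptor.put("type", "stdout");
        descriptor.put("format", format);
        return descriptor;
    }

    public static void main(String[] args) {
        System.out.println(stdout());
        System.out.println(stdout("csv"));
    }
}
```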

I prefer the original approach. But we could make the stdout function
act as a global switch for the shell's output format. For example, if
stdout("csv") is called, all subsequent output will be in csv format.

Other
==============
- JsonOutputStream and JsonTextOutputStream should not be used.
  These 2 classes also use the serialization framework.
  JsonTextOutputStream uses JsonUtil, which in turn uses the
  serialization framework. JsonOutputStream uses
  DefaultBinaryFullSerializer directly. To Rainer: could you
  share any documentation for the serialization framework with me,
  if you have some? To Vuk: could you explain more about how to use
  the serialization framework directly?

- For CSV support, I want to reuse the mechanism in 
  ToDelConverter.

- For the XML output format, I want to reuse existing libraries.
  I found that the following 2 libraries provide JSON-to-XML
  conversion functionality.
  - http://www.json.org/java/index.html
  - http://json-lib.sourceforge.net/usage.html
  Do you have any suggestions with regard to these libraries? Do
  you have any suggestions for the XML representation of JSON?

Original comment by yaojingguo@gmail.com on 23 Sep 2009 at 3:56

GoogleCodeExporter commented 9 years ago
write(stdout(...)) sounds good. The main advantage that I see is that you can
control the output. This is important in the main use case of the CSV feature,
that is, piping Jaql results into other programs. Other opinions?

Original comment by Rainer.G...@gmx.de on 23 Sep 2009 at 4:19

GoogleCodeExporter commented 9 years ago
Yes, the stream converter classes use serializers. The issue raised is whether
or not we should consider a different design where the stream converters are
types of serializers instead of some other interface (converter).

Regarding XML, we've taken a stab at a conversion from XML to JSON (see 
com.ibm.jaql.lang.expr.xml.XmlToJsonFn). It would be useful if the writer was 
consistent with this reader.

Original comment by vuk.erce...@gmail.com on 23 Sep 2009 at 5:51

GoogleCodeExporter commented 9 years ago
Yes, after investigating Jaql functions further, I agree to
go with the function approach. I have begun the implementation
this way.

Original comment by yaojingguo@gmail.com on 24 Sep 2009 at 3:28

GoogleCodeExporter commented 9 years ago
The following 2 functions are added to support JSON streaming 
output in XML and CSV formats.
- jsonToCsv is for JSON streaming output in CSV format
- jsonToXml is for JSON streaming output in XML format
Jaql users can use these 2 functions in both interactive mode and
batch mode. With these 2 functions, Jaql users have full
control of JSON output.

Original comment by yaojingguo@gmail.com on 28 Sep 2009 at 1:13

GoogleCodeExporter commented 9 years ago
Fixed in Revision 397.
Summary of changes.

1. Functions jsonToDel and jsonToXml are added.

2. json, del and xml registry entries are added to storage-default.jaql.
3. Add the following option to JAQL shell
 -o (--outoptions) <outoptions>    output options: json, del and xml or an
                                   output IO descriptor. This option is
                                   ignored when not running in batch mode.

3.1 In batch mode, the JAQL shell prompt and the echoing of input JAQL queries
are disabled. If no output options are provided, the JAQL shell prints outputs
in the same format as in interactive mode. If output options are provided, the
JAQL shell uses a file descriptor to print output. Any file descriptor can be
specified.

jaqlshell -b data.json: prints the output in the same format as in
  interactive mode.
jaqlshell -b -o json data.json: uses the json file descriptor to print output
  in json format.
jaqlshell -b -o del data.json: uses the del file descriptor to print output
  in del format.
jaqlshell -b -o "{type: 'del', outoptions: {fields: ['name', 'age']}}" data.json:
  uses the del file descriptor with field names specified.
jaqlshell -b -o xml data.json: uses the xml file descriptor to print output
  in xml format.
jaqlshell -b -o "{type: 'local', location: 'abc'}" data.json: uses the given
  file descriptor to print output into a local file.

4. Rename the existing def function to defHdfs.
5. The output printing logic of Jaql has been extracted to JaqPrinter.
6. ToDelConverter has been moved from vendor directory to src/java directory.
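
As an illustration of what the del output with a fields option amounts to, here is a small Java sketch that projects a record onto the listed fields and joins them with commas. It is not ToDelConverter: DelLineSketch, toLine and quote are hypothetical names, and the quoting rule (wrap values containing a comma, quote or newline, doubling embedded quotes) is the common CSV convention, assumed here rather than taken from Jaql.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of delimited output; not Jaql's ToDelConverter. */
class DelLineSketch {
    /** Emits the values of the requested fields, in order, as one line. */
    static String toLine(Map<String, ?> record, List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            Object value = record.get(fields.get(i));
            // Missing or null fields become empty cells.
            sb.append(value == null ? "" : quote(String.valueOf(value)));
        }
        return sb.toString();
    }

    /** Quotes a cell only when it contains a delimiter, quote or newline. */
    static String quote(String s) {
        boolean needsQuotes = s.contains(",") || s.contains("\"") || s.contains("\n");
        return needsQuotes ? "\"" + s.replace("\"", "\"\"") + "\"" : s;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("name", "Ann");
        record.put("age", 30);
        record.put("city", "Zurich"); // not in the field list, so it is dropped
        System.out.println(toLine(record, Arrays.asList("name", "age"))); // prints Ann,30
    }
}
```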

Original comment by yaojingguo@gmail.com on 16 Oct 2009 at 3:19