tensorflow / java

Java bindings for TensorFlow
Apache License 2.0

Converting TensorFlow markdown to JavaDoc text in op_generator #213

Open JimClarke5 opened 3 years ago

JimClarke5 commented 3 years ago

@karllessard @Craigacp

I have been experimenting with converting the TF Markdown text to JavaDoc format in the op_generator code. I did this by creating another C++ class that calls out to Python using the Python C library. It runs the Python marko package with my own marko renderer class, javadoc_renderer.JavaDocRenderer, which converts Markdown to JavaDoc. The C++ class SourceWriter calls this Python code to convert the Markdown text, and the converted JavaDoc is then written out to the generated class.

Here is an example of the old and new generated JavaDoc for org.tensorflow.op.math.Abs:

Current JavaDoc:

/**
 * Computes the absolute value of a tensor.
 * <p>
 * Given a tensor `x`, this operation returns a tensor containing the absolute
 * value of each element in `x`. For example, if x is an input element and y is
 * an output element, this operation computes \\(y = |x|\\).
 * 
 * @param <T> data type for {@code y()} output
 */

New JavaDoc:

/**
 * <p>Computes the absolute value of a tensor.</p>
 * <p>
 * <p>Given a tensor <code>x</code>, this operation returns a tensor containing the absolute
 * value of each element in <code>x</code>. For example, if x is an input element and y is
 * an output element, this operation computes \(y = |x|\).</p>
 * 
 * @param <T> data type for {@code y()} output
 */
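The difference between the two comments above can be approximated with a few lines of plain Python. This is only a minimal sketch using the standard library, not the marko-based javadoc_renderer.JavaDocRenderer described earlier. The function names markdown_to_javadoc and as_javadoc_comment are hypothetical, and only paragraphs, inline code spans and doubled backslashes are handled.

```python
import html
import re


def markdown_to_javadoc(markdown: str) -> str:
    """Convert a small Markdown subset to JavaDoc-friendly HTML (hypothetical helper)."""
    paragraphs = []
    # Blank lines separate Markdown paragraphs.
    for block in re.split(r"\n\s*\n", markdown.strip()):
        # Escape &, < and > so they survive as literal text in JavaDoc.
        text = html.escape(block, quote=False)
        # Inline code spans: `x` becomes <code>x</code>.
        text = re.sub(r"`([^`]+)`", r"<code>\1</code>", text)
        # An escaped backslash in Markdown (two characters) renders as one.
        text = text.replace("\\\\", "\\")
        paragraphs.append("<p>" + text + "</p>")
    return "\n".join(paragraphs)


def as_javadoc_comment(body: str) -> str:
    """Wrap already-converted text in a /** ... */ block (hypothetical helper)."""
    lines = [" * " + line if line else " *" for line in body.split("\n")]
    return "\n".join(["/**"] + lines + [" */"])


if __name__ == "__main__":
    description = (
        "Computes the absolute value of a tensor.\n\n"
        "Given a tensor `x`, this operation computes " + r"\\(y = |x|\\)."
    )
    print(as_javadoc_comment(markdown_to_javadoc(description)))
```

A real renderer also has to deal with lists, fenced code blocks and links, which is the reason for delegating to a Markdown parser rather than regular expressions.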

The JavaDoc output still needs some tweaks, such as the <p> left on a line by itself. I am also still chasing down an infrequent error where the converted string gets garbled.
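One possible way to handle the stray <p> is a post-processing pass over the generated comment. The sketch below is an assumption about how that could look, not the actual fix. The helper name drop_redundant_paragraph_tags is hypothetical. It removes a bare <p> line only when the next line already opens its own paragraph, so comments in the old style are left alone.

```python
import re

# A comment line that contains nothing but "<p>".
BARE_P = re.compile(r"^\s*\*\s*<p>\s*$")
# A comment line whose content begins with "<p>".
STARTS_WITH_P = re.compile(r"^\s*\*\s*<p>")


def drop_redundant_paragraph_tags(javadoc: str) -> str:
    """Remove bare <p> lines that are followed by a line opening its own <p>."""
    lines = javadoc.split("\n")
    kept = []
    for i, line in enumerate(lines):
        following = lines[i + 1] if i + 1 < len(lines) else ""
        if BARE_P.match(line) and STARTS_WITH_P.match(following):
            continue  # the next line already starts a paragraph
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    generated = "\n".join([
        "/**",
        " * <p>Computes the absolute value of a tensor.</p>",
        " * <p>",
        " * <p>Given a tensor <code>x</code>, ...</p>",
        " */",
    ])
    print(drop_redundant_paragraph_tags(generated))
```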

I have made several design decisions that should probably be discussed. For example, I put my Python module in bazel-bin and point the PYTHONPATH to it in build.sh.

env PYTHONPATH=:$BAZEL_BIN/markdown_javadoc $BAZEL_BIN/java_op_generator \
    --output_dir=$GEN_SRCS_DIR \
    --api_dirs=$BAZEL_SRCS/external/org_tensorflow/tensorflow/core/api_def/base_api,src/bazel/api_def \
    $TENSORFLOW_LIB

Also, I cannot figure out how to bring the Python library from the framework into the BUILD file. For now, I have it hard-coded.

tf_cc_binary(
    name = "java_op_generator",
    linkopts = select({
        "@org_tensorflow//tensorflow:windows": [],
        "//conditions:default": [
            "-lm",
            "-L/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/config-3.7m-darwin",
            "-lpython3.7"
          ],
        }),
    deps = [
        ":java_op_gen_lib",
    ],
)

Any help on setting up the Bazel rules to include the Python library would be appreciated.
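Until there is a proper Bazel rule, the hard-coded path could at least be derived from the interpreter itself. The sketch below is a hedged illustration that asks the standard sysconfig module for the library directory and version suffix of the running Python. The function name python_embed_link_flags is hypothetical, and how the result is fed into the build (for example from build.sh) is left open.

```python
import sysconfig


def python_embed_link_flags() -> list:
    """Return linker flags for embedding the interpreter running this script."""
    # LIBDIR may be missing on some platforms (for example Windows).
    lib_dir = sysconfig.get_config_var("LIBDIR")
    # LDVERSION includes ABI suffixes such as "m"; fall back to the plain version.
    ld_version = sysconfig.get_config_var("LDVERSION") or sysconfig.get_python_version()
    flags = []
    if lib_dir:
        flags.append("-L" + lib_dir)
    flags.append("-lpython" + ld_version)
    return flags


if __name__ == "__main__":
    # Prints something of the form: -L<libdir> -lpython<version>
    print(" ".join(python_embed_link_flags()))
```

The python3-config --ldflags command reports similar information from the shell. Either way the flags depend on the machine doing the build, so this only replaces the hard-coded path and does not make the build hermetic.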

I did find @org_tensorflow//third_party/python_runtime:headers, which I added as a dependency in the cc_library section of BUILD. This allowed me to compile the C++ code with the Python.h header.

cc_library(
    name = "java_op_gen_lib",
    srcs = [
        "src/bazel/op_generator/op_gen_main.cc",
        "src/bazel/op_generator/op_generator.cc",
        "src/bazel/op_generator/op_specs.cc",
        "src/bazel/op_generator/source_writer.cc",
        "src/bazel/op_generator/markdown_javadoc.cc",
    ],
    hdrs = [
        "src/bazel/op_generator/java_defs.h",
        "src/bazel/op_generator/op_generator.h",
        "src/bazel/op_generator/op_specs.h",
        "src/bazel/op_generator/source_writer.h",
        "src/bazel/op_generator/markdown_javadoc.h",
    ],
    copts = tf_copts(),
    deps = [
        "@org_tensorflow//tensorflow/core:framework",
        "@org_tensorflow//tensorflow/core:lib",
        "@org_tensorflow//tensorflow/core:op_gen_lib",
        "@org_tensorflow//tensorflow/core:protos_all_cc",
        "@org_tensorflow//third_party/python_runtime:headers",
        "@com_googlesource_code_re2//:re2",
    ],
)

I can create a draft PR if you want to look at the whole project, so we can iterate on some of the design decisions and figure out how to link with the Python C library in a Bazel-friendly way.

saudet commented 3 years ago

BTW, the only reason we're doing this in Bazel in the first place is because the original op generator was written in C++. We can and should rewrite it in Java, and then when that's done AFAIK we won't need to do anything with Bazel or C++ anymore. In fact, we can also call CPython functions very easily from Java using JavaCPP without having to deal with anything C++: https://github.com/bytedeco/javacpp-presets/tree/master/cpython#the-simplejava-source-file

karllessard commented 3 years ago

@saudet I agree, but rewriting the op generator in Java won't be an easy task. If Jim has something close to working in C++, we should give it a try.

The thing is that even if we can reformat it correctly, the code examples in the docs will still be in Python. I know there are 1000+ ops available in TF, but the right way to completely transcribe the Markdown to JavaDoc would be to rewrite it in our API definitions (under src/bazel/api_def, which is what they are meant for).

@JimClarke5, what do you think of running your script just once to generate the docs in our API def files, and then slowly but surely fixing the issues manually as we find them? We could also run that script to generate the docs for new ops every time we upgrade the TF runtime version.

JimClarke5 commented 3 years ago

@karllessard There were some issues when I originally tried to write it back out to src/bazel/api_def; the main one was parsing the api_def file. op_generator does parse the api_def file with C++ code. I will look at doing a 100% Java program using ANTLR. I will need to parse the api_def file, then parse the Markdown.

saudet commented 3 years ago

I'm pretty sure it just uses protobuf to parse all this, which we can use easily enough from Java, or am I missing something?

JimClarke5 commented 3 years ago

It is protobuf-like, but not 100% the same. I found grammars-v4/protobuf3/Protobuf3.g4 on the ANTLR4 site, but there is no op or op_name defined in that grammar.
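If the mismatch is the multi-line string syntax that api_def files use for long descriptions (description: <<END ... END), a small pre-processing step might be enough to turn the file into ordinary protobuf text format before handing it to a standard parser. The sketch below rests on that assumption. The function name heredocs_to_strings and the sample input are hypothetical.

```python
import re

# Matches "field: <<TAG", then the body, then a line holding only TAG.
HEREDOC = re.compile(
    r"^(?P<indent>[ \t]*)(?P<field>\w+):[ \t]*<<(?P<tag>\w+)\n"
    r"(?P<body>.*?)\n[ \t]*(?P=tag)[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def _quote(text: str) -> str:
    """Escape backslashes, quotes and newlines for a text-format string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def heredocs_to_strings(api_def_text: str) -> str:
    """Rewrite every <<TAG ... TAG block as a single quoted string field."""
    def replace(match):
        return match.group("indent") + match.group("field") + ": " + _quote(match.group("body"))
    return HEREDOC.sub(replace, api_def_text)


if __name__ == "__main__":
    sample = (
        "op {\n"
        '  graph_op_name: "Abs"\n'
        "  description: <<END\n"
        "Given a tensor `x`, this operation returns a tensor.\n"
        "END\n"
        "}\n"
    )
    print(heredocs_to_strings(sample))
```

After this step the description is available as a plain string field, which could then be passed to the Markdown conversion.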