kynan / nbstripout

strip output from Jupyter and IPython notebooks
Other
1.19k stars 95 forks source link

added support for stripping output from Zeppelin Notebooks #130

Closed ankitrokdeonsns closed 3 years ago

ankitrokdeonsns commented 4 years ago

Zeppelin Notebooks are popular especially while working with Apache Spark Framework. Since the basic structure of the IPython / Jupyter and Zeppelin Notebooks is similar it would be cool to provide nbstripout support for Zeppelin Notebooks as well.

kynan commented 4 years ago

Thank you for the pull request. There's a few points to discuss:

  1. Your PR only addresses the code path of calling nbstripout with a file argument. There is also the (more common) way of reading from stdin, which is used in the git filter.

  2. To handle the case of reading from stdin we need to either

    • detect the file type or
    • specify the file type using a flag

    My preference is being explicit using a flag e.g. --format which defaults to jupyter

  3. We'd need to decide how to handle the various flags nbstripout supports. Do they have equivalents for Zeppelin? Does Zeppelin have execution counts? Or notebook and cell level metadata? Tags?

    If it doesn't make sense to support these flags, what do we want to do when running in "Zeppelin mode" and an incompatible flag is passed? Ignore silently? Ignore with a warning? Fail?

Please also add tests and update the documentation as part of this pr.

ankitrokdeonsns commented 4 years ago

Thank you for the pull request. There's a few points to discuss:

  1. Your PR only addresses the code path of calling nbstripout with a file argument. There is also the (more common) way of reading from stdin, which is used in the git filter.

Correct. Zeppelin notebooks (v0.9+) have *.zpln extension which can be used for setting up the git filter. Will be doing so in upcoming commits.

  1. To handle the case of reading from stdin we need to either

    • detect the file type or
    • specify the file type using a flag

    My preference is being explicit using a flag e.g. --format which defaults to jupyter

.zpln extension should take care of this. For older Zeppelin notebooks it would be a little more tricky since they are stored as note.json files

  1. We'd need to decide how to handle the various flags nbstripout supports. Do they have equivalents for Zeppelin? Does Zeppelin have execution counts? Or notebook and cell level metadata? Tags? If it doesn't make sense to support these flags, what do we want to do when running in "Zeppelin mode" and an incompatible flag is passed? Ignore silently? Ignore with a warning? Fail?

Yes some similar metadata is present like last execution time. I will update more on this.

Please also add tests and update the documentation as part of this pr.

Will do.

ankitrokdeonsns commented 4 years ago

As far as I know Zeppelin does not have support for adding metadata or tags to cells. So keep-output is currently not possible for Zeppelin notebooks. Zeppelin notebooks also do not have execution count. So keep-count also cant be supported. We can ignore such things silently for zeppelin notebooks. At least for now.

ankitrokdeonsns commented 4 years ago

Structure of Zeppelin Notebook cell:

{
      "text": "%md\n# Hello",
      "user": "anonymous",
      "dateUpdated": "2020-07-08 16:46:54.732",
      "config": {
        "colWidth": 12.0,
        "fontSize": 9.0,
        "enabled": true,
        "results": {},
        "editorSetting": {
          "language": "text",
          "editOnDblClick": false,
          "completionKey": "TAB",
          "completionSupport": true
        },
        "editorMode": "ace/mode/text"
      },
      "settings": {
        "params": {},
        "forms": {}
      },
      "results": {},
      "apps": [],
      "progressUpdateIntervalMs": 500,
      "jobName": "paragraph_1594203861277_1152517137",
      "id": "paragraph_1594203861277_1152517137",
      "dateCreated": "2020-07-08 15:54:21.277",
      "dateStarted": "2020-07-08 16:46:54.740",
      "dateFinished": "2020-07-08 16:46:54.854",
      "status": "FINISHED"
    }

We could consider to clear keys such as dateCreated, dateStarted, dateFinished, status while we clear the output. What do you think?

kynan commented 4 years ago

That sounds reasonable! I'm still in favor of having a flag to --format which defaults to jupyter and also supports zeppelin, since that will work both with file input and reading from stdin.

ankitrokdeonsns commented 4 years ago

Ok. Currently In My solution --format is not necessarily required but we could use that for supporting older versions of zeppelin notebooks. I will add that flag as you suggested. 👍

ankitrokdeonsns commented 4 years ago

Sure. I was busy last week. I will add the test tomorrow and update the pull request.

ankitrokdeonsns commented 4 years ago

I have incorporated changes as you suggested. Added tests and updated documentation. I believe now it is ready for merge. Do let me know if I missed anything.

ankitrokdeonsns commented 4 years ago

Hi @kynan I believe this PR is ready for merge.

ankitrokdeonsns commented 3 years ago

incorporated latest requested changes

ankitrokdeonsns commented 3 years ago

Good catch @kynan. I think this remained due to bad merging during rebase. Thanks. Deleted the extra block.