srstevenson / nb-clean

Clean Jupyter notebooks for version control. Remove metadata, outputs, and execution counts with Git and pre-commit support.
https://pypi.org/project/nb-clean
ISC License
135 stars 18 forks source link

add remove metadata #169

Closed yasirroni closed 1 year ago

yasirroni commented 1 year ago

Add new argument for clean and filter "-M" or "--remove-notebook-metadata" to "remove notebook metadata.

    clean_parser.add_argument(
        "-M",
        "--remove-notebook-metadata",
        action="store_true",
        help="remove notebook metadata",
    )

re-PR from https://github.com/srstevenson/nb-clean/pull/163

yasirroni commented 1 year ago

Fails due to pylint:

poetry run pylint src  
************* Module nb_clean
src\nb_clean\__init__.py:147:0: R0913: Too many arguments (6/5) (too-many-arguments)

-----------------------------------
Your code has been rated at 9.95/10
yasirroni commented 1 year ago

Maybe refactor to keyword-only-argument? See SO answer and PEP 3102.

yasirroni commented 1 year ago

Any update, @srstevenson

The fails is not mandatory and can be fixed by making all args as keyword-only-argument

srstevenson commented 1 year ago

I've held off from merging this because I'm not yet convinced there's a real user requirement for it. Most of the notebook metadata doesn't change frequently, and so doesn't contribute to the noise in VCS diffs which nb-clean was written to address.

yasirroni commented 1 year ago

Different editor such as Google Colab vs VSCode and different Python version across user produce different metadata. That is the use case.

Not convinced enough?

haplav commented 1 year ago

@srstevenson:

Most of the notebook metadata doesn't change frequently, and so doesn't contribute to the noise in VCS diffs which nb-clean was written to address.

I would like to support the idea of this PR (can't yet assess its quality wrt. nb-clean coding standards). As I mentioned in https://github.com/srstevenson/nb-clean/issues/157#issuecomment-1401977349, the notebook-level metadata actually do make noise in git commits.

For example, this metadata gets changed right after opening in VS Code without even running the notebook, and populated with things like conda environment name or some weird hash.

For me, the ability to filter out those is entirely in the spirit of nb-clean.

Example

Note: nb-clean was run after each step

1) initial state

 "metadata": {
  "language_info": {
   "name": "python"
  },

2) after opening in web browser and saving

 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },

3) after opening in VS Code and saving

 "metadata": {
  "kernelspec": {
   "display_name": "scratch",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  },
  "vscode": {
   "interpreter": {
    "hash": "8951b5bc4efd704e8ce7b1700f3d36aa2c5348526bfde6695033584eb49dd3bb"
   }
  }
 },

This is on the same machine using the same conda environment (!) Needless to say that if my colleague happens to open the notebook on his computer, edit some stuff, save and do a git commit with the nb-clean hook, the commit will include his edits PLUS several metadata changes.

yasirroni commented 1 year ago

@haplav

"metadata": { "language_info": { "name": "python" },

Since notebook metadata language_info is indeed needed, current PR remove this metadata. How about add other parameter such as --add-language-metadata NAME, with default name to Python?

yasirroni commented 1 year ago

Another approach is to change this PR into --preserve-metadata language_info metadata2 metadata3. But, support of hierarchy of metadata sometimes are needed.

But, based on your example, the language_info is destroyed and replaced with codemirror_mode. That is another problem. And that is not what happen on my side. Here is my case: Original:

 "metadata": {
  "language_info": {
   "name": "python"
  }

After VSCode open:

 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.10 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.8.10"
  },
  "vscode": {
   "interpreter": {
    "hash": "..."
   }
  }
haplav commented 1 year ago

"metadata": { "language_info": { "name": "python" },

Since notebook metadata language_info is indeed needed, current PR remove this metadata. How about add other parameter such as --add-language-metadata NAME, with default name to Python?

I don't understand @yasirroni, I don't mind having this in the cleaned notebook with hard-wired "name": "python". It seems rather unlikely this would need to be changed.

haplav commented 1 year ago

Another approach is to change this PR into --preserve-metadata language_info metadata2 metadata3. But, support of hierarchy of metadata sometimes are needed.

I don't think we need to complicate things right now.

But, based on your example, the language_info is destroyed and replaced with codemirror_mode. That is another problem. And that is not what happen on my side. Here is my case: Original:

Ah OK, I think you read it wrong; language_info is NOT destroyed; codemirror_mode gets just ADDED to it.

haplav commented 1 year ago

Please let GitHub apply suggestions and resolve discussions automatically as described in https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/incorporating-feedback-in-your-pull-request

yasirroni commented 1 year ago

Please let GitHub apply suggestions and resolve discussions automatically as described in https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/incorporating-feedback-in-your-pull-request

Did not know that feature exist. Sorry, Sir.

yasirroni commented 1 year ago

Ah OK, I think you read it wrong; language_info is NOT destroyed; codemirror_mode gets just ADDED to it.

Yeah, but the structure changes, since name no longer the child of language_info.

"metadata": { "language_info": { "name": "python" },

Since notebook metadata language_info is indeed needed, current PR remove this metadata. How about add other parameter such as --add-language-metadata NAME, with default name to Python?

I don't understand @yasirroni, I don't mind having this in the cleaned notebook with hard-wired "name": "python". It seems rather unlikely this would need to be changed.

But, I also works in Octave and octave-based notebook did not use that metadata. Here is a new notebook in Octave created using jupyterlab:

 "metadata": {
  "kernelspec": {
   "display_name": "Octave",
   "language": "octave",
   "name": "octave"
  }
 },
yasirroni commented 1 year ago

I think, two options might be good.

Hard clean: like this PR. Soft clean: like this PR, but skip for

"metadata": {
  "kernelspec": {
   "language": "..."
  },
  "language_info": {
   "name": "...",
  },
 },
haplav commented 1 year ago

Yeah, but the structure changes, since name no longer the child of language_info.

No. Please, take a good look - name is contained in kernelspec and codemirror_mode but also still directly in language_info:

 "metadata": {
  "kernelspec": {
   ...
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "python",
    ...
   },
   ...
   "name": "python",
   ...
  },
  ...
 },

I think, two options might be good.

Once again, I'm fine with keeping just metadata/language_info/name (if already present). It seems to be the same in all cases you and I mentioned. But it doesn't make any sense to preserve kernelspec exactly because

But, I also works in Octave and octave-based notebook did not use that metadata. Here is a new notebook in Octave created using jupyterlab:

 "metadata": {
  "kernelspec": {
   "display_name": "Octave",
   "language": "octave",
   "name": "octave"
  }
 },

It's changed based on what interpreter the kernel uses. Keeping just empty metadata, as this PR already does, looks also fine to me. I don't think any interpreter would depend on having anything in there.

yasirroni commented 1 year ago

No. Please, take a good look - name is contained in kernelspec and codemirror_mode but also still directly in language_info:

@haplav oops. i'm sorry. It will be easier than. _ Okay, maybe just let this PR as it is and later add feature to skip language_info

haplav commented 1 year ago

@srstevenson any chance to get this merged?

github-actions[bot] commented 1 year ago

This PR was closed due to inactivity. Please reopen if still relevant.

yasirroni commented 3 months ago

Any chance this can be re-opened? I'm willing to fix merge conflict. @srstevenson

srstevenson commented 3 months ago

163 is still open with the same branch, so let's use that PR instead.