marimo-team / marimo

A reactive notebook for Python — run reproducible experiments, execute as a script, deploy as an app, and version with git.
https://marimo.io
Apache License 2.0
6.14k stars, 200 forks

Inquiry about a jupytext-like tool for Marimo notebooks. #2124

Closed hongyi-zhao closed 2 weeks ago

hongyi-zhao commented 2 weeks ago

Dear Marimo Development Team,

I am a user of Marimo notebooks and have been impressed with their functionality and ease of use. I am writing to inquire about the possibility of developing a tool similar to jupytext for Marimo notebooks.

Specifically, I am interested in:

  1. A tool to convert Marimo notebooks to standard Python scripts, removing Marimo-specific syntax while preserving the code and logic.
  2. The ability to convert these Python scripts back into Marimo notebooks, if possible.
  3. Options for exporting Marimo notebooks to other formats (e.g., plain Python, Jupyter notebooks) while maintaining as much functionality as possible.

I understand that Marimo notebooks are already Python files, which simplifies this process compared to other notebook formats. However, a dedicated tool could streamline the conversion process, especially for notebooks with complex UI elements or Marimo-specific features.

Thank you for your time and consideration. I look forward to hearing your thoughts on this matter.

Best regards, Zhao

akshayka commented 2 weeks ago

These are all possible today using the marimo CLI.

A tool to convert Marimo notebooks to standard Python scripts, removing Marimo-specific syntax while preserving the code and logic.

marimo export script notebook.py > script.py

Options for exporting Marimo notebooks to other formats (e.g., plain Python, Jupyter notebooks) while maintaining as much functionality as possible.

marimo export ipynb notebook.py > notebook.ipynb

Additionally, Jupyter to marimo:

marimo convert notebook.ipynb > notebook.py

Python script back to marimo: this requires Jupytext, but is readily done:

  1. jupytext --to notebook test.py to generate test.ipynb
  2. marimo convert test.ipynb > test.py to generate a marimo notebook.
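The two steps above can be chained from Python. This is only a sketch: it assumes both `jupytext` and `marimo` are on your PATH, and the helper names `roundtrip_commands` and `script_to_marimo` are made up for illustration. Writing the result to a separate output file, rather than back over the original script, is a choice made here and not something the steps require.

```python
import subprocess
from pathlib import Path


def roundtrip_commands(script: str) -> list[list[str]]:
    """Return the two CLI invocations from the steps above as argv lists."""
    # jupytext writes the notebook next to the script, with an .ipynb suffix.
    ipynb = str(Path(script).with_suffix(".ipynb"))
    return [
        ["jupytext", "--to", "notebook", script],
        ["marimo", "convert", ipynb],  # prints the marimo notebook to stdout
    ]


def script_to_marimo(script: str, output: str) -> None:
    """Run both steps; hypothetical helper, needs jupytext and marimo installed."""
    to_ipynb, to_marimo = roundtrip_commands(script)
    subprocess.run(to_ipynb, check=True)
    # Equivalent of the shell redirection `> output`.
    with open(output, "w") as handle:
        subprocess.run(to_marimo, check=True, stdout=handle)


# Inspect the commands without running anything.
for command in roundtrip_commands("test.py"):
    print(" ".join(command))
```

Calling `script_to_marimo("test.py", "test_marimo.py")` would then run both tools in order and stop at the first one that fails.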

For more info, run:

  1. marimo export --help — also supports exporting to html and markdown
  2. marimo convert --help

Hope that helps! Please close this issue if it solves your problems, otherwise feel free to follow up.

hongyi-zhao commented 2 weeks ago

Thank you very much for your quick and systematic reply.

Additionally, Jupyter to marimo:

marimo convert notebook.ipynb > notebook.py

I tried the following, but the generated .py file produces many errors when opened in marimo with marimo edit Untitled2-marimo.py:

$ marimo --version
0.8.3

$ marimo convert Untitled2.ipynb > Untitled2-marimo.py

Here are the above-mentioned files: Untitled2.zip.

akshayka commented 2 weeks ago

Are you seeing multiple definition errors? The main constraint that marimo imposes on notebooks is that the same variable cannot be defined in multiple cells. You can make variables local to a cell by prefixing their names with an underscore.
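To make the constraint concrete, here is a rough stand-alone check for names that more than one cell defines. It is a simplified sketch built on the standard `ast` module, not marimo's own analysis: it only looks at top-level assignments, imports, functions, and classes, and it assumes each cell is given as a source string. The function names `defined_names` and `multiply_defined` are made up for illustration.

```python
import ast
from collections import defaultdict


def defined_names(cell_source: str) -> set[str]:
    """Collect the public top-level names that one cell defines."""
    names = set()
    for node in ast.parse(cell_source).body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # An import binds a name too (the first component for `import a.b`).
            for alias in node.names:
                if alias.name != "*":
                    names.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    names.add(target.id)
    # Names with a leading underscore are treated as local to the cell.
    return {name for name in names if not name.startswith("_")}


def multiply_defined(cells: list[str]) -> dict[str, list[int]]:
    """Map each name defined in more than one cell to the indices of those cells."""
    owners = defaultdict(list)
    for index, source in enumerate(cells):
        for name in defined_names(source):
            owners[name].append(index)
    return {name: indices for name, indices in owners.items() if len(indices) > 1}


cells = [
    "x = 1\n_tmp = x + 1",
    "x = 2\n_tmp = x * 2",
    "y = 3",
]
print(multiply_defined(cells))  # {'x': [0, 1]}
```

Here `x` is reported because two cells assign it, while `_tmp` is not, since the underscore prefix keeps it local to each cell.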

See the documentation here: https://docs.marimo.io/guides/reactivity.html

Or run marimo tutorial intro at the command line to learn more.

While this constraint may take some getting used to, it is what enables marimo to do its magic -- to make your notebooks reproducible, reactive, executable as scripts, and runnable as apps.

If you are seeing a different error, let me know what you're seeing and I can help.

hongyi-zhao commented 2 weeks ago

The problematic script is as follows:

import marimo

__generated_with = "0.8.3"
app = marimo.App()

@app.cell
def __():
    from mp_api.client import MPRester

    def get_column_width(docs, attr, header, transform=None):
        return max(len(header), max((len(str(transform(getattr(doc, attr)) if transform else getattr(doc, attr))) for doc in _docs)))
    with _MPRester() as _mpr:
        _docs = _mpr.materials.summary.search(chemsys='Si', fields=['material_id', 'formula_pretty', 'theoretical', 'symmetry', 'energy_per_atom', 'energy_above_hull'])
        _experimental_docs = [doc for doc in _docs if not doc.theoretical]
        _experimental_docs.sort(key=lambda x: x.energy_per_atom)
        columns = [('Material ID', 'material_id', str), ('Formula', 'formula_pretty', str), ('Space Group', 'symmetry', lambda x: x.symbol), ('Crystal System', 'symmetry', lambda x: x.crystal_system), ('Energy per Atom (eV)', 'energy_per_atom', lambda x: f'{x:.4f}'), ('Energy Above Hull (eV)', 'energy_above_hull', lambda x: f'{x:.4f}')]
        widths = [get_column_width(_experimental_docs, attr, header, transform) for header, attr, transform in columns]
        fmt = ' | '.join(['{:<' + str(width) + '}' for width in widths])
        print(f'Found {len(_experimental_docs)} experimentally observed Si structures:\n')
        print(fmt.format(*(column[0] for column in columns)))
        print('-' * (sum(widths) + 3 * (len(columns) - 1)))
        for doc in _experimental_docs:
            print(fmt.format(*(transform(getattr(doc, attr)) for _, attr, transform in columns)))
    return MPRester, columns, doc, fmt, get_column_width, widths

@app.cell
def __():
    from mp_api.client import MPRester
    with _MPRester() as _mpr:
        available_fields = _mpr.materials.summary.available_fields
        print('Available fields for materials.summary:')
        for field in sorted(available_fields):
            print(f'- {field}')
    return MPRester, available_fields, field

@app.cell
def __():
    from mp_api.client import MPRester
    from pymatgen.symmetry.analyzer import SpacegroupAnalyzer
    from pymatgen.io.cif import CifWriter
    with _MPRester() as _mpr:
        _docs = _mpr.materials.summary.search(chemsys='Si', fields=['material_id', 'energy_per_atom', 'nsites', 'formation_energy_per_atom', 'theoretical'])
        _experimental_docs = [doc for doc in _docs if not doc.theoretical]
        lowest_energy_doc = min(_experimental_docs, key=lambda x: x.energy_per_atom)
        lowest_energy_id = lowest_energy_doc.material_id
        structure = _mpr.get_structure_by_material_id(lowest_energy_id)
    sga = SpacegroupAnalyzer(structure)
    spacegroup = sga.get_space_group_symbol()
    spacegroup_number = sga.get_space_group_number()
    refined_structure = sga.get_refined_structure()
    print('Lowest energy experimentally observed Si structure:')
    print(f'Material ID: {lowest_energy_id}')
    print(f'Space group: {spacegroup} (#{spacegroup_number})')
    print(f'Formula: {refined_structure.composition.reduced_formula}')
    print(f'Number of sites: {lowest_energy_doc.nsites}')
    print(f'Energy per atom: {lowest_energy_doc.energy_per_atom:.4f} eV')
    print(f'Formation energy per atom: {lowest_energy_doc.formation_energy_per_atom:.4f} eV')
    print(f'Lattice parameters: a={refined_structure.lattice.a:.4f}, b={refined_structure.lattice.b:.4f}, c={refined_structure.lattice.c:.4f}')
    print(f'Lattice angles: alpha={refined_structure.lattice.alpha:.2f}, beta={refined_structure.lattice.beta:.2f}, gamma={refined_structure.lattice.gamma:.2f}')
    print('\nAtomic positions:')
    for site in refined_structure:
        print(f'{site.specie} at {site.frac_coords}')
    cif = CifWriter(refined_structure, symprec=0.1)
    cif.write_file('lowest_energy_Si_experimental.cif')
    print("\nCIF file 'lowest_energy_Si_experimental.cif' has been created with the correct space group information.")
    return (
        CifWriter,
        MPRester,
        SpacegroupAnalyzer,
        cif,
        lowest_energy_doc,
        lowest_energy_id,
        refined_structure,
        sga,
        site,
        spacegroup,
        spacegroup_number,
        structure,
    )

@app.cell
def __():
    return

if __name__ == "__main__":
    app.run()

I tried several times, but still couldn't solve the problem.

akshayka commented 2 weeks ago

@hongyi-zhao, here is a fixed version of your notebook. Importing a module counts as defining it, so the same module shouldn't be imported in multiple cells. In this case, MPRester was being imported in multiple cells. The solution is to move that import into just one cell.

import marimo

__generated_with = "0.8.3"
app = marimo.App()

@app.cell
def __():
    from mp_api.client import MPRester
    return MPRester,

@app.cell
def __(MPRester):
    def get_column_width(docs, attr, header, transform=None):
        return max(
            len(header),
            max(
                (
                    len(
                        str(
                            transform(getattr(doc, attr))
                            if transform
                            else getattr(doc, attr)
                        )
                    )
                    for doc in _docs
                )
            ),
        )

    with MPRester() as _mpr:
        _docs = _mpr.materials.summary.search(
            chemsys="Si",
            fields=[
                "material_id",
                "formula_pretty",
                "theoretical",
                "symmetry",
                "energy_per_atom",
                "energy_above_hull",
            ],
        )
        _experimental_docs = [doc for doc in _docs if not doc.theoretical]
        _experimental_docs.sort(key=lambda x: x.energy_per_atom)
        columns = [
            ("Material ID", "material_id", str),
            ("Formula", "formula_pretty", str),
            ("Space Group", "symmetry", lambda x: x.symbol),
            ("Crystal System", "symmetry", lambda x: x.crystal_system),
            ("Energy per Atom (eV)", "energy_per_atom", lambda x: f"{x:.4f}"),
            ("Energy Above Hull (eV)", "energy_above_hull", lambda x: f"{x:.4f}"),
        ]
        widths = [
            get_column_width(_experimental_docs, attr, header, transform)
            for header, attr, transform in columns
        ]
        fmt = " | ".join(["{:<" + str(width) + "}" for width in widths])
        print(
            f"Found {len(_experimental_docs)} experimentally observed Si structures:\n"
        )
        print(fmt.format(*(column[0] for column in columns)))
        print("-" * (sum(widths) + 3 * (len(columns) - 1)))
        for doc in _experimental_docs:
            print(
                fmt.format(
                    *(
                        transform(getattr(doc, attr))
                        for _, attr, transform in columns
                    )
                )
            )
    return columns, doc, fmt, get_column_width, widths

@app.cell
def __(MPRester):
    with MPRester() as _mpr:
        available_fields = _mpr.materials.summary.available_fields
        print("Available fields for materials.summary:")
        for field in sorted(available_fields):
            print(f"- {field}")
    return available_fields, field

@app.cell
def __(MPRester):
    from pymatgen.symmetry.analyzer import SpacegroupAnalyzer
    from pymatgen.io.cif import CifWriter

    with MPRester() as _mpr:
        _docs = _mpr.materials.summary.search(
            chemsys="Si",
            fields=[
                "material_id",
                "energy_per_atom",
                "nsites",
                "formation_energy_per_atom",
                "theoretical",
            ],
        )
        _experimental_docs = [doc for doc in _docs if not doc.theoretical]
        lowest_energy_doc = min(
            _experimental_docs, key=lambda x: x.energy_per_atom
        )
        lowest_energy_id = lowest_energy_doc.material_id
        structure = _mpr.get_structure_by_material_id(lowest_energy_id)
    sga = SpacegroupAnalyzer(structure)
    spacegroup = sga.get_space_group_symbol()
    spacegroup_number = sga.get_space_group_number()
    refined_structure = sga.get_refined_structure()
    print("Lowest energy experimentally observed Si structure:")
    print(f"Material ID: {lowest_energy_id}")
    print(f"Space group: {spacegroup} (#{spacegroup_number})")
    print(f"Formula: {refined_structure.composition.reduced_formula}")
    print(f"Number of sites: {lowest_energy_doc.nsites}")
    print(f"Energy per atom: {lowest_energy_doc.energy_per_atom:.4f} eV")
    print(
        f"Formation energy per atom: {lowest_energy_doc.formation_energy_per_atom:.4f} eV"
    )
    print(
        f"Lattice parameters: a={refined_structure.lattice.a:.4f}, b={refined_structure.lattice.b:.4f}, c={refined_structure.lattice.c:.4f}"
    )
    print(
        f"Lattice angles: alpha={refined_structure.lattice.alpha:.2f}, beta={refined_structure.lattice.beta:.2f}, gamma={refined_structure.lattice.gamma:.2f}"
    )
    print("\nAtomic positions:")
    for site in refined_structure:
        print(f"{site.specie} at {site.frac_coords}")
    cif = CifWriter(refined_structure, symprec=0.1)
    cif.write_file("lowest_energy_Si_experimental.cif")
    print(
        "\nCIF file 'lowest_energy_Si_experimental.cif' has been created with the correct space group information."
    )
    return (
        CifWriter,
        SpacegroupAnalyzer,
        cif,
        lowest_energy_doc,
        lowest_energy_id,
        refined_structure,
        sga,
        site,
        spacegroup,
        spacegroup_number,
        structure,
    )

@app.cell
def __():
    return

if __name__ == "__main__":
    app.run()

hongyi-zhao commented 2 weeks ago

Thank you very much for the fix.

Additional remark:

@app.cell
def __(MPRester):
    def get_column_width(docs, attr, header, transform=None):
        return max(
            len(header),
            max(
                (
                    len(
                        str(
                            transform(getattr(doc, attr))
                            if transform
                            else getattr(doc, attr)
                        )
                    )
                    for doc in _docs
                )
            ),
        )

The code line for doc in _docs

should be changed into:

for doc in docs

Otherwise, the following error will be triggered:

Traceback (most recent call last):
  File "/home/werner/.pyenv/versions/datasci/lib/python3.11/site-packages/marimo/_runtime/executor.py", line 170, in execute_cell
    exec(cell.body, glbls)
  Cell marimo:///home/werner/11.py#cell=cell-1
, line 41, in <module>
    widths = [
             ^
  Cell marimo:///home/werner/11.py#cell=cell-1
, line 42, in <listcomp>
    get_column_width(_experimental_docs, attr, header, transform)
  Cell marimo:///home/werner/11.py#cell=cell-1
, line 13, in get_column_width
    for doc in _docs
               ^^^^^
NameError: name '_docs' is not defined
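For reference, the corrected function can be tried outside marimo. This is a minimal sketch: the only change from the version above is that the generator iterates over the `docs` parameter, and the sample records are made-up stand-ins for the API results, not real data.

```python
from types import SimpleNamespace


def get_column_width(docs, attr, header, transform=None):
    """Return the column width: the longest of the header and the formatted values."""
    return max(
        len(header),
        max(
            len(str(transform(getattr(doc, attr)) if transform else getattr(doc, attr)))
            for doc in docs  # the function's own parameter, not the cell-local _docs
        ),
    )


# Hypothetical records, for illustration only.
sample_docs = [
    SimpleNamespace(material_id="id-1", energy_per_atom=-1.5),
    SimpleNamespace(material_id="id-22", energy_per_atom=-2.25),
]

print(get_column_width(sample_docs, "material_id", "ID", str))
print(
    get_column_width(
        sample_docs, "energy_per_atom", "Energy per Atom (eV)", lambda x: f"{x:.4f}"
    )
)
```

Because the function now depends only on its arguments, it no longer matters which names the calling cell keeps private.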

hongyi-zhao commented 2 weeks ago

It would be great if we could avoid having to use the underscore prefix so often across cells. Additionally, the output of marimo export script notebook.py > script.py retains these underscore prefixes, which makes the code look inelegant.

mscolnick commented 2 weeks ago

@hongyi-zhao, the underscore signifies that a variable is private to the cell and should not be leaked into the global state (or accessed by other cells). This was a conscious design decision that, while open to discussion (maybe in the discussions tab), likely won't change in the near future.

I am going to close out this issue since I think we have answered the original issue, but please feel free to reply if not.

hongyi-zhao commented 2 weeks ago

See https://github.com/marimo-team/marimo/discussions/2152 for the related discussion.