openml / openml-python

Python module to interface with OpenML
https://openml.github.io/openml-python/main/

Support for Exporting Benchmarking Suites to LaTeX #1126

Open PGijsbers opened 2 years ago

PGijsbers commented 2 years ago

The best way to export basic benchmarking suite information to LaTeX is currently to use pandas' to_latex function. Something like:

import openml
suite = openml.study.get_suite(271)
tasks = [openml.tasks.get_task(tid, download_data=False, download_qualities=False) for tid in suite.tasks]
metadata = openml.datasets.list_datasets(data_id=[t.dataset_id for t in tasks], output_format="dataframe")
metadata.style.to_latex("my-table.tex")

It would be nice if we could make this easily available. However, I don't think supporting LaTeX exports natively is the right call. It would produce a lot of overhead (e.g., selecting columns, forwarding arguments to to_latex which may be deprecated in the future, etc.). I think our primary mode of support for this feature should be to include an example in the documentation on how to achieve this (including some simple styling examples). However, I do think we could add a function (or lazy property) suite.metadata which returns the dataframe that is generated above (with some added task information). This allows for quick generation of LaTeX tables on the one hand while minimizing the support we would need to give for output customization on the other.
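
A minimal sketch of what such a lazy `suite.metadata` property might look like. `SuiteWithMetadata` and `load_metadata` are hypothetical names used only for illustration (they are not part of openml-python), and the loader stands in for the `list_datasets` call shown above. The point is only that `functools.cached_property` defers the lookup until first access and then reuses the result.

```python
from functools import cached_property


class SuiteWithMetadata:
    """Hypothetical wrapper; openml-python has no such class."""

    def __init__(self, task_ids, load_metadata):
        self.tasks = list(task_ids)
        # Callable that fetches the metadata table for the given task ids
        self._load_metadata = load_metadata

    @cached_property
    def metadata(self):
        # Runs on first access only; later accesses reuse the cached result
        return self._load_metadata(self.tasks)


calls = []


def fake_loader(task_ids):
    # Stand-in for the server round trip; records every invocation
    calls.append(list(task_ids))
    return [{"tid": tid} for tid in task_ids]


suite = SuiteWithMetadata([3, 6], fake_loader)
first = suite.metadata
second = suite.metadata  # served from the cache, fake_loader is not called again
```

With a property like this, users would call their own `to_latex` on the returned table, so no styling arguments need to be forwarded by the library.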

@mfeurer we discussed the functionality before, thoughts?

PGijsbers commented 2 years ago

I used the following script to roughly replicate the openml benchmark suites table:

import openml
import pandas as pd
def to_latex(suite_id, first_caption=None, second_caption="auto", label=None, filename=None):
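    if second_caption == "auto":
        if first_caption is None:
            # No first caption to derive a continuation caption from
            second_caption = None
        elif first_caption.endswith("."):
            second_caption = first_caption[:-1] + " (continued)."
        else:
            second_caption = first_caption + " (continued)"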

    suite = openml.study.get_suite(suite_id)
    tasks = [openml.tasks.get_task(tid, download_data=False, download_qualities=False) for tid in suite.tasks]

    metadata = openml.datasets.list_datasets(data_id=[t.dataset_id for t in tasks], output_format="dataframe")
    task_data = pd.DataFrame([[t.id, t.dataset_id] for t in tasks], columns=["tid", "did"]).set_index("did")
    metadata = metadata.join(task_data, on="did")

    # Prepare fields for presentation 
    metadata = metadata.rename(columns=dict(
        NumberOfInstances="n",
        NumberOfFeatures="p",
        NumberOfClasses="C",
        did="Dataset ID",
        tid="Task ID",
    ))
    metadata[["n", "p", "C"]] = metadata[["n", "p", "C"]].astype(int)

    columns_to_show = ["Task ID", "name", "n", "p"]
    if "MinorityClassSize" in metadata:
        metadata["class ratio"] = metadata["MinorityClassSize"] / metadata["MajorityClassSize"]
        columns_to_show.extend(["C", "class ratio"])
    metadata = metadata.sort_values("name", key=lambda n: n.str.lower())

    #metadata.style.to_latex("my-table.tex")
    styler = metadata[columns_to_show].style
    styler = styler.format({"class ratio": '{:,.2f}'.format})
    styler = styler.hide(axis="index")

    latex = styler.to_latex()
    latex = latex.replace("_", r"\_")
    latex = latex.replace("begin{tabular}", "begin{longtable}")
    latex = latex.replace("end{tabular}", "end{longtable}")

    # Add a repeating header 
    start, header, *rows, end = latex.splitlines()
    for i in reversed(range(0, len(rows), 5)):
        rows.insert(i, r"\addlinespace")

    table_header = [
        r"\toprule",
        header,
        r"\midrule",
        r"\midrule",
    ]

    lines = [
        start,

        r"\caption{{{}}}".format(first_caption) if first_caption else "",
        r"\label{{{}}}".format(label) if label else "",
        r"\\" if first_caption or label else "",
        *table_header,
        r"\endfirsthead",

        r"\caption{{{}}}\\".format(second_caption) if second_caption else "",
        *table_header,
        r"\endhead",

        *rows,
        r"\bottomrule",
        end,

    ]

    filename = filename or f"suite-{suite_id}.tex"
    with open(filename, "w") as fh:
        fh.write("\n".join(lines))

to_latex(269, first_caption="Tasks in the AutoML regression suite.", label="tab:269")
to_latex(271, first_caption="Tasks in the AutoML classification suite.", label="tab:271")
to_latex(99, first_caption="Tasks OpenML-CC18.", label="tab:cc18")

There are too many variables and too much customization to support, in my opinion. One concession is that we might provide exactly one built-in table format (basically this) with no further customization.

mfeurer commented 2 years ago

That is a great suggestion. I think providing a standard to_latex would be great, so the tables in different papers look similar, which would reduce cognitive overhead.

Could we also somehow add a reference to a paper? Maybe that would be future work?

Also, would this look better than https://arxiv.org/pdf/2007.04074.pdf ?

PGijsbers commented 2 years ago

> Could we also somehow add a reference to a paper? Maybe that would be future work?

Not entirely sure how to do this. Since we're generating LaTeX output here, one would expect the reference to be available in the .bib file which is separate (it might be possible to provide some hacks, but that brings its own set of problems). We could assist the user by providing a commented out bibtex entry in the tex file, so that they may manually include it. I think this would be nice, but I would wait until we have official support for citation information of benchmarking suites.
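
The commented-out BibTeX idea could be sketched as below. The function names and the placeholder entry are assumptions made for illustration; real citation data would have to come from the suite itself once that is supported.

```python
def as_latex_comment(text):
    """Prefix every line with '% ' so that LaTeX ignores it."""
    return "\n".join(("% " + line).rstrip() for line in text.splitlines())


def with_citation_hint(latex, bibtex_entry):
    """Prepend a commented-out BibTeX entry to generated LaTeX code."""
    hint = "% Add the following entry to your .bib file to cite this suite:"
    return "\n".join([hint, as_latex_comment(bibtex_entry), latex])


# Placeholder entry, not a real reference
placeholder = "@misc{suite-key,\n  title = {Suite title},\n}"
result = with_citation_hint(r"\begin{longtable}{l}\end{longtable}", placeholder)
```

Because the entry is only a comment, the generated .tex file still compiles whether or not the user copies it into their .bib file.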

> Also, would this look better than https://arxiv.org/pdf/2007.04074.pdf ?

The output generated here is based on the benchmarking suites paper, which does make it rather large (~30 tasks per page). I also imagine we'll want to change some things (at the very least either use full names in column headers, or automatically add a description of the headers to the caption).
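
One way the "description of the headers in the caption" option might work is sketched here. `HEADER_DESCRIPTIONS` and `caption_with_legend` are hypothetical names; the abbreviations follow the column renames in the script above.

```python
# Assumed descriptions for the short column names used in the script above
HEADER_DESCRIPTIONS = {
    "n": "number of instances",
    "p": "number of features",
    "C": "number of classes",
}


def caption_with_legend(caption, columns, descriptions=HEADER_DESCRIPTIONS):
    """Append a legend such as 'n: number of instances' for abbreviated columns."""
    legend = ", ".join(
        f"{col}: {descriptions[col]}" for col in columns if col in descriptions
    )
    if not legend:
        return caption
    # Make sure the original caption ends in a full stop before adding the legend
    base = caption if caption.endswith(".") else caption + "."
    return f"{base} Columns: {legend}."


example = caption_with_legend("Tasks in the suite", ["Task ID", "name", "n", "p"])
```

Only columns that are actually shown end up in the legend, so a regression suite without a "C" column would not mention classes.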

mfeurer commented 1 year ago

pandas 2.0 has a new latex export based on jinja2 templates: https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#dataframe-to-latex-has-a-new-render-engine

Maybe we can use that?

PGijsbers commented 1 year ago

Seems reasonable at first glance, though I don't have time to convert the code (or test that it gives qualitatively similar results). Since this also introduces another dependency, maybe we should consider providing this functionality as part of some independent "contrib" package (e.g., openml-python-latex) instead of openml-python itself. What do you think? I would still reference the package in the docs.

mfeurer commented 1 year ago
PGijsbers commented 1 year ago

Optional dependency is also fine with me.
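
A rough sketch of how an optional dependency is commonly guarded, using only the standard library. `require_optional` and `to_latex_available` are hypothetical helper names, not existing openml-python functions.

```python
import importlib.util


def require_optional(module_name, feature):
    """Raise a helpful ImportError if an optional module is not installed."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'. "
            f"Install it with: pip install {module_name}"
        )


def to_latex_available():
    """Return True if the jinja2-based export could be used."""
    try:
        require_optional("jinja2", "LaTeX export")
    except ImportError:
        return False
    return True
```

The check would run only when the export is requested, so importing openml itself never fails because jinja2 is missing.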

pushkal00 commented 1 year ago

Hey @mfeurer, @PGijsbers, I've been going through the project and exploring potential good first issues. It seems that there's a discussion about implementing some custom functionality related to Jinja2 and LaTeX in the project. The key takeaway from the conversation is that this functionality would be optional. Could you please provide more details about what exactly you're envisioning for this feature? It would be helpful to get a clearer understanding of what's expected.

PGijsbers commented 1 year ago

We would envision a to_latex method that exports information about datasets as a latex table, similar to the one displayed above. If the method uses pandas 2.0's latex export that is based on jinja2-templates, then the jinja2 dependency should be optional for openml-python. The implementation also needs to work for large lists of datasets which may generate multi-page tables. If you have specific questions, feel free to ask them here.
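
The multi-page requirement can be met by a single longtable whose header repeats on every page, rather than by several tables. The sketch below post-processes an already generated tabular string. It assumes booktabs-style output in which `\midrule` sits on its own line directly after the header row; `tabular_to_longtable` is a hypothetical helper.

```python
def tabular_to_longtable(latex, caption=None):
    """Convert a tabular string into a longtable with a repeating header."""
    lines = latex.strip().splitlines()
    begin, end = lines[0], lines[-1]
    body = lines[1:-1]
    # Assumption: the header is everything up to and including the first \midrule
    split = body.index(r"\midrule") + 1
    header, rows = body[:split], body[split:]

    out = [begin.replace("{tabular}", "{longtable}")]
    if caption:
        out.append(r"\caption{%s}\\" % caption)
    out += header + [r"\endfirsthead"]
    if caption:
        # The empty optional argument keeps the continuation out of the list of tables
        out.append(r"\caption[]{%s (continued)}\\" % caption)
    out += header + [r"\endhead"]
    out += rows
    out.append(end.replace("{tabular}", "{longtable}"))
    return "\n".join(out)


tabular = "\n".join([
    r"\begin{tabular}{lr}",
    r"\toprule",
    r"name & n \\",
    r"\midrule",
    r"iris & 150 \\",
    r"\bottomrule",
    r"\end{tabular}",
])
longtable = tabular_to_longtable(tabular, caption="Tasks in the suite")
```

LaTeX then decides where the page breaks fall, so the function returns one table regardless of how many datasets are listed.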

pushkal00 commented 1 year ago

Thanks @PGijsbers for the summary!

So here is the algorithm based on my thinking:

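def to_latex(dataframe, columns, header, index):
    try:
        # Prefer the pandas export when it is available
        return dataframe.to_latex(columns=columns, header=header, index=index)
    except ImportError:
        # pandas 2.0's to_latex needs jinja2; our custom to_latex would go here.
        # Open question: how many columns should be allowed?
        # For a multi-page table, it should return different LaTeX each time.
        raise NotImplementedError("custom to_latex fallback")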

Is it good to go? In which file should we add the code?

PGijsbers commented 1 year ago

Please have a look at the 2.0 to_latex for pandas. I don't think you would encounter any exception, though alteration of the LaTeX code after generation may be necessary (I haven't looked into this yet, so I don't know how flexible the 2.0 version is). Multi-page tables shouldn't require multiple calls to the function; the returned LaTeX code should simply represent a multi-page table. Overall, I don't think the new implementation would be very different from the one provided in the original post, but it should be updated to use the new 2.0 syntax and broken down into smaller functions with smaller responsibilities (e.g., one function to generate the dataframe with the aliased column names, one to generate the LaTeX code, one to write it to disk). Where possible

ps. It's possible we might have misjudged the difficulty of this issue, don't feel obliged to stick with it. If you'd rather pick up a different issue, you would also be more than welcome to :)
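
The three-way split suggested above might be structured as in the sketch below. To stay dependency-free it uses a list of dicts as a stand-in for the dataframe; all function names, the `ALIASES` mapping, and the example row are made up for illustration.

```python
from pathlib import Path

# Assumed aliases, mirroring the renames in the original script
ALIASES = {"NumberOfInstances": "n", "NumberOfFeatures": "p", "tid": "Task ID"}


def alias_columns(records, aliases=ALIASES):
    """Step 1: rename raw metadata keys to their presentation names."""
    return [{aliases.get(k, k): v for k, v in rec.items()} for rec in records]


def render_latex(records, columns):
    """Step 2: turn the records into LaTeX code (no file access)."""
    def cell(value):
        # Underscores must be escaped in LaTeX text
        return str(value).replace("_", r"\_")

    lines = [r"\begin{longtable}{%s}" % ("l" * len(columns))]
    lines.append(" & ".join(columns) + r" \\")
    lines += [" & ".join(cell(rec[c]) for c in columns) + r" \\" for rec in records]
    lines.append(r"\end{longtable}")
    return "\n".join(lines)


def write_latex(latex, filename):
    """Step 3: write the rendered table to disk."""
    Path(filename).write_text(latex, encoding="utf-8")


# Made-up example row, not real OpenML data
raw = [{"tid": 59, "name": "my_data", "NumberOfInstances": 150, "NumberOfFeatures": 5}]
table = render_latex(alias_columns(raw), ["Task ID", "name", "n", "p"])
```

Keeping rendering free of file access means it can be tested on the returned string alone, and users who only want the string never touch the disk.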

jot-s-bindra commented 11 months ago

My thinking:


import openml
import pandas as pd

class BenchmarkingSuite:
    def __init__(self, suite_id):
        self.suite = openml.study.get_suite(suite_id)
        self.tasks = [openml.tasks.get_task(tid, download_data=False, download_qualities=False) for tid in self.suite.tasks]

    def _get_metadata_dataframe(self):
        data_ids = [t.dataset_id for t in self.tasks]
        metadata = openml.datasets.list_datasets(data_id=data_ids, output_format="dataframe")
        return metadata

    def to_latex(self, output_file="my-table.tex", multi_page=False):
        metadata = self._get_metadata_dataframe()

        def escape(value):
            # Underscores are common in dataset names and must be escaped in LaTeX
            return str(value).replace("_", r"\_")

        # longtable breaks across pages on its own, so the rows need no manual chunking
        environment = "longtable" if multi_page else "tabular"
        header = " & ".join(escape(c) for c in metadata.columns) + r" \\"
        rows = [
            " & ".join(escape(v) for v in row) + r" \\"
            for row in metadata.itertuples(index=False, name=None)
        ]

        lines = [
            r"\documentclass{article}",
            r"\usepackage{longtable}",
            r"\begin{document}",
            r"\begin{%s}{%s}" % (environment, "c" * len(metadata.columns)),
            r"\hline",
            header,
            r"\hline",
        ]
        if multi_page:
            # Repeat the header on every continuation page
            lines += [r"\endfirsthead", r"\hline", header, r"\hline", r"\endhead"]
            lines += [r"\hline", r"\endfoot", r"\endlastfoot"]
        lines += rows
        lines += [r"\hline", r"\end{%s}" % environment, r"\end{document}"]

        # Write the LaTeX document to the specified output file
        with open(output_file, "w") as f:
            f.write("\n".join(lines))

# Usage
benchmarking_suite = BenchmarkingSuite(271)
benchmarking_suite.to_latex(output_file="my-table.tex", multi_page=True)