Open PGijsbers opened 2 years ago
I used the following script to roughly replicate the openml benchmark suites table:
import openml
import pandas as pd


def to_latex(suite_id, first_caption=None, second_caption="auto", label=None, filename=None):
    # Derive the caption for continuation pages from the first caption.
    if second_caption == "auto":
        if not first_caption:
            second_caption = None
        elif first_caption.endswith("."):
            second_caption = first_caption[:-1] + " (continued)."
        else:
            second_caption = first_caption + " (continued)"

    suite = openml.study.get_suite(suite_id)
    tasks = [openml.tasks.get_task(tid, download_data=False, download_qualities=False) for tid in suite.tasks]
    metadata = openml.datasets.list_datasets(data_id=[t.dataset_id for t in tasks], output_format="dataframe")
    task_data = pd.DataFrame([[t.id, t.dataset_id] for t in tasks], columns=["tid", "did"]).set_index("did")
    metadata = metadata.join(task_data, on="did")

    # Prepare fields for presentation
    metadata = metadata.rename(columns=dict(
        NumberOfInstances="n",
        NumberOfFeatures="p",
        NumberOfClasses="C",
        did="Dataset ID",
        tid="Task ID",
    ))
    metadata[["n", "p", "C"]] = metadata[["n", "p", "C"]].astype(int)
    columns_to_show = ["Task ID", "name", "n", "p"]
    if "MinorityClassSize" in metadata:
        metadata["class ratio"] = metadata["MinorityClassSize"] / metadata["MajorityClassSize"]
        columns_to_show.extend(["C", "class ratio"])
    metadata = metadata.sort_values("name", key=lambda n: n.str.lower())

    #metadata.style.to_latex("my-table.tex")
    styler = metadata[columns_to_show].style
    styler = styler.format({"class ratio": '{:,.2f}'.format})
    styler = styler.hide(axis="index")
    latex = styler.to_latex()
    # Raw string: "\_" is not a valid Python escape sequence.
    latex = latex.replace("_", r"\_")
    latex = latex.replace("begin{tabular}", "begin{longtable}")
    latex = latex.replace("end{tabular}", "end{longtable}")

    # Add a repeating header
    start, header, *rows, end = latex.splitlines()
    # Insert spacing before every group of five rows (back to front so indices stay valid).
    for i in reversed(range(0, len(rows), 5)):
        rows.insert(i, r"\addlinespace")
    table_header = [
        r"\toprule",
        header,
        r"\midrule",
        r"\midrule",
    ]
    lines = [
        start,
        r"\caption{{{}}}".format(first_caption) if first_caption else "",
        r"\label{{{}}}".format(label) if label else "",
        r"\\" if first_caption or label else "",
        *table_header,
        r"\endfirsthead",
        r"\caption{{{}}}\\".format(second_caption) if second_caption else "",
        *table_header,
        r"\endhead",
        *rows,
        r"\bottomrule",
        end,
    ]
    filename = filename or f"suite-{suite_id}.tex"
    with open(filename, "w") as fh:
        fh.write("\n".join(lines))


to_latex(269, first_caption="Tasks in the AutoML regression suite.", label="tab:269")
to_latex(271, first_caption="Tasks in the AutoML classification suite.", label="tab:271")
to_latex(99, first_caption="Tasks OpenML-CC18.", label="tab:cc18")
There are too many variables and too much customization to support, in my opinion. One concession is that we might provide exactly one built-in table format (basically this) with no further customization.
That is a great suggestion. I think providing a standard to_latex would be great, so the tables in different papers look similar, which would reduce cognitive overhead.
Could we also somehow add a reference to a paper? Maybe that would be future work?
Also, would this look better than https://arxiv.org/pdf/2007.04074.pdf ?
> Could we also somehow add a reference to a paper? Maybe that would be future work?
I'm not entirely sure how to do this. Since we're generating LaTeX output here, one would expect the reference to be available in the .bib file, which is separate (it might be possible to provide some hacks, but that brings its own set of problems). We could assist the user by providing a commented-out BibTeX entry in the .tex file, so that they may manually include it. I think this would be nice, but I would wait until we have official support for citation information of benchmarking suites.
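A rough sketch of the "commented-out BibTeX entry" idea, using only string handling. The helper names and the placeholder entry are made up for illustration; nothing here is part of openml-python.

```python
def comment_out_bibtex(entry):
    """Prefix every line of a BibTeX entry with '% ' so LaTeX ignores it."""
    return "\n".join(f"% {line}".rstrip() for line in entry.strip().splitlines())


def prepend_citation_hint(latex, entry):
    """Place the commented-out entry above the generated table code."""
    hint = "% Suggested BibTeX entry (copy it into your .bib file):"
    return "\n".join([hint, comment_out_bibtex(entry), latex])


# Placeholder entry for illustration only; it is not a real reference.
example_entry = """
@misc{placeholder_key,
  title = {Placeholder title},
}
"""

print(prepend_citation_hint(r"\begin{longtable}{lrr}", example_entry))
```

Because every added line starts with `%`, the .tex file still compiles unchanged and the user decides whether to copy the entry.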
> Also, would this look better than https://arxiv.org/pdf/2007.04074.pdf ?
The output generated here is based on the benchmarking suites paper, which does make it rather large (~30 tasks per page). I also imagine we'll want to change some things (at the very least either use full names in column headers, or automatically add a description of the headers to the caption).
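To illustrate the second option (describing the headers in the caption), here is a small sketch. The function name and the exact wording of the legend are assumptions; the meanings of n, p and C follow the column renaming in the script.

```python
def caption_with_legend(caption, legend):
    """Append an explanation of the short column headers to a caption."""
    if not legend:
        return caption
    parts = [f"${symbol}$: {meaning}" for symbol, meaning in legend.items()]
    base = caption if caption.endswith(".") else caption + "."
    return f"{base} Columns: " + "; ".join(parts) + "."


legend = {
    "n": "number of instances",
    "p": "number of features",
    "C": "number of classes",
}
print(caption_with_legend("Tasks in the AutoML classification suite.", legend))
```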
pandas 2.0 has a new latex export based on jinja2 templates: https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#dataframe-to-latex-has-a-new-render-engine
Maybe we can use that?
Seems reasonable at first glance, though I don't have time to convert the code (or test that it gives qualitatively similar results). Since this also introduces another dependency, maybe we should consider providing this functionality as part of some independent "contrib" package (e.g., openml-python-latex) instead of openml-python itself. What do you think? I would still reference the package in the docs.
Optional dependency is also fine with me.
Hey @mfeurer, @PGijsbers, I've been going through the project and exploring potential good first issues. It seems that there's a discussion about implementing some custom functionality related to Jinja2 and LaTeX in the project. The key takeaway from the conversation is that this functionality would be optional. Could you please provide more details about what exactly you're envisioning for this feature? It would be helpful to get a clearer understanding of what's expected.
We would envision a to_latex method that exports information about datasets as a LaTeX table, similar to the one displayed above. If the method uses pandas 2.0's LaTeX export that is based on jinja2 templates, then the jinja2 dependency should be optional for openml-python. The implementation also needs to work for large lists of datasets which may generate multi-page tables. If you have specific questions, feel free to ask them here.
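One common way to keep a dependency optional is to import it only when the feature is used and to fail with an actionable message otherwise. The sketch below uses only the standard library; `require_optional` and `to_latex_available` are hypothetical names, not existing openml-python functions.

```python
import importlib


def require_optional(module_name, feature):
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            f"Install it with: pip install {module_name}"
        ) from err


def to_latex_available():
    """Report whether the jinja2-based export could be used at all."""
    try:
        require_optional("jinja2", "LaTeX export")
    except ImportError:
        return False
    return True


print(to_latex_available())
```

With this pattern, importing the package itself never needs jinja2; only a call to the export function does.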
Thanks @PGijsbers for summarizing!
So here is the algorithm based on my thinking:
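def to_latex(dataframe, columns=None, header=True, index=False):
    try:
        # Prefer pandas' own export; in pandas 2.0 it needs jinja2.
        return dataframe.to_latex(columns=columns, header=header, index=index)
    except ImportError:
        # Our custom to_latex would go here.
        # Open question: how many columns should be allowed?
        raise NotImplementedError("custom to_latex fallback is not written yet")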
For a multi-page table, it should return different LaTeX each time.
Does this look good to go? In which file should we add the code?
Please have a look at the 2.0 to_latex for pandas. I don't think you would encounter any exception, though alteration of the LaTeX code after generation may be necessary (I haven't looked into this yet, so I don't know whether the 2.0 version is flexible enough). Multi-page tables shouldn't require multiple calls to the function; the returned LaTeX code should simply represent a multi-page table. Overall, I don't think the new implementation would be very different from the one provided in the original post, but it should be updated to use the new 2.0 syntax and broken down into smaller functions with narrower responsibilities (e.g., one function to generate the dataframe with the aliased column names, one to generate the LaTeX code, one to write it to disk). Where possible
ps. It's possible we might have misjudged the difficulty of this issue, don't feel obliged to stick with it. If you'd rather pick up a different issue, you would also be more than welcome to :)
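The suggested split into three small functions could look roughly like the sketch below. To stay runnable without pandas or network access it works on plain row dicts (with a DataFrame, `df.to_dict("records")` would produce such rows) and builds the longtable by hand instead of using the pandas 2.0 export, so it shows the decomposition only. All function names and the sample row are made up; the output assumes the longtable and booktabs packages in the preamble.

```python
from pathlib import Path

# Short aliases taken from the column renaming in the original script.
ALIASES = {"NumberOfInstances": "n", "NumberOfFeatures": "p", "NumberOfClasses": "C"}


def alias_columns(rows, aliases=ALIASES):
    """Return copies of the row dicts with aliased column names."""
    return [{aliases.get(key, key): value for key, value in row.items()} for row in rows]


def escape_latex(value):
    """Escape underscores only, mirroring the original script."""
    return str(value).replace("_", r"\_")


def render_longtable(rows, columns, caption=None, label=None, group_size=5):
    """Build a single longtable whose header repeats on every page."""
    header = " & ".join(escape_latex(c) for c in columns) + r" \\"
    head_block = [r"\toprule", header, r"\midrule"]
    lines = [r"\begin{longtable}{" + "l" * len(columns) + "}"]
    if caption:
        tag = r"\label{%s}" % label if label else ""
        lines.append(r"\caption{%s}%s \\" % (caption, tag))
    lines += head_block + [r"\endfirsthead"] + head_block + [r"\endhead"]
    for i, row in enumerate(rows):
        if i and i % group_size == 0:
            lines.append(r"\addlinespace")  # visual gap between row groups
        lines.append(" & ".join(escape_latex(row[c]) for c in columns) + r" \\")
    lines += [r"\bottomrule", r"\end{longtable}"]
    return "\n".join(lines)


def write_text(latex, filename):
    """Write the generated LaTeX to disk and return the path."""
    path = Path(filename)
    path.write_text(latex, encoding="utf-8")
    return path


# Made-up sample row, for illustration only.
sample = [{"name": "toy_data", "NumberOfInstances": 100, "NumberOfFeatures": 4}]
print(render_longtable(alias_columns(sample), ["name", "n", "p"], caption="Example."))
```

Each function can be tested on its own, and only `render_longtable` would need to change if the pandas 2.0 Styler export replaced the hand-built table.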
My thinking :
import openml
import pandas as pd


class BenchmarkingSuite:
    def __init__(self, suite_id):
        self.suite = openml.study.get_suite(suite_id)
        self.tasks = [
            openml.tasks.get_task(tid, download_data=False, download_qualities=False)
            for tid in self.suite.tasks
        ]

    def _get_metadata_dataframe(self):
        data_ids = [t.dataset_id for t in self.tasks]
        # output_format takes the lowercase string "dataframe"
        metadata = openml.datasets.list_datasets(data_id=data_ids, output_format="dataframe")
        return metadata

    def to_latex(self, output_file="my-table.tex", multi_page=False):
        metadata = self._get_metadata_dataframe()
        # A longtable breaks across pages by itself, so the rows are not split into chunks.
        environment = "longtable" if multi_page else "tabular"
        header = " & ".join(str(c).replace("_", r"\_") for c in metadata.columns) + r" \\"
        # Prepare LaTeX template
        lines = [
            r"\documentclass{article}",
            r"\usepackage{longtable}",
            r"\begin{document}",
            # Build the column spec first: "%" and "*" share precedence.
            r"\begin{%s}{%s}" % (environment, "c" * len(metadata.columns)),
            r"\hline",
            header,
            r"\hline",
        ]
        if multi_page:
            # The repeated header/footer commands exist only in longtable.
            lines += [r"\endfirsthead", r"\hline", header, r"\hline", r"\endhead"]
            lines += [r"\hline", r"\endfoot", r"\endlastfoot"]
        # Emit one table row per dataset, escaping underscores.
        for row in metadata.astype(str).to_numpy():
            lines.append(" & ".join(v.replace("_", r"\_") for v in row) + r" \\")
        lines += [r"\hline", r"\end{%s}" % environment, r"\end{document}"]
        # Write LaTeX template to the specified output file
        with open(output_file, "w") as f:
            f.write("\n".join(lines))


# Usage
benchmarking_suite = BenchmarkingSuite(271)
benchmarking_suite.to_latex(output_file="my-table.tex", multi_page=True)
The best way to export basic benchmarking suite information to LaTeX is currently using pandas' to_latex function. Something like:
It would be nice if we could make this easily available. However, I don't think supporting LaTeX exports natively is the right call. It would produce a lot of overhead (e.g., selecting columns, forwarding arguments to to_latex, which may be deprecated in the future, etc.). I think our primary mode of support for this feature should be to include an example in the documentation on how to achieve this (including some simple styling examples). However, I do think we could add a function (or lazy property) suite.metadata which returns the dataframe that is generated above (with some added task information). This allows for quick generation of LaTeX tables on the one hand while minimizing the support we would need to give for output customization on the other.
@mfeurer we discussed the functionality before, thoughts?
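A minimal sketch of what a lazy metadata property might look like, assuming Python 3.8+ for `functools.cached_property`. `SuiteSketch` and the injected `load_metadata` callable are hypothetical stand-ins; in openml-python the loader would wrap calls such as `openml.datasets.list_datasets`, and the real suite class may look quite different.

```python
from functools import cached_property


class SuiteSketch:
    """Hypothetical stand-in for a benchmarking suite object."""

    def __init__(self, task_ids, load_metadata):
        self.task_ids = list(task_ids)
        # Any callable that maps task ids to a metadata table.
        self._load_metadata = load_metadata

    @cached_property
    def metadata(self):
        """Fetch the metadata on first access and reuse it afterwards."""
        return self._load_metadata(self.task_ids)


calls = []


def fake_loader(task_ids):
    # Stand-in for the server round trip; records how often it is called.
    calls.append(tuple(task_ids))
    return [{"tid": tid} for tid in task_ids]


suite = SuiteSketch([1, 2], fake_loader)
suite.metadata
suite.metadata
print(len(calls))  # the loader ran only once
```

Users would then style and export `suite.metadata` themselves, which keeps output customization out of the library.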