rycolab / aclpub2

MIT License

How to generate proceedings for *CL Conferences and Workshops in aclpub2 format

aclpub2 supports the generation of Proceedings and Booklets for *CL Conferences (ACL, NAACL, EMNLP, ...) and related Workshops. This README explains how to generate proceedings and booklets in aclpub2 format.

The provided Python tool takes as input a set of files containing all information on the event (in the .yml format) and generates a .tex file containing the conference details, sponsors, prefaces, organizing and program committees, as well as the concatenation of all the watermarked accepted papers and the author index. This .tex file is then compiled to generate the PDF file of the proceedings.

Before starting

Which reviewing platform is your conference/workshop using?

Table of Contents

  1. Proceedings input format and structure

  2. Expected output

  3. Manually editing yml input files

  4. How to export yml files from OpenReview

  5. Testing the tool to generate your proceedings

Proceedings input format and structure

The scripts to generate the proceedings accept as input a set of .yml files and directories. A YML file is a text document that contains data formatted using YAML (YAML Ain't Markup Language), a human-readable data format used for data serialization. You can open a YML file in any text editor (or source code editor). Examples and usage of YAML syntax can be found here.

The following .yml files should be provided to the generation scripts. Files 1, 2, 3, 4 and 6 should be manually edited with information concerning your conference/workshop, while files 5 and 7 can be automatically exported from OpenReview (or manually edited if you are not using OpenReview).

  1. conference_details.yml
  2. sponsors.yml (optional)
  3. prefaces.yml
  4. organizing_committee.yml
  5. program_committee.yml
  6. invited_talks.yml (optional)
  7. papers.yml

We strongly suggest taking a look at this link, where you can find examples of all the above files initialized for a past conference.

In addition, for the handbook, a file program.yml should be created (see the Handbook generation instructions).
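Before running the generation scripts, it can help to confirm that the input directory holds the files listed above. The following is a minimal sketch, not part of aclpub2: the function name `missing_input_files` is hypothetical, and treating every file not marked "(optional)" as required is an assumption based on the list above.

```python
from pathlib import Path

# File names are taken from the list above. Treating the files that are not
# marked "(optional)" as required is an assumption, not documented behavior.
REQUIRED_FILES = [
    "conference_details.yml",
    "prefaces.yml",
    "organizing_committee.yml",
    "program_committee.yml",
    "papers.yml",
]
OPTIONAL_FILES = ["sponsors.yml", "invited_talks.yml"]


def missing_input_files(input_dir):
    """Return the required .yml files that are absent from input_dir."""
    root = Path(input_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

For example, `missing_input_files("examples/sigdial")` should return an empty list if that directory contains all five required files.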

Expected output

The generated proceedings should be sent to the publication chairs as a .zip or .tgz file containing a folder named with the conference/workshop acronym. Some publication chairs prefer uploading the files to a dedicated GitHub repository.

The build process creates two directories called build and output. Note that the build directory is only temporary and is not intended to be shipped to the publication chairs. The output directory is the one to be shipped. This directory should contain all of the files that the publication chairs need, but it is always a good idea to confirm that it contains all of the files described below.

If you are interested in an example of the output folder, just run the software on the test case, as discussed here.

In a nutshell, the output folder should contain:

  1. A PDF file named proceedings.pdf containing the whole conference/workshop proceedings (i.e., the introduction and all the watermarked PDFs of the camera ready papers).
  2. A folder named watermarked_pdf containing all the pdfs of the watermarked camera ready papers.
    • Important: this folder MUST contain the special file named 0.pdf that only contains the initial part of the proceedings (from the cover to the table of contents). The software adds it automatically, but please check that it is there; otherwise the Proceedings cannot be added to the ACL Anthology.
  3. A folder named attachments containing all files attached to the individual papers during their submission (e.g., the code attached to a paper). Notice that each attachment must be correctly referred to in the papers.yml file with respect to the base folder named attachments. This folder can be omitted only if no paper has an attachment.
  4. A folder named inputs containing all the input files used to generate the proceedings. In particular, this folder must contain the input yml and tex files used. You can also add the non-watermarked PDFs in the subfolder inputs/papers. Please avoid adding the attachments of the individual papers (e.g., the code or software) here. They must be collected in the attachments folder described above. This folder is automatically built by the software and copied into the output folder, but please remember to check it.
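The checklist above can be sketched as a small script to run on the output directory before shipping it. This is only an illustration: `check_output_dir` is a hypothetical helper, not something aclpub2 provides, and it checks for presence only, not for the content of the files.

```python
from pathlib import Path


def check_output_dir(output_dir):
    """Return a list of human-readable problems found in output_dir."""
    root = Path(output_dir)
    problems = []
    if not (root / "proceedings.pdf").is_file():
        problems.append("missing proceedings.pdf")
    if not (root / "watermarked_pdf").is_dir():
        problems.append("missing watermarked_pdf/ folder")
    elif not (root / "watermarked_pdf" / "0.pdf").is_file():
        problems.append("missing watermarked_pdf/0.pdf (front matter)")
    if not (root / "inputs").is_dir():
        problems.append("missing inputs/ folder")
    # attachments/ may be omitted when no paper has an attachment,
    # so its absence is deliberately not reported.
    return problems
```

An empty list means that the items named in the checklist are present.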

Upload the resulting file (ACRONYM_data.tgz) to a file server or cloud storage (e.g., Google Drive) and email the link to it to the ACL publication chairs, who will assemble them for delivery to the Anthology. Please do not send the file as an email attachment.
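One possible way to produce the archive is with Python's standard tarfile module. The sketch below assumes the archive should be named ACRONYM_data.tgz and should contain a single top-level folder named after the acronym, as described in this section; `package_output` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def package_output(output_dir, acronym, dest_dir="."):
    """Create ACRONYM_data.tgz with output_dir stored under a folder named ACRONYM.

    dest_dir should lie outside output_dir so the archive does not include itself.
    """
    archive = Path(dest_dir) / f"{acronym}_data.tgz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname renames the top-level folder inside the archive.
        tar.add(str(output_dir), arcname=acronym)
    return archive
```

For instance, `package_output("output", "HumEval")` would write HumEval_data.tgz to the current directory.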

REALLY IMPORTANT: Before generating the final proceedings, please carefully check the input PDFs of the camera-ready papers with the ACLPUBCHECK tool, a Python tool that automatically detects author formatting errors, margin violations, and many other common formatting errors in papers that use the LaTeX sty file associated with ACL venues. The tool and instructions to use it can be found here. We strongly suggest sharing this tool with the authors before the submission of their final camera-ready papers, in order to reduce the effort of checking possibly hundreds of papers.

Manually editing yml input files

Below you can find instructions (and examples) on how you should edit the .yml files with information on your conference/workshop.

conference_details.yml

This file should contain the key information about the conference, such as its name, abbreviation and so on. It is used to build the cover of the proceedings, watermarks, and other items.

Note that the ISBN of your conference/workshop will be provided by ACL.

book_title: name of the book; it should be in the form "Proceedings of ..." and it will be used in the bib file to name the event and to watermark the individual papers
event_name: name of the Conference or Workshop; it will be used in the frontmatter of the proceedings
cover_subtitle: the subtitle used in the cover of the proceedings, it can be in the form "Proceedings of the Conference, Vol. 1  (Long Papers)" or "Proceedings of the Workshop"
anthology_venue_id: conference/workshop abbreviation or acronym, e.g. EMNLP
start_date: Conference start date YYYY-MM-dd
end_date: Conference end date YYYY-MM-dd
isbn: ISBN number of the proceeding (assigned by the ACL)
location: location of the conference
editors: list of the editors of the volume, in the form 
  - first_name: name of the editor (e.g., John)
    middle_name: middle name of the editor (e.g., D.)
    last_name: surname of the editor (e.g., Walker)
publisher: publisher of the conference, generally "Association for Computational Linguistics"
volume_name: a tag used by the ACL Anthology to characterize the new volume in a group of proceedings. For a volume of the main conference, it should be a tag from the list long|short|srw|demo|findings. For other volumes, such as workshops, it should be set to 1
watermark_book_title: [optional] If you do not want to use the text in the book_title as a watermark, you can specify an alternative form here. It is particularly useful when the book_title is too long: in this case you can copy that text into this field and insert the line break symbol \\ or, if the text is enclosed between " ", \\\\

Notice: avoid LaTeX escape codes and simply use UTF-8 characters, e.g., Rilić instead of Rili\'{c}.

Here are some examples, first for a conference:

book_title: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Long Papers)
event_name: The 60th Annual Meeting of the Association for Computational Linguistics
cover_subtitle: Proceedings of the Conference (Long Papers)
anthology_venue_id: ACL
start_date: 2022-05-22
end_date: 2022-05-27
isbn: XXX-X-XXXXXX-XX-X  # you should replace this with the real ISBN
location: Dublin, Ireland
editors:
  - first_name: Smaranda
    last_name: Muresan
  - first_name: Preslav
    last_name: Nakov
  - first_name: Aline
    last_name: Villavicencio
publisher: Association for Computational Linguistics
volume_name: long
watermark_book_title: "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics\\\\Volume 1: Long Papers"

and for a workshop:

book_title: Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval 2021)
event_name: The 2nd Workshop on Human Evaluation of NLP Systems
cover_subtitle: Proceedings of the Workshop
anthology_venue_id: HumEval
start_date: 2022-05-27
end_date: 2022-05-27
isbn: XXX-X-XXXXXX-XX-X  # you should replace this with the real ISBN
location: Dublin, Ireland
editors:
  - first_name: Anya
    last_name: Belz
  - first_name: Maja
    last_name: Popović
  - first_name: Ehud
    last_name: Reiter
  - first_name: Anastasia
    last_name: Shimorina
publisher: Association for Computational Linguistics
volume_name: 1
watermark_book_title: Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval 2021)
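The field rules above can be sketched as a simple consistency check. This is an illustration under stated assumptions, not the validation that aclpub2 itself performs: `check_conference_details` is a hypothetical helper, it expects the file already loaded into a dictionary (for example with a YAML parser), and it assumes that every field except watermark_book_title is required.

```python
from datetime import datetime

# Tags listed above for main-conference volumes; other volumes use 1.
MAIN_VOLUME_TAGS = {"long", "short", "srw", "demo", "findings"}
# Assumption: every field except watermark_book_title is required.
REQUIRED_KEYS = [
    "book_title", "event_name", "cover_subtitle", "anthology_venue_id",
    "start_date", "end_date", "isbn", "location", "editors", "publisher",
    "volume_name",
]


def check_conference_details(details):
    """Return a list of problems found in a loaded conference_details mapping."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in details]
    dates = {}
    for key in ("start_date", "end_date"):
        if key not in details:
            continue
        try:
            # str() also covers YAML loaders that return date objects.
            dates[key] = datetime.strptime(str(details[key]), "%Y-%m-%d").date()
        except ValueError:
            problems.append(f"{key} is not in YYYY-MM-dd format")
    if len(dates) == 2 and dates["end_date"] < dates["start_date"]:
        problems.append("end_date is before start_date")
    if "volume_name" in details:
        volume = str(details["volume_name"])
        if volume not in MAIN_VOLUME_TAGS and volume != "1":
            problems.append(f"unexpected volume_name: {volume}")
    return problems
```

An empty list means that the dates parse, are in order, and that volume_name is one of the documented values.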

sponsors.yml

This file should list the sponsors (if any). A directory named sponsor_logos/ containing the related logos should be created in the same directory as the .yml files.

- tier: Name of the tier, e.g. Diamond Level or In Collaboration With
  logos:
    - Path to a logo file relative to the sponsor_logos/ directory, e.g. facebook.png

prefaces.yml

This file should list the prefaces that will be included in the proceedings. A directory named prefaces/ containing the .tex files that provide the text of the prefaces should be created in the same directory as the .yml files.

- title: Title of the preface, e.g. "Preface by the General Chair"
  file: Name of the file relative to the prefaces/ directory containing the preface text, e.g. general_chair.tex

The contents of the .tex files should not include usual headers and footers found within LaTeX files. Instead, they should only contain the contents between the \begin{document} and \end{document} directives. Frequently, this will simply be plaintext, with a few formulas, figures, or tables.
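A quick way to spot preface files that still carry a full LaTeX document wrapper is to scan them for document-level commands. The sketch below is hypothetical (`preamble_lines` is not part of aclpub2), and the set of commands it flags is an assumption about what counts as "headers and footers".

```python
import re

# Commands that normally live outside the document body (assumed list).
PREAMBLE_MARKERS = re.compile(
    r"\\(?:documentclass|usepackage|begin\{document\}|end\{document\})"
)


def preamble_lines(tex_source):
    """Return (line_number, line) pairs that look like document-level LaTeX."""
    return [
        (number, line.strip())
        for number, line in enumerate(tex_source.splitlines(), start=1)
        if PREAMBLE_MARKERS.search(line)
    ]
```

A preface that contains only body text yields an empty list; any reported lines should be removed from the file.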

organizing_committee.yml

This file should list the members of the organizing committee. You should edit this file manually.

- role: Name of role, e.g. General Chair
  members:
    - first_name: Committee member first name
      middle_name: Committee member middle names
      last_name: Committee member last name
      institution: Committee member's institution name as it should appear, e.g. University of California Berkeley, USA

program_committee.yml

This file should list the members of the program committee. You can edit this file manually, or export it from OpenReview (see How to export yml files from OpenReview).

- role: Name of role, e.g. General Chair
  members:
    - first_name: Committee member first name
      middle_name: Committee member middle names
      last_name: Committee member last name
      institution: Committee member's institution name as it should appear, e.g. University of California Berkeley, USA
- role: Reviewers
  type: name_block  # By adding the name_block type in the role, names will be output in alphabetized blocks.
  entries:
    - Committee Member Name

invited_talks.yml

This optional file should list the invited talks and associated abstracts and bios. A directory named invited_talks/ containing the .tex files that provide the text of the abstracts and the bios should be created in the same directory as the .yml files. As with the prefaces, the contents of the .tex files should not include the usual headers and footers found within LaTeX files, only what is usually found between the \begin{document} and \end{document} directives.

- speaker_name: "Speaker name as it should appear, e.g., Jane Doe"
  institution: "Speaker's institution name as it should appear, e.g., University of California Berkeley, USA"
  title: "The title of the talk."
  abstract_file: "Path to the abstract's LaTeX file relative to the invited_talks/ directory, e.g., invited_talks/jane_doe_abstract.tex"
  bio_file: "Path to the bio's LaTeX file relative to the invited_talks/ directory e.g., invited_talks/jane_doe_bio.tex"
  photo: "Path to the speaker's photo, relative to the invited_talks/ directory e.g., invited_talks/jane_doe_photo.jpg"
  date: "Day of the invited talk, e.g., Mon, March 18, 2024"
  time: "Time of the invited talk, e.g., 09:00 -- 10:00"
  location: "Location of the invited talk, e.g., Room A"
  custom_prefix: "Custom title for the page, e.g., Distinguished Lecture. This field allows customizing the default title of the page. If not provided, 'Keynote' is used."

papers.yml

This file should list the accepted papers, along with a directory (named papers/) containing the associated PDFs. Each of the listed papers must have a unique ID so that it can be referred to by ID within the program.yml file later on. You can edit this file manually, or export it from OpenReview (see How to export yml files from OpenReview).

- id: Unique ID for the paper.
  authors:  # List of authors, structure detailed below.
    - first_name: First name e.g. Jane
      middle_name: (opt) Middle name e.g. Emily
      last_name: Last name e.g. Doe
      preferred_name: (opt) Preferred name, if not the same as first_name.
      institution: Name of the author's institution.
      email: Author's email.
      openreview: (opt) Author's OpenReview username.
      google_scholar: (opt) Author's Google Scholar ID.
      orcid: (opt) Author's ORCID ID.
      dblp: (opt) Author's DBLP ID.
      semantic_scholar: (opt) Author's Semantic Scholar ID.
  attributes:
    # Key-value pairs used to manage other aspects of
    # the publication process. Below are examples of possible
    # attributes. These attributes are not shown in the proceedings ... 
    # but these are really useful in other steps, e.g., in the 
    # definition of the program.
    paper_type: long | short
    presentation_type: oral | poster
    submitted_area: Semantics | Machine Learning | ...
  file: File name relative to the papers/ directory, e.g. 1.pdf
  attachments:
    # A list of additional files associated with the paper.
    # The type and the file must both be specified.
    - type: dataset | note | poster | presentation | software | attachment
      file: Local file path, e.g. 5.zip
  title: Title of the paper.
  abstract: Abstract of the paper, usually a LaTeX fragment.
  archival: Whether or not the paper is archival. Default is true; set to false to
      exclude a paper from the proceedings.

Please notice that the file field in the attachments group cannot contain external URLs: only files added to the attachments folder can be referred to, using their relative path.
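The constraints on papers.yml (unique IDs, PDFs under papers/, attachments as local files) can be sketched as a check over the loaded list of papers. This is a hypothetical helper, not part of aclpub2; it assumes the file has already been parsed into a list of dictionaries and that attachments live in an attachments/ folder next to papers/.

```python
from collections import Counter
from pathlib import Path


def check_papers(papers, input_dir):
    """Return a list of problems found in a loaded papers.yml list."""
    root = Path(input_dir)
    problems = []
    # Every paper ID must be unique.
    counts = Counter(str(paper.get("id")) for paper in papers)
    problems += [f"duplicate paper id: {pid}" for pid, n in counts.items() if n > 1]
    for paper in papers:
        pid = paper.get("id")
        pdf = paper.get("file")
        if not pdf or not (root / "papers" / str(pdf)).is_file():
            problems.append(f"paper {pid}: PDF not found under papers/")
        for attachment in paper.get("attachments") or []:
            target = str(attachment.get("file", ""))
            if "://" in target:
                problems.append(f"paper {pid}: attachment is an external URL")
            elif not (root / "attachments" / target).is_file():
                problems.append(f"paper {pid}: attachment {target} not found")
    return problems
```

Running it on the parsed papers.yml before generation can catch missing files early, which is cheaper than a failed LaTeX build.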

How to export yml files from OpenReview

When running your workshop on OpenReview, you can use the OpenReview API to extract the papers.yml and program_committee.yml files automatically. For this purpose, the folder openreview provides two Python3 scripts:

  1. or2papers.py: it creates the papers.yml file by extracting the papers whose "Decision" is marked as "accepted";

  2. or2program_committee.py: it creates the program_committee.yml file by retrieving the list of Senior Area Chairs registered for the workshop on OpenReview and the list of reviewers.

These scripts are designed to be used by the workshop's Program Chairs, because the queries to OpenReview require the corresponding access permissions. To use these scripts, you will need a username (the e-mail address used to log in to OpenReview), a password (the password associated with the user's account), and the workshop_ID (the OpenReview identifier linked to the workshop).

Workshop ID: you can find out the workshop's identifier by following one of the two approaches below:

  1. Workshop ID is identified as "venue ID" on the setup website.

  2. Workshop ID is present in the workshop's URL, as the value of the id field. For example, the ID of the ACL conference (https://openreview.net/group?id=aclweb.org/ACL/2022/Conference) is "aclweb.org/ACL/2022/Conference". Note that & is a separator in the URL, so anything after it is not part of the workshop ID.
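The second approach can be sketched with Python's standard URL parser, which splits the query string on & and so drops anything after it. The function name `workshop_id_from_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def workshop_id_from_url(url):
    """Extract the value of the id query parameter from an OpenReview group URL."""
    query = parse_qs(urlparse(url).query)
    values = query.get("id")
    if not values:
        raise ValueError(f"no id parameter found in {url!r}")
    return values[0]
```

For the URL above, the function returns aclweb.org/ACL/2022/Conference, even if further parameters follow after an &.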

Before running the scripts

Requirements

These scripts require Python3 and the OpenReview API to be installed on your machine. To install the OpenReview API, please go to https://openreview-py.readthedocs.io/en/latest/how_to_setup.html

Updated data

The scripts based on the OpenReview API retrieve all information directly from OpenReview. In other words, all SACs, reviewers and authors must keep their OpenReview profiles up to date (mainly name and affiliation).

or2papers.py

This script finds the intersection of all blind submissions and the submissions whose decision is set to accepted. The information for those papers is stored in the papers.yml file, and their files are downloaded to the "papers" and "attachments" folders. The download includes the PDF and any additional attachments provided during submission. Note that papers are returned in random order, so different runs of or2papers.py will sort the papers differently. To run or2papers.py, type:

python or2papers.py USER PASSWORD WORKSHOP_ID

For example:

python or2papers.py myuser@acl.com 123456 aclweb.org/ACL/2022/Conference

The above command will complain that "The output of this run cannot be used at ACLPUB2". Two additional parameters, --all and --pdfs, ensure that the PDFs are downloaded, so you should run:

python or2papers.py myuser@acl.com 123456 aclweb.org/ACL/2022/Conference --all --pdfs

or2program_committee.py

This script searches all Senior_Area_Chairs and Program_Chairs under your conference and saves their information in the program_committee.yml file.

To run or2program_committee.py, type:

python or2program_committee.py USER PASSWORD WORKSHOP_ID

For example:

python or2program_committee.py myuser@acl.com 123456 aclweb.org/ACL/2022/Conference

:warning: Warnings

  1. Workshops that accept ARR commitments should be aware that the or2program_committee.py script only extracts data of submitted/committed papers.

  2. During the script execution, you may see a message such as "ERROR: or_id not found". It means that the script could not retrieve the profile's information from OpenReview. In that case, you must manually insert the data into papers.yml or program_committee.yml. You can identify the problematic OpenReview IDs and their papers in paper.log and program_committee.log.

Testing the tool to generate your proceedings

Now that you know the expected structure of the proceedings and you know how to edit/export the required .yml input files, you are ready to test the tool to automatically generate the proceedings. First of all, follow the Setup procedures.

Then, as a training example, the examples/sigdial directory provides all the files you need to correctly generate the proceedings.

Could you compile the sigdial proceedings? :confetti_ball:

Excellent, you are now ready to run the generation scripts on the files you have just edited/exported for your conference/workshop.

Setup: Install python dependencies.

python -m pip install -r requirements.txt

Setup: Install Java

Java is required to use the pax LaTeX library, which is responsible for extracting and reinserting PDF links. Visit the Java website for installation instructions.

Setup: Install pdflatex and associated dependencies.

Ubuntu/Debian

sudo apt-get install texlive-latex-base texlive-latex-recommended texlive-latex-extra texlive-fonts-recommended texlive-fonts-extra texlive-bibtex-extra texlive-lang-all

OSX

Install mactex.

One way to do this is to install Homebrew first and then run:

brew install mactex

Test Run

Ensure that PYTHONPATH includes ., for example export PYTHONPATH=.:$PYTHONPATH.

Run the CLI on the SIGDIAL example directory:

./bin/generate examples/sigdial --proceedings

The generated results, along with intermediate files and links, can then be found in the output directory, which is created in the directory from which you ran the command.
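If you prefer to drive the test run from Python, for instance in a larger build script, the two steps above can be sketched as follows. Both function names are hypothetical; the only things taken from this README are the PYTHONPATH requirement and the ./bin/generate command line.

```python
import os
import subprocess


def env_with_pythonpath(base_env=None, extra="."):
    """Return a copy of the environment with `extra` prepended to PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("PYTHONPATH")
    env["PYTHONPATH"] = extra if not current else extra + os.pathsep + current
    return env


def run_generate(input_dir, repo_root="."):
    """Run ./bin/generate INPUT_DIR --proceedings from the repository root."""
    return subprocess.run(
        ["./bin/generate", input_dir, "--proceedings"],
        cwd=repo_root,
        env=env_with_pythonpath(),
        check=True,  # raise CalledProcessError if the build fails
    )
```

Calling `run_generate("examples/sigdial")` from the repository root is meant to be equivalent to the shell commands shown above.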

Usage

As said before, the generation scripts accept as input the path to a directory containing a set of .yml files and directories. The expected input directory structure and the CLI are detailed below.

CLI

# Generates the proceedings.
./bin/generate examples/sigdial --proceedings

# Generates the handbook.
./bin/generate examples/sigdial --handbook

# Generates both.
./bin/generate examples/sigdial --proceedings --handbook

# Generates both and overwrites the existing contents of the build directory.
./bin/generate examples/sigdial --proceedings --handbook --overwrite

Users may wish to make modifications to the output .tex files. We recommend first copying the .tex files to a new working directory. In addition, because the existing contents of the build directory are only replaced when the --overwrite flag is passed, local modifications are not erased accidentally.

Development

The above describes a reasonable default usage of this package, but the behavior can easily be extended or modified by adjusting the contents of the aclpub2/ directory. The main files to keep in mind are aclpub2/templates/proceedings.tex, which contains the core Jinja template, and aclpub2/generate.py, which is responsible for rendering the template.

Font Encoding

The input templates use the T1 font encoding. If you are interested in different encodings (e.g., Vietnamese) you have to modify the aclpub2/templates/proceedings.tex by changing the statement \usepackage[T1]{fontenc} and specifying a different encoding, e.g., \usepackage[T5]{fontenc}.

Jinja

This project makes extensive use of Jinja to produce readable LaTeX templates. Before contributing or forking, it is generally helpful to familiarize yourself with the Jinja library. Documentation can be found here.

Additional configuration for Jinja can be found in the aclpub2/templates.py file. The purpose of this file is to set up the Jinja environment with LaTeX-like block delimiters so that the proceedings.tex file can be syntax highlighted and otherwise interacted with in a fashion that is more natural for LaTeX users. In addition, it is also responsible for configuring some convenience functions that allow us to create some LaTeX structures in the final output .tex file that are easier to write in native Python than in either the Jinja base syntax or LaTeX alone.

Handbook generation instructions

Work in progress

program.yml

Describes the conference program. This file is organized in blocks, each with a title, start time, and end time, followed by a list of paper IDs. Instead of defining presentations, sessions may define subsessions, which have the same structure as a top-level session.

- title: Title of the conference session, e.g. Opening Remarks
  start_time: Start time of the session as an ISO datestring.
  end_time: End time of the session as an ISO datestring.
  location: Location that the session is taking place in, e.g. Main Hall or Online
  chair: (opt) Name of the chair of the session, e.g. Jane Doe.
  url: (opt) URL to join or view the session, if applicable.
  papers:
  - id: Paper ID
    start_time: Optional start time of the paper slot as an ISO datestring.
    end_time: Optional end time of the paper slot as an ISO datestring.
# Or, if this is a session that is broken into subsessions:
- title: Title of the conference session, e.g. Opening Remarks
  start_time: Start time of the session as an ISO datestring.
  end_time: End time of the session as an ISO datestring.
  subsessions:
    - title: Title of the conference session, e.g. Opening Remarks
      start_time: Start time of the session as an ISO datestring.
      end_time: End time of the session as an ISO datestring.
      chair: (opt) Name of the chair of the session, e.g. Jane Doe.
      location: Location that the session is taking place in.
      papers:
      - id: Paper ID
        start_time: Optional start time of the paper slot as an ISO datestring.
        end_time: Optional end time of the paper slot as an ISO datestring.
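Since the handbook support is still a work in progress, a small consistency check on program.yml may be useful. The sketch below is hypothetical and not part of aclpub2: it assumes the file has been parsed into a list of dictionaries, that times are ISO strings accepted by `datetime.fromisoformat`, and that paper IDs should match those in papers.yml.

```python
from datetime import datetime


def check_sessions(sessions, known_paper_ids, path="program"):
    """Return a list of problems found in a loaded program.yml session list."""
    known = {str(pid) for pid in known_paper_ids}
    problems = []
    for index, session in enumerate(sessions):
        label = f"{path}[{index}]"
        try:
            start = datetime.fromisoformat(str(session["start_time"]))
            end = datetime.fromisoformat(str(session["end_time"]))
            if end < start:
                problems.append(f"{label}: session ends before it starts")
        except (KeyError, ValueError, TypeError):
            problems.append(f"{label}: start_time/end_time missing or not ISO")
        for paper in session.get("papers") or []:
            if str(paper.get("id")) not in known:
                problems.append(f"{label}: unknown paper id {paper.get('id')}")
        # Subsessions share the structure of top-level sessions, so recurse.
        problems += check_sessions(
            session.get("subsessions") or [], known, path=f"{label}.subsessions"
        )
    return problems
```

Passing the list of IDs from papers.yml as `known_paper_ids` ties the two files together, which is the reason the IDs must be unique.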