iRNA-COSI / APAeval

Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples
MIT License

APAeval

Welcome to the APAeval GitHub repository.

APAeval is a community effort that was born as the APAeval hackathon at the RNA 2021 Conference. We are aiming to evaluate computational methods for the detection and quantification of poly(A) sites from RNA-seq samples in an open, reproducible and extensible manner.

Overview of APAeval benchmarking

APAeval currently consists of three benchmarking events, each consisting of a set of challenges for bioinformatics methods (=participants) that use RNA-seq data to:

  1. Identify polyadenylation sites
  2. Report poly(A) site expression as absolute quantification in TPM
  3. Report relative expression of poly(A) sites within transcripts
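To make the difference between events 2 and 3 concrete, here is a minimal sketch of how absolute per-site values could be turned into relative usage. It assumes that relative expression means each site's share of the summed expression of all sites in the same group (for example a gene or terminal exon). The function name, the grouping and the example values are hypothetical, and the exact definition used by APAeval is the one given in its specifications.

```python
from collections import defaultdict


def relative_usage(site_tpm, site_to_group):
    """Return each site's share of the total expression of its group.

    site_tpm:      dict mapping a poly(A) site ID to its expression in TPM
    site_to_group: dict mapping a poly(A) site ID to a group ID
                   (hypothetical grouping, e.g. by gene)
    """
    # Sum the expression of all sites that belong to the same group.
    group_totals = defaultdict(float)
    for site, tpm in site_tpm.items():
        group_totals[site_to_group[site]] += tpm

    usage = {}
    for site, tpm in site_tpm.items():
        total = group_totals[site_to_group[site]]
        # Usage is undefined for a group without expression; report 0.0 here.
        usage[site] = tpm / total if total > 0 else 0.0
    return usage


# Made-up example values, for illustration only.
site_tpm = {"geneA_pas1": 30.0, "geneA_pas2": 10.0, "geneB_pas1": 5.0}
site_to_group = {
    "geneA_pas1": "geneA",
    "geneA_pas2": "geneA",
    "geneB_pas1": "geneB",
}
usage = relative_usage(site_tpm, site_to_group)
print(usage)
# {'geneA_pas1': 0.75, 'geneA_pas2': 0.25, 'geneB_pas1': 1.0}
```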

We'd still like to set up a fourth event to evaluate tools that calculate differential usage of polyadenylation sites. If you'd like to contribute, continue reading below.

Schema of the APAeval benchmarking setup (figure). The numbered notes below refer to it.

  1. As described above, APAeval consists of three benchmarking events to evaluate the performance of different tasks that the methods of interest (=participants) might be able to perform: PAS identification, absolute quantification, and relative quantification. A method can participate in one, two or all three events, depending on its functions.
  2. Raw data: For challenges within the benchmarking events, APAeval uses data from several selected publications. Generally, one dataset (consisting of one or more samples) corresponds to one challenge (here, datasets for challenges x and y are depicted). All raw RNA-seq data is processed with nf-core/rnaseq for quality control and mapping. For each dataset we provide a matching ground truth file, created from 3’ end seq data from the same publications as the raw RNA-seq data, that will be used in the challenges to assess the performance of participants. You can find an overview of RNA-seq and matching ground truth samples in the APAeval Zenodo snapshot.
  3. Sanctioned input files: The processed input data is made available in .bam format. Additionally, for each dataset a GENCODE annotation in .gtf format is provided, as well as a reference PAS atlas in .bed format for participants that depend on pre-defined PAS (not shown).
  4. Method workflows: To evaluate each participant in different challenges, a reusable “method workflow” has to be written in either Snakemake or Nextflow. This workflow must perform all pre- and post-processing steps needed to get from the input formats provided by APAeval (see 3.) to the output formats specified by APAeval (see 5.).
  5. To ensure compatibility with the workflows of the benchmarking events, specifications for file formats (output of method workflows = input for benchmarking workflows) are provided by APAeval.
  6. Within a benchmarking event, one or more challenges will be performed. A challenge is primarily defined by the input dataset used for performance assessment. Results of a challenge (metrics) are computed for each participant within a "benchmarking workflow".
  7. In order to compare the performance of participants, results for each participant are uploaded to the OEB database, where metrics for all participants are visualized per challenge.
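As an illustration of the kind of comparison a benchmarking workflow makes in step 6, the sketch below counts predicted poly(A) sites that fall within a fixed distance of a ground truth site and derives precision and recall. This is not APAeval's implementation. The function name, the tuple layout and the 50 nt window are assumptions made for this example, and the metrics actually computed are defined in the APAeval metric specifications.

```python
def match_sites(predicted, truth, window=50):
    """Compare predicted poly(A) sites to ground truth sites.

    predicted, truth: iterables of (chrom, strand, position) tuples
    window:           maximum distance in nt for a match (hypothetical default)

    Returns (precision, recall).
    """
    predicted = list(predicted)
    truth = list(truth)

    # Index ground truth positions by chromosome and strand.
    truth_by_key = {}
    for chrom, strand, pos in truth:
        truth_by_key.setdefault((chrom, strand), []).append(pos)

    true_positives = 0
    matched_truth = set()
    for chrom, strand, pos in predicted:
        candidates = truth_by_key.get((chrom, strand), [])
        hits = [t for t in candidates if abs(t - pos) <= window]
        if hits:
            true_positives += 1
            # Credit only the nearest ground truth site as recovered.
            nearest = min(hits, key=lambda t: abs(t - pos))
            matched_truth.add((chrom, strand, nearest))

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = len(matched_truth) / len(truth) if truth else 0.0
    return precision, recall


# Made-up coordinates, for illustration only.
truth = [("chr1", "+", 1000), ("chr1", "+", 5000), ("chr2", "-", 300)]
predicted = [
    ("chr1", "+", 1010),  # within 50 nt of chr1:+:1000
    ("chr1", "+", 2000),  # no ground truth site nearby
    ("chr2", "-", 290),   # within 50 nt of chr2:-:300
    ("chr2", "+", 300),   # right position, wrong strand
]
precision, recall = match_sites(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Matching per strand and crediting each prediction to its nearest ground truth site keeps the example simple; a real benchmark also has to decide how to handle several predictions that hit the same site.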

What can you do?

Use a benchmarked method on your own RNA-seq data

First, you might want to check our manuscript or our OpenEBench site to find the method that would perform best for your use case. Once you have decided on a method, head over to the method workflows section in this repo and follow the instructions in the README.md of the method of your choice. All our method workflows are built in either Snakemake or Nextflow, and use containers for individual steps to ensure reproducibility and reusability. For instructions on how to set up a conda environment for running APAeval workflows, see the "APAeval conda environment" section below.

You'll need to have your RNA-seq data ready in .bam format. No idea how to get there? You could check out the nf-core RNA-Seq analysis pipeline or other tools such as ZARP.

Benchmark a new method

Have you developed a new computational method for investigating APA from RNA-seq data? Or are you interested in one of the tools we haven't managed to include in APAeval yet? We'd be very happy if you decided to contribute to APAeval!

To ensure reproducibility of the benchmarks, as well as reusability and shareability of the benchmarked method, you'd start by writing an APAeval style method workflow. That workflow will take .bam files as input and create .bed files compatible with the specification for the respective APAeval benchmarking event. Create a pull request (PR) in this repo and wait for it to be approved; to do so, either ask on our GitHub Discussions board to be added to APAeval as a collaborator, or create the PR from a fork. You can then run the workflow on the data for all APAeval challenges and use the resulting .bed files in the corresponding APAeval benchmarking workflow to compare the performance of your tool to the APAeval ground truths. Finally, you can submit your metrics .json files to us and we'll take care of including them in our OEB site.
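Before handing .bed files to a benchmarking workflow, it can help to check that they are at least well-formed. The sketch below parses generic six-column BED lines (chrom, start, end, name, score, strand) and fails loudly on malformed records. The function name is hypothetical, and reading the score column as a float is an assumption made for this example; which columns are required and what they must contain is defined by the APAeval specification for each benchmarking event.

```python
def parse_bed6(lines):
    """Parse tab-separated BED6 lines into a list of dicts.

    Raises ValueError on records that are not well-formed BED6.
    """
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip empty lines, comments and UCSC-style header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(
                f"line {lineno}: expected 6 columns, got {len(fields)}"
            )
        chrom, start, end, name, score, strand = fields[:6]
        # int() and float() raise ValueError themselves on non-numeric input.
        start, end = int(start), int(end)
        if start < 0 or end <= start:
            raise ValueError(f"line {lineno}: invalid interval {start}-{end}")
        if strand not in {"+", "-", "."}:
            raise ValueError(f"line {lineno}: invalid strand {strand!r}")
        records.append(
            {
                "chrom": chrom,
                "start": start,  # BED starts are 0-based
                "end": end,      # BED ends are exclusive
                "name": name,
                "score": float(score),  # assumption: numeric score column
                "strand": strand,
            }
        )
    return records


# Made-up records, for illustration only.
example = [
    "chr1\t99\t100\tpas1\t12.5\t+\n",
    "# a comment line\n",
    "chr2\t10\t11\tpas2\t0\t-\n",
]
records = parse_bed6(example)
print(len(records), records[0]["chrom"], records[0]["start"])
```

To check a file on disk, pass the open file handle, for example `with open("my_output.bed") as fh: parse_bed6(fh)`, where the file name is a placeholder.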

Extend APAeval's benchmarks

One of the main goals of APAeval is to provide extensible benchmarking, such that new tools, new challenges or new metrics can be added at any time. We therefore warmly welcome any contribution to the project. A good starting point is to visit our issue and discussion boards. The discussion board is also where you can reach out to us and ask to be added to the repo as a collaborator (alternatively, create your PRs from a fork). You can then take on an existing task, suggest a new one, or start a discussion.

Some technical stuff

OpenEBench

We are partnering with OpenEBench, a benchmarking and technical monitoring platform for bioinformatics tools. OpenEBench development, maintenance and operation are coordinated by Barcelona Supercomputing Center (BSC) together with partners from the European Life Science infrastructure initiative ELIXIR.

OpenEBench tooling will facilitate the computation and visualization of benchmarking results and store the results of all benchmarking events and challenges in their databases, making it easy for others to explore results. This should also make it easy to add additional participants to existing benchmarking events later on. OpenEBench developers are also advising us on creating benchmarks that are compatible with good practices in the wider community of bioinformatics challenges.

APAeval conda environment

For reproducible execution of our workflows (both method and benchmarking workflows) we're using a conda environment with fixed versions of Snakemake, Nextflow, some Python packages, and Singularity. Make sure you have conda installed, then create the APAeval environment from the root directory of this repo with:

conda env create -f apaeval_env.yaml

You can then activate it with:

conda activate apaeval

NOTE: If you're working on Windows or Mac, you might need to set up a virtual machine in order to run Singularity.

ANOTHER NOTE: If you run into problems regarding root access & Singularity with the described setup, try removing the Singularity installation from apaeval_env.yaml and installing Singularity independently.

You can now execute the workflows!
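If a workflow fails right at the start, a quick check that the main tools are visible in the activated environment can save some debugging. The helper below is a minimal sketch, not part of APAeval. The function name is hypothetical, and the executable names snakemake, nextflow and singularity are assumed to be what the environment puts on your PATH.

```python
import shutil


def find_tools(names=("snakemake", "nextflow", "singularity")):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Run it after `conda activate apaeval`; any tool reported as NOT FOUND points to an incomplete environment or, for Singularity, to a separate installation that is not on your PATH.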

Tutorials

Here are some pointers and tutorials for the main software tools that we are using at APAeval:

Conda: tutorial
Docker: tutorial
Git: tutorial
GitHub: general tutorial / GitHub flow tutorial
Nextflow: tutorial
Singularity: tutorial
Snakemake: tutorial

Code of Conduct

Please be kind to one another and mind the Contributor Covenant's Code of Conduct for all interactions with the community. A copy of the Code of Conduct is also shipped with this repository. Please report any violations of the Code of Conduct to apaeval@irnacosi.org.

Open Science, licenses & attribution

Following best practices for writing software and sharing data and code is important to us, and we therefore want to apply the FAIR Principles, as much as possible, to data and software alike. This includes publishing all code open source under permissive licenses approved by the Open Source Initiative, and all data under a permissive Creative Commons license.

In particular, we publish all code under the MIT license and all data under the CC0 license. The exception is the benchmarking workflows, which are published under the GPLv3 license, as the provided template is derived from an OpenEBench example workflow that is itself licensed under GPLv3. A copy of the MIT license is also shipped with this repository.

We also believe that attribution, provenance and transparency are crucial for an open and fair work environment in the sciences, especially in a community effort like APAeval. Therefore, we would like to make clear from the beginning that in all publications deriving from APAeval (journal manuscript, data and code repositories), any non-trivial contributions will be acknowledged by authorship.

We expect that all contributors accept the license and attribution policies outlined above.

Get in touch

If you would like to contribute to APAeval or have any questions, we'd be happy to hear from you via our GitHub Discussions board. If you already have a specific issue in mind, feel free to add it to our issues board. You can also reach out to apaeval@irnacosi.org.

How to cite APAeval

If APAeval was useful for you in your work, please cite our manuscript:

Extensible benchmarking of methods that identify and quantify polyadenylation sites from RNA-seq data
Sam Bryce-Smith, Dominik Burri, Matthew R. Gazzara, Christina J. Herrmann, Weronika Danecka, Christina M. Fitzsimmons, Yuk Kei Wan, Farica Zhuang, Mervin M. Fansler, José M. Fernández, Meritxell Ferret, Asier Gonzalez-Uriarte, Samuel Haynes, Chelsea Herdman, Alexander Kanitz, Maria Katsantoni, Federico Marini, Euan McDonnel, Ben Nicolet, Chi-Lam Poon, Gregor Rot, Leonard Schärfen, Pin-Jou Wu, Yoseop Yoon, Yoseph Barash, Mihaela Zavolan
bioRxiv 2023.06.23.546284; doi: https://doi.org/10.1101/2023.06.23.546284

Contributors ✨

Thanks go to these wonderful people (emoji key):

Chelsea Herdman: 📆 📋 🤔 👀 📢 📖
ninsch3000: 💻 🔣 📖 🎨 📋 🧑‍🏫 📆 💬 👀 📢 🤔 🐛
Euan McDonnell: 💻 🤔 🧑‍🏫
Alex Kanitz: 🐛 💻 📖 💡 📋 🤔 🚇 🚧 🧑‍🏫 📆 💬 👀 📢
Yuk Kei Wan: 🐛 📝 💻 🔣 📖 💡 📋 🤔 🧑‍🏫 📆 💬 ⚠️
Ben: 🔣 🤔 📆
pjewell-biociphers: 🚧
mzavolan: 🔣 📖 📋 💵 🤔 🧑‍🏫 📆 💬 👀 📢
Mervin Fansler: 🐛 💻 📖 📋 🤔 🧑‍🏫 📆 💬 👀
Maria Katsantoni: 💻 🤔 🧑‍🏫 💬
daneckaw: 💻 🔣 📋 🤔 📆
Dominik Burri: 🐛 💻 🔣 📖 💡 📋 🤔 🚇 🧑‍🏫 📆 💬 ⚠️
mrgazzara: 💻 📖 🔣 📋 🤔 🚇 🚧 📆 🧑‍🏫 📢
Christina Fitzsimmons: 📖 📋 🤔 📆 📢
Leo Schärfen: 💻 🤔 📢
poonchilam: 💻 🤔 💬
dseyres: 💻 📖 🤔
Pierre-Luc: 🔣 📖 📋 🤔 📆
SamBryce-Smith: 💻 🤔 🐛 📖 🚧 🧑‍🏫 📆 💬 👀 📢
Pin-Jou Wu: 💻 🤔
yoseopyoon: 💻 🤔
Farica Zhuang: 🐛 💻 📖 🤔 🚧 📆 💬 👀
Asier Gonzalez: 🐛 💻 💡 🤔 🚇 🧑‍🏫 📆 💬
txellferret: 💻 💡 🤔 🚇 🧑‍🏫 💬
Gregor Rot: 🐛 💻 🤔 🚧 👀
José María Fernández: 🤔 🚇 🧑‍🏫

This project follows the all-contributors specification. Contributions of any kind welcome!