pyro-ppl / pyroed

Bayesian optimization of discrete sequences
Apache License 2.0

Open datasets for evaluation #4

Open fritzo opened 2 years ago

fritzo commented 2 years ago

What are some open datasets we could use for evaluation? We will need them to answer #3 about hyperparameters and algorithms.

cc @andrenguyen

fritzo commented 2 years ago

Moss et al. (2020) (section 5.2 and appendix E) evaluate their algorithm using minimum free-folding energy as the objective when optimizing short proteins, and they defer to ViennaRNA to compute that objective in their experiments. Here is an example where they call the RNAfold utility as a subprocess.

We acknowledge that [minimizing minimum free-fold energy] may not be biologically meaningful on its own, however, as free-folding energy is of critical importance to other down-stream genetic prediction tasks, we believe it to be a reasonable proxy for wet-lab-based genetic design loops.
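
For reference, a minimal sketch of what such an objective could look like if we wrapped RNAfold ourselves. This is not the code from Moss et al.; the function names `parse_rnafold_output` and `min_free_energy` are hypothetical, and it assumes that ViennaRNA's `RNAfold` is on the PATH, that it accepts `--noPS` to skip the PostScript plot, and that it prints the sequence followed by a structure line ending in the energy in parentheses.

```python
import re
import shutil
import subprocess

# Matches the trailing energy field, e.g. "( -1.20)", on the structure line.
_ENERGY_RE = re.compile(r"\(\s*([-+]?\d+(?:\.\d+)?)\s*\)\s*$")


def parse_rnafold_output(text: str) -> float:
    """Extract the minimum free energy (kcal/mol) from RNAfold's stdout.

    Assumes the last non-empty line looks like "(((...))) ( -1.20)".
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty RNAfold output")
    match = _ENERGY_RE.search(lines[-1])
    if match is None:
        raise ValueError(f"no energy found in line: {lines[-1]!r}")
    return float(match.group(1))


def min_free_energy(sequence: str, executable: str = "RNAfold") -> float:
    """Hypothetical objective: fold one sequence by calling RNAfold as a subprocess."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    result = subprocess.run(
        [executable, "--noPS"],   # assumed flag: suppress the PostScript drawing
        input=sequence + "\n",    # RNAfold reads sequences from stdin
        capture_output=True,
        text=True,
        check=True,               # raise if RNAfold exits with an error
    )
    return parse_rnafold_output(result.stdout)
```

With ViennaRNA installed, `min_free_energy("GGGAAAUCC")` would return a float that an optimizer can minimize. Keeping the parser separate from the subprocess call means the parsing can be tested without RNAfold present.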

fritzo commented 2 years ago

Angermueller et al. (2020) (section 5) provide a number of in-silico benchmarking problems, including tfbind8 and tfbind10.

EWeinstein commented 2 years ago

I've worked with Tcellmatch (Fischer et al. 2020) before; it makes predictions from short sequences (CDR3s), including variable-length sequences. I believe @andrenguyen also has some recent experience with this model.