Add way to load curriculum from file

coreylowman commented 2 years ago

This is semi-related to the discussion in #38. The CLI will need a way to load a curriculum from a file. Options are:

Dependency injection approach by dynamically importing an unknown Python file
Constructing a curriculum via configuration in YAML
Have a registry of curriculums that can be referred to by string name (a la gym environments)
???
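For reference, a minimal sketch of what the first option (dynamic import) could look like using only the standard library. The helper name load_attr_from_file and the idea of passing the class name alongside the path are assumptions for illustration, not anything tella currently provides.

```python
import importlib.util
from pathlib import Path


def load_attr_from_file(path, attr_name):
    """Import the Python file at `path` and return its attribute `attr_name`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot import {path}")
    module = importlib.util.module_from_spec(spec)
    # Note: this executes whatever code is in the user's file.
    spec.loader.exec_module(module)
    try:
        return getattr(module, attr_name)
    except AttributeError:
        raise ImportError(f"{path} does not define {attr_name!r}") from None
```

A CLI could then accept something like a path plus a class name and call load_attr_from_file(path, class_name) to get the curriculum class.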

coreylowman commented 2 years ago

@cash: @gkvallabha has had issues with loading curriculums from YAML files, so you should touch base with him.

coreylowman commented 2 years ago

I think the only use for this would be passing a path to the curriculum on the command line (i.e. the current CLI), right?

We may want to punt on this issue and the CLI plan in favor of telling people to import their agent, import the curriculum, and call run_experiment in tella.

cash commented 2 years ago

We can have a registry of current curricula and then can specify the key on the command line. I don't see a reason to abandon the CLI yet.
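A rough sketch of the registry idea, assuming a plain dict keyed by name and an argparse option restricted to the registered keys. The names CURRICULUM_REGISTRY, register_curriculum, ExampleCurriculum, and the --curriculum flag are all hypothetical.

```python
import argparse

# Hypothetical registry: maps a string key to a curriculum class.
CURRICULUM_REGISTRY = {}


def register_curriculum(name):
    """Class decorator that records the class under `name`."""
    def decorator(cls):
        if name in CURRICULUM_REGISTRY:
            raise ValueError(f"curriculum {name!r} is already registered")
        CURRICULUM_REGISTRY[name] = cls
        return cls
    return decorator


@register_curriculum("ExampleCurriculum-v0")
class ExampleCurriculum:
    """Placeholder curriculum used only to populate the registry."""


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Only registered keys are accepted on the command line.
    parser.add_argument(
        "--curriculum", choices=sorted(CURRICULUM_REGISTRY), required=True
    )
    return parser.parse_args(argv)
```

The CLI would then look up CURRICULUM_REGISTRY[args.curriculum] to get the class, and argparse rejects unknown keys with a usage error.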

gkvallabha commented 2 years ago

When we did this with TEF (in L2M Phase 1), we used a data-driven approach (a JSON file for the curriculum). This quickly ran into limitations, e.g.

compactly representing repeatable blocks (e.g., if I have Task1 for 100 exps, Task2 for 200 exps, Task3 for 100 exps, and I want that sequence to repeat)
allowing systematic change of a parameter value (e.g., I want to gradually change a parameter for a task from 1.0 to 2.0). With a JSON file, I have to manually "unroll" this and explicitly specify the value for each parameter step
supporting different ways of setting parameter values (e.g., I want to sample a parameter value from N(0,5) or from Unif(-2,2), or may want other distributions like Exponential).
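To make the contrast concrete, here is a sketch of how the three cases above might look when the curriculum is built in code rather than JSON. The block dictionaries and the helper names (block, repeated, linear_sweep, sampled) are made up for illustration, and treating the 5 in N(0,5) as a standard deviation is an assumption.

```python
import random


def block(task, num_exps, **params):
    """One curriculum block as a plain dict (hypothetical representation)."""
    return {"task": task, "num_exps": num_exps, "params": params}


def repeated(blocks, times):
    """Repeat a sequence of blocks `times` times, copying each dict."""
    return [dict(b) for _ in range(times) for b in blocks]


def linear_sweep(task, num_exps, name, start, stop, steps):
    """One block per step, moving parameter `name` linearly from start to stop."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        block(task, num_exps, **{name: start + (stop - start) * i / (steps - 1)})
        for i in range(steps)
    ]


def sampled(task, num_exps, name, sampler, seed=None):
    """One block whose parameter `name` is drawn by `sampler(rng)`."""
    rng = random.Random(seed)
    return block(task, num_exps, **{name: sampler(rng)})


curriculum = (
    repeated([block("Task1", 100), block("Task2", 200), block("Task3", 100)], 2)
    + linear_sweep("Task1", 100, "gravity", 1.0, 2.0, 5)
    + [
        sampled("Task2", 100, "noise", lambda rng: rng.gauss(0, 5)),
        sampled("Task2", 100, "offset", lambda rng: rng.uniform(-2, 2)),
        sampled("Task2", 100, "delay", lambda rng: rng.expovariate(1.0)),
    ]
)
```

Each of these is one line in code, whereas the JSON equivalent needs every repeated block and every sweep step written out by hand.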

My takeaway was that a data-driven approach is not scalable and potentially hard to debug/understand.

I understand the security concern of doing dynamic imports, though realistically, users are going to be running code they got from a GitHub repo either way. It seems to me that a good alternative is to ask users to set up a short runner script in Python and invoke it (slightly more work for users, but on the flip side, it allows an explicit specification of each "experiment").
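As a sketch of what such a runner script might look like: everything below is a stand-in written so the example runs on its own. In particular, this run_experiment is a local placeholder, and the real tella entry point may take different arguments.

```python
# runner.py: hypothetical runner script; all names are illustrative stand-ins.
import random


class RandomAgent:
    """Placeholder agent that picks a random action."""

    def __init__(self, seed):
        self.rng = random.Random(seed)

    def act(self):
        return self.rng.choice(["left", "right"])


class TwoTaskCurriculum:
    """Placeholder curriculum: (task name, number of episodes) pairs."""

    def blocks(self):
        return [("Task1", 3), ("Task2", 2)]


def run_experiment(agent_cls, curriculum_cls, num_runs, seed):
    """Stand-in for the framework entry point; the real signature may differ."""
    log = []
    for run in range(num_runs):
        agent = agent_cls(seed + run)  # fresh agent per run
        for task, num_episodes in curriculum_cls().blocks():
            for _ in range(num_episodes):
                log.append((run, task, agent.act()))
    return log


if __name__ == "__main__":
    log = run_experiment(RandomAgent, TwoTaskCurriculum, num_runs=2, seed=0)
    print(f"ran {len(log)} episodes")
```

The script itself is the experiment specification: the agent, curriculum, number of runs, and seed are all visible in one place.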

Additional point re JSON specification

See here for the learnkit approach to data-driven curricula, and the JSON files in particular. Note that it specified the task name ($learnkit:sample_classification_tasks.NumberData) so that the task could be loaded and verified as a valid task and its parameters could be checked, which in turn involved dynamic loading. This could be avoided by having a master list of tasks somewhere, but this doesn't scale well. @cash
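To illustrate why a task name in a data file still ends up requiring dynamic loading, here is a generic sketch that resolves a dotted "module.ClassName" string with importlib. This is not learnkit's actual loader, and it does not handle the $learnkit: prefix; resolve_task is a hypothetical name.

```python
import importlib


def resolve_task(dotted_name):
    """Resolve a 'package.module.ClassName' string to the class it names.

    Hypothetical helper, not taken from learnkit.
    """
    module_name, _, class_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_name!r}")
    module = importlib.import_module(module_name)  # imports (and runs) the module
    obj = getattr(module, class_name)
    if not isinstance(obj, type):
        raise TypeError(f"{dotted_name} is not a class")
    return obj
```

Once the class is resolved, its constructor parameters can be checked against the values in the data file, but the import step has already run the module's code.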

cash commented 2 years ago

I need more experience with our curriculums to have an opinion here.

I'm more concerned about an undocumented, implicit configuration schema making validation and creation difficult than about security issues with importing an arbitrary Python module.

Early in development it can make sense to have the flexibility of a full scripting language for configuration. If after a while there are a smallish number of primitives in the configuration, it can be really useful to codify them as a schema and separate out the data from the code. I don't know if that is the case here.

gkvallabha commented 2 years ago

We don't quite know the full range of (lifelong) curricula. This is a pretty novel area, so we are feeling our way through this space ... I don't think the performers have a good idea either at present.

Another possibility (other than separating data from code): the curriculum designer can use some specified APIs (e.g., subclass an abstract class and provide implementations) as building blocks ... it isn't as strongly constrained as a data schema but can still provide some way to ensure the curriculum is put together in a reasonable way (e.g., like specifying a BNF).
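A minimal sketch of that building-block idea, assuming an abstract base class with one method to implement plus a shared validation step. TaskBlock, AbstractCurriculum, and AlternatingCurriculum are hypothetical names, not an existing tella API.

```python
from abc import ABC, abstractmethod


class TaskBlock:
    """Building block: a task name and how many episodes to run it for."""

    def __init__(self, task_name, num_episodes):
        if num_episodes <= 0:
            raise ValueError("num_episodes must be positive")
        self.task_name = task_name
        self.num_episodes = num_episodes


class AbstractCurriculum(ABC):
    """Designers subclass this and implement blocks()."""

    @abstractmethod
    def blocks(self):
        """Yield TaskBlock objects in the order they should run."""

    def validate(self):
        """Check that the curriculum is non-empty and made only of TaskBlocks."""
        blocks = list(self.blocks())
        if not blocks:
            raise ValueError("curriculum has no blocks")
        for b in blocks:
            if not isinstance(b, TaskBlock):
                raise TypeError(f"expected TaskBlock, got {type(b).__name__}")
        return blocks


class AlternatingCurriculum(AbstractCurriculum):
    """Example subclass: Task1 then Task2, repeated twice."""

    def blocks(self):
        for _ in range(2):
            yield TaskBlock("Task1", 100)
            yield TaskBlock("Task2", 200)
```

The designer keeps the full flexibility of Python inside blocks(), while the framework can still reject a malformed curriculum through validate() before running anything.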

lifelong-learning-systems / tella

Add way to load curriculum from file #57