Thanks for the issue, but this sounds more like a research proposal than an issue for gym. Currently, gym is maintained by a group of volunteers with no staff from OpenAI, so I'm not sure how we can help you make this proposal a reality. Personally, I would contact academic researchers in this area, who are likely to be more help than we could be.
Anyone is free to pursue this kind of project, but it's definitely not suitable for the core gym library.
Proposal
Include the Hutter Prize corpus (enwik9) as a "game" for the purpose of sample-efficient reinforcement language modeling.
Motivation
Recently, EfficientZero has demonstrated sample-efficient reinforcement learning on gym environments. The current interest in large language models generally ignores three deficiencies: model size, sample size, and planning. Ignoring model size is particularly egregious because it implicitly ignores the strongest theorem in unsupervised model selection: Solomonoff Induction (i.e., approximating the Algorithmic Information content of the sense data) yields optimal predictions. The sample sizes of LLM corpora are intractable for all but the wealthiest institutions, and the lack of planning ability in LLMs implies a general inability to perform in dynamical environments.
Pitch
Consider a simple "game" that consists of one of 2 moves each time step -- 0 or 1 -- resulting in a positive or negative reinforcement based on whether it predicts the next bit in the enwik9 corpus or not. Obviously, the agent could simply contain the enwik9 corpus and score perfectly. However, this very fact points to the motivation for Solomonoff Induction as part of an AGI that must learn from its environment. Therefore, while the utility function for the agent would remain its hits-misses, the SoTA measure would be the size of the agent itself -- the degree to which it approaches the Solomonoff Induction model (aka the Algorithmic Information content of the agent's observations). While there are bound to be a variety of ways of measuring this size, as well as a variety of ways of incorporating resource utilization (TPUs, CPUs, time, etc.) into any SoTA metric, such is the case with other SoTA measures in the gym.
Such a reinforcement learning agent would define "scientist" in rigorously operational terms and, in approximating the actual information content of Wikipedia, provide epistemological insights into that critical resource.
Finally, as EfficientZero is a sample-efficient derivative of MuZero, and MuZero has inspired the bold hypothesis that "Reward Is Enough", it would be most interesting to see the various approaches to reward-driven science arising from the competitive environment of the gym.
Alternatives
The Hutter Prize for Lossless Compression of Human Knowledge and the Large Text Compression Benchmark were both constructed to advance epistemology through knowledge modeling. However, the field's lack of familiarity with Solomonoff's fundamental finding in AGI, combined with the profligacy of computational resources and data, has led to a situation in which SoTA performance metrics largely elide a foundational component of AGI.
Checklist