Open jaanli opened 6 months ago
Here's something that could help prototype this? https://colab.research.google.com/github/jaanli/language-model-notebooks/blob/main/notebooks/getting-started.ipynb
See also #434, which concerns making "learn" and "ask" functionality available in the magic commands.
Problem
I am always copying and pasting context into large language models to reduce hallucinations and to ground them using techniques such as in-context learning (appending positively and negatively labeled examples to prompts).
This is similar to prompt optimization methods such as those implemented in DSPy (https://github.com/stanfordnlp/dspy).
I currently use this bash script, which claude.ai wrote, to copy the contents of the current directory that I need help with. I use it for a variety of software engineering, machine learning, writing, and research tasks for non-profit and teaching work:
https://gist.github.com/jaanli/5def01b7bd674efd6d9008cf1125986d
Usage of this script:
/usr/local/bin/copy.sh
copy.sh
Proposed Solution
Add a directive or argument called --context, or something like that, that enables reading a JSON object using the type of syntax Ray uses to specify a runtime environment per job (example from https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#specifying-a-runtime-environment-per-job).
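To make the idea concrete, here is a minimal sketch of how a JSON-valued --context argument could be parsed from a magic-command line. It assumes argparse-style parsing purely for illustration; the actual %%ai magic may parse its arguments differently. The function name parse_context_argument, the model name in the sample line, and the "exclude" key are hypothetical.

```python
import argparse
import json
import shlex


def parse_context_argument(line: str) -> dict:
    """Parse a hypothetical --context JSON argument from a magic-command line."""
    parser = argparse.ArgumentParser(prog="%%ai", add_help=False)
    parser.add_argument("model_id")
    # json.loads turns the quoted JSON string into a dict; malformed JSON
    # makes argparse report an invalid argument value.
    parser.add_argument("--context", type=json.loads, default={})
    args = parser.parse_args(shlex.split(line))
    return args.context


# Hypothetical invocation line; "exclude" is an assumed key, not an existing option.
line = """chatgpt --context '{"working_dir": ".", "exclude": ["*.png"]}'"""
context = parse_context_argument(line)
print(context["working_dir"])
```

Passing the whole object as one JSON string keeps the command line to a single extra flag, which mirrors the Ray example linked above.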
The JSON object might support keys like working_dir in a similar manner, to enable the user to pass a gitignore-style list of things to include in, or exclude from, the context copied to the LLM when using the %%ai cell magic.
The JSON object might also support keys like url in a similar manner, so that the contents of documentation pages or patterns that are outside the pre-training data for LLMs (or impossible to access via web-scale datasets due to federal laws like HIPAA or EU laws like GDPR) could be passed as context.
Additional context
I'm happy to help prototype this and have some spare cycles for open source development. This feature would accelerate my work in health equity (https://onefact.github.io/synthetic-healthcare-data/ & https://jaanli.github.io/american-community-survey/new-york-area/income-by-race & https://jaanli.github.io/new-york-real-estate/) and my ability to teach courses where I develop materials like this: https://colab.research.google.com/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb (these are sometimes used as advertising by for-profit companies, e.g. this dataset was reused to advertise motherduck here: https://motherduck.com/blog/introducing-column-explorer/).
Are there any next steps to assess whether such an argument, which would pass a JSON object with local file names and file contents as well as URLs and their plain-text contents, might be feasible?
(This could then be extended to handle URLs that point to PDF files, etc., with standard Python tools!)
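As a rough feasibility sketch, the context could be assembled with the standard library alone. This assumes a spec with the working_dir and url keys described above plus a hypothetical "exclude" list, and it uses fnmatch globbing as a simplified stand-in for real gitignore semantics (negation and directory-anchored patterns are not handled). All function names here are illustrative, not part of any existing API.

```python
import fnmatch
import tempfile
import urllib.request
from pathlib import Path


def collect_files(working_dir, exclude=()):
    """Return one labeled text chunk per readable file under working_dir."""
    root = Path(working_dir)
    chunks = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # Simplified gitignore-style filtering: plain glob match on the relative path.
        if any(fnmatch.fnmatchcase(rel, pattern) for pattern in exclude):
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip files that are not valid UTF-8 text
        chunks.append(f"--- {rel} ---\n{text}")
    return chunks


def fetch_url(url, timeout=10):
    """Download a page and decode it as text (no HTML or PDF handling)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def build_context(spec):
    """Combine local files and URL contents into a single prompt prefix."""
    chunks = []
    if "working_dir" in spec:
        chunks += collect_files(spec["working_dir"], spec.get("exclude", ()))
    for url in spec.get("url", []):
        chunks.append(f"--- {url} ---\n{fetch_url(url)}")
    return "\n\n".join(chunks)


# Usage on a throwaway directory, so nothing outside it is read.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.md").write_text("hello", encoding="utf-8")
    Path(tmp, "debug.log").write_text("noise", encoding="utf-8")
    context_text = build_context({"working_dir": tmp, "exclude": ["*.log"]})
print(context_text)
```

The resulting string could be prepended to the prompt of a %%ai cell. A real implementation would likely also need a size limit so the context fits the model's window.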