This branch is built on top of the branch that improves results for dataset queries; the commits I've added here are aimed at improving results for docs queries.
Changes include:
Converting HTML to Markdown and preprocessing with docs-search functions
Handling API docs separately: they are bloated, so I pick out only the relevant portions, depending on the filepath
Adding "py" to code blocks in preprocessing so that the model is nudged to do the same
Splitting the Markdown only WITHIN subsections, so that continuity is preserved inside each chunk
Adding rules for the docs QA prompt
Using Tiktoken to count tokens in retrieved documents
Updating embeddings to correspond to the latest docs
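
For reviewers who want a feel for two of the preprocessing steps above (tagging code blocks with "py" and splitting only within subsections), here is a minimal sketch. It is not the code in this branch: the function names `tag_bare_fences`, `split_into_subsections`, and `chunk_within_subsections` are hypothetical, and a character budget (`max_chars`) stands in for the Tiktoken-based token count as a simplifying assumption.

```python
import re

FENCE = "`" * 3                       # Markdown code-fence marker
HEADING = re.compile(r"^#{1,6}\s")    # ATX headings: "# ", "## ", ...


def tag_bare_fences(markdown: str, lang: str = "py") -> str:
    """Add a language tag to opening fences that do not have one."""
    out, in_code = [], False
    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            # Only an opening fence with no tag gets the language appended.
            if not in_code and line.strip() == FENCE:
                line = line.rstrip() + lang
            in_code = not in_code
        out.append(line)
    return "\n".join(out)


def split_into_subsections(markdown: str) -> list[str]:
    """Cut the document at headings, ignoring '#' lines inside code blocks."""
    sections, current, in_code = [], [], False
    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_code = not in_code
        if HEADING.match(line) and not in_code and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def chunk_within_subsections(markdown: str, max_chars: int = 1000) -> list[str]:
    """Pack paragraphs into chunks, never letting a chunk span two subsections.

    Simplification: paragraphs are split on blank lines, so a code block
    containing blank lines may be divided; a real splitter should guard that.
    """
    chunks = []
    for section in split_into_subsections(markdown):
        buf = ""
        for para in section.split("\n\n"):
            candidate = para if not buf else buf + "\n\n" + para
            if buf and len(candidate) > max_chars:
                chunks.append(buf)   # budget exceeded: flush and start anew
                buf = para
            else:
                buf = candidate
        if buf:
            chunks.append(buf)       # flush at the subsection boundary
    return chunks
```

Because the buffer is flushed at every subsection boundary, a chunk can be shorter than the budget, but it never mixes text from two subsections, which is the continuity property the change list describes.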
Try out some queries!