langchain-ai / langchain


Standardize Embeddings Docs #24856

Open efriis opened 2 months ago

Issue

To make our Embeddings integrations as easy to use as possible, we need to make sure the docs for them are thorough and standardized. There are two parts to this: updating the embeddings docstrings and updating the actual integration docs.

This needs to be done for each embeddings integration, ideally with one PR per embedding provider.

Related to broader issues #21983 and #22005.

Docstrings

Each Embeddings class docstring should have the sections shown in the Appendix below. The sections should have input and output code blocks when relevant.

To build a preview of the API docs for the package you're working on run (from root of repo):

make api_docs_clean; make api_docs_quick_preview API_PKG=openai

where API_PKG= should be the parent directory that houses the edited package (e.g. community, openai, anthropic, huggingface, together, mistralai, groq, fireworks, etc.). This should be quite fast for all the partner packages.

Doc pages

Each Embeddings docs page should follow this template.

You can use the langchain-cli to quickly get started with a new embeddings integration docs page (run from the root of the repo):

poetry run pip install -e libs/cli
poetry run langchain-cli integration create-doc --name "foo-bar" --name-class FooBar --component-type Embeddings --destination-dir ./docs/docs/integrations/text_embedding/

where --name is the integration package name without the "langchain-" prefix and --name-class is the class name without the "Embeddings" suffix (for example, for OpenAIEmbeddings in langchain-openai you would pass --name "openai" --name-class OpenAI). This will create a template doc with some autopopulated fields at docs/docs/integrations/text_embedding/foo_bar.ipynb.

To build a preview of the docs you can run (from root):

make docs_clean
make docs_build
cd docs/build/output-new
yarn
yarn start

Appendix

Expected sections for the Embeddings class docstring, using the following placeholders:

__package_name__: the full name of the package (e.g., langchain-anthropic)
__ModuleName__: the CamelCase name of the partner (e.g., Anthropic)
__MODULE_NAME__: the SCREAMING_SNAKE_CASE name of the partner (e.g., ANTHROPIC)
__module_name__: the snake_case import name of the package (e.g., langchain_anthropic)


    """__ModuleName__ embedding model integration.

    # TODO: Replace with relevant packages, env vars.
    Setup:
        Install ``__package_name__`` and set environment variable ``__MODULE_NAME___API_KEY``.

        .. code-block:: bash

            pip install -U __package_name__
            export __MODULE_NAME___API_KEY="your-api-key"

    # TODO: Populate with relevant params.
    Key init args - completion params:
        model: str
            Name of __ModuleName__ model to use.

    # TODO: Populate with relevant params.
    Key init args - client params:
        api_key: Optional[SecretStr]

    See full list of supported init args and their descriptions in the params section.

    # TODO: Replace with relevant init params.
    Instantiate:
        .. code-block:: python

            from __module_name__ import __ModuleName__Embeddings

            embed = __ModuleName__Embeddings(
                model="...",
                # api_key="...",
                # other params...
            )

    Embed single text:
        .. code-block:: python

            input_text = "The meaning of life is 42"
            vector = embed.embed_query(input_text)
            print(vector[:3])

        .. code-block:: python

            [-0.024603435769677162, -0.007543657906353474, 0.0039630369283258915]

    Embed multiple texts:
        .. code-block:: python

            input_texts = ["Document 1...", "Document 2..."]
            vectors = embed.embed_documents(input_texts)
            print(len(vectors))
            # The first 3 coordinates for the first vector
            print(vectors[0][:3])

        .. code-block:: python

            2
            [-0.024603435769677162, -0.007543657906353474, 0.0039630369283258915]

    # TODO: Delete if native async isn't supported.
    Async:
        .. code-block:: python

            vector = await embed.aembed_query(input_text)
            print(vector[:3])

            # multiple:
            # await embed.aembed_documents(input_texts)

        .. code-block:: python

            [-0.009100092574954033, 0.005071679595857859, -0.0029193938244134188]
    """

Tip: if you copy and paste the template into a template.txt file, you can use the following sed commands to fill in the appropriate values for OpenAI:

cat template.txt | sed 's/__package_name__/langchain_openai/g' | sed 's/__MODULE_NAME__/OPENAI/g' | sed 's/__ModuleName__/OpenAI/g' | sed 's/__module_name__/langchain_openai/g'
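
If you would rather do the substitution in Python than with sed, a plain str.replace loop works too; this is just a sketch that assumes the template was saved to template.txt and uses the same OpenAI values as above.

from pathlib import Path

# Placeholder -> value mapping for OpenAI (same substitutions as the sed pipeline above).
replacements = {
    "__package_name__": "langchain_openai",
    "__MODULE_NAME__": "OPENAI",
    "__ModuleName__": "OpenAI",
    "__module_name__": "langchain_openai",
}

text = Path("template.txt").read_text()
for placeholder, value in replacements.items():
    text = text.replace(placeholder, value)
print(text)
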
wulifu2hao commented 1 month ago

Running poetry run langchain-cli integration create-doc --name "community" --name-class Ollama --component-type Embeddings --destination-dir ./docs/docs/integrations/text_embedding/ results in

ValueError: Unrecognized component_type='Embeddings'. Expected one of 'ChatModel', 'DocumentLoader', 'Tool'.

Am I missing anything?

efriis commented 1 month ago

Try updating your CLI with:

poetry run pip install -U langchain-cli