langchain-ai / langchainjs

πŸ¦œπŸ”— Build context-aware reasoning applications πŸ¦œπŸ”—
https://js.langchain.com/docs/
MIT License
11.79k stars 1.97k forks source link

Document.pageContent should not allow undefined #5884

Closed glorat closed 1 week ago

glorat commented 1 week ago

Checked other resources

Example Code

import { describe, it, expect } from 'vitest'
import { Document } from '@langchain/core/documents'

describe('langchain Type Compatibility Test', () => {
  it('Document should handle empty pageContent', () => {
    const doc = new Document({ pageContent: '' })
    expect(doc.pageContent).toStrictEqual('')
  })
})

Error Message and Stack Trace (if applicable)

AssertionError: expected undefined to strictly equal '' Expected : Actual :undefined

Description

pageContent should be set to '', as mandated by its type contract string (and not string|undefined)

The bug is visible in the code https://github.com/langchain-ai/langchainjs/blob/main/langchain-core/src/documents/document.ts

export class Document<
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  Metadata extends Record<string, any> = Record<string, any>
> implements DocumentInput, DocumentInterface
{
  pageContent: string;

  metadata: Metadata;

  constructor(fields: DocumentInput<Metadata>) {
    this.pageContent = fields.pageContent
      ? fields.pageContent.toString()
      : this.pageContent;
    this.metadata = fields.metadata ?? ({} as Metadata);
  }
}

Line 32 should check field.pageContent is undefined, not just truthy, otherwise it uses this.pageContent which is undefined and breaks the contract that pageContent must be a string. Thus in any event, pageContent needs to be initialised to something to fulfill the type contract

System Info

This will happen on any environment

dosubot[bot] commented 1 week ago

To ensure that the pageContent property is always set to a string, even when it is undefined, you can modify the constructor of the Document class to default pageContent to an empty string if it is not provided. Here is the corrected code:

export class Document<
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  Metadata extends Record<string, any> = Record<string, any>
> implements DocumentInput, DocumentInterface
{
  pageContent: string;

  metadata: Metadata;

  constructor(fields: DocumentInput<Metadata>) {
    this.pageContent = fields.pageContent !== undefined
      ? fields.pageContent.toString()
      : "";
    this.metadata = fields.metadata ?? ({} as Metadata);
  }
}

This change ensures that pageContent will always be a string, defaulting to an empty string if it is not provided [1].

To continue talking to Dosu, mention @dosu.

glorat commented 1 week ago

@dosu I agree with you and the proposed fix. Let's also have a unit test for it. How about this:

import { test, expect } from "@jest/globals";
import {Document} from "../documents/document.js";

test("Document", async () => {
    const doc = new Document({pageContent: ''})
    expect (doc.pageContent).toEqual('')
});
dosubot[bot] commented 1 week ago

Your proposed unit test looks good. It effectively checks that the pageContent is set to an empty string when an empty string is provided. You might also want to add a test case to ensure that pageContent defaults to an empty string when it is not provided. Here is an extended version of your test:

import { test, expect } from "@jest/globals";
import { Document } from "../documents/document.js";

test("Document should handle empty pageContent", () => {
    const doc = new Document({ pageContent: '' });
    expect(doc.pageContent).toEqual('');
});

test("Document should default pageContent to an empty string when not provided", () => {
    const doc = new Document({});
    expect(doc.pageContent).toEqual('');
});

This ensures that both scenarios are covered.

To continue talking to Dosu, mention @dosu.

glorat commented 1 week ago

Dosu's second test not applicable since DocumentInput does not allow {}

Taken the liberty of making a PR