Confluence loader only returns partial `pageContent`

jazelly commented 3 months ago

Checked other resources

[X] I added a very descriptive title to this issue.
[X] I searched the LangChain.js documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.
[x] I am sure that this is a bug in LangChain.js rather than my code.
[X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import { ConfluencePagesLoader } from "@langchain/community/document_loaders/web/confluence";

const loader = new ConfluencePagesLoader({
  baseUrl: 'xxx',
  spaceKey: 'xxx',
  username: 'xxx',
  accessToken: 'xxx',
});

const docs = await loader.load();

Error Message and Stack Trace (if applicable)

N/A

Description

When I use this code to load a Confluence page that contains SQL code blocks, the returned `pageContent` is missing the contents of those blocks. Only the language label (`sql`) survives, like this:

{
  pageContent: 'VIEW ALL\nsql\nASSIGN TO AN ACCOUNT\nThe account must already exist.\nsql\n'
}

I'm not sure whether this is a limitation (or intended behavior) of the Atlassian API.

System Info

platform: windows
node version: v22.2.0
yarn version: 1.22.22

dosubot[bot] commented 3 months ago

Hey @jazelly! I'm here to help you with any issues you're facing. I can assist with bug fixes, answer questions, and guide you on becoming a contributor. Let me know how I can assist you!

The Confluence loader in LangChain.js returns partial pageContent and redacts SQL code blocks because it converts the HTML content to plain text using the htmlToText function. This conversion process can result in the loss of certain formatting and content, including code blocks ^[1].

To address this issue, you can customize the htmlToText function to preserve code blocks, including SQL code blocks, during the conversion process. Here is an example configuration:

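const { htmlToText } = require('html-to-text');

// Recent html-to-text versions register custom formatters under `formatters`
// and bind them to tags through `selectors`; a `format` key is not read there.
const options = {
  preserveNewlines: true,
  formatters: {
    // Emit the children of the element as their own block.
    codeBlock: function (elem, walk, builder, formatOptions) {
      builder.openBlock({ leadingLineBreaks: 1 });
      walk(elem.children, builder);
      builder.closeBlock({ trailingLineBreaks: 1 });
    }
  },
  selectors: [{ selector: 'pre', format: 'codeBlock' }]
};

const text = htmlToText('<pre><code class="language-sql">SELECT * FROM users;</code></pre>', options);
console.log(text);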

This configuration will preserve the SQL code block during the conversion process ^[2].

Additionally, ensure that the expand parameter is set to body.storage when making the API request to retrieve the full content, including code blocks ^[1].
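For reference, a request with `expand=body.storage` might be built roughly as follows. This is only a sketch: `buildContentUrl` is a hypothetical helper, not part of LangChain.js, and it assumes the `/rest/api/content` endpoint with `spaceKey`, `start`, and `limit` query parameters.

```typescript
// Hypothetical helper (not a LangChain.js API): builds the URL for listing
// the pages of a space with the storage-format body expanded.
function buildContentUrl(
  baseUrl: string,
  spaceKey: string,
  start: number = 0,
  limit: number = 25
): string {
  const pairs: Array<[string, string]> = [
    ["spaceKey", spaceKey],
    // body.storage asks for the raw storage-format body of each page.
    ["expand", "body.storage,version"],
    ["start", String(start)],
    ["limit", String(limit)],
  ];
  const query = pairs
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  // Drop trailing slashes so the path is not doubled up.
  return `${baseUrl.replace(/\/+$/, "")}/rest/api/content?${query}`;
}
```

Without `body.storage` in `expand`, the response would not be expected to carry the page body at all, so this parameter affects whether content is returned, not how code blocks are converted afterwards.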

To continue talking to Dosu, mention @dosu.

jazelly commented 3 months ago

@dosu how do I keep the HTML response instead of the plain-text conversion when using load()? It doesn't look like this is exposed the way it is in the Python version.

dosubot[bot] commented 3 months ago

To preserve the HTML response instead of converting it to plain text when using the load() function in the Confluence loader, you need to modify the createDocumentFromPage method. Here is how you can do it:

Modify the createDocumentFromPage method: Update the method to use the HTML content directly instead of converting it to plain text.

private createDocumentFromPage(page: ConfluencePage): Document {
  // Use the HTML content directly
  const htmlContent = page.body.storage.value;

  // Generate the URL
  const pageUrl = `${this.baseUrl}/spaces/${this.spaceKey}/pages/${page.id}`;

  // Return a langchain document with HTML content
  return new Document({
    pageContent: htmlContent,
    metadata: {
      id: page.id,
      status: page.status,
      title: page.title,
      type: page.type,
      url: pageUrl,
      version: page.version?.number,
      updated_by: page.version?.by?.displayName,
      updated_at: page.version?.when,
    },
  });
}

Ensure the load method uses the updated createDocumentFromPage: The load method will automatically use the updated createDocumentFromPage method to preserve the HTML content.

public async load(options?: {
  start?: number;
  limit?: number;
}): Promise<Document[]> {
  try {
    const pages = await this.fetchAllPagesInSpace(
      options?.start,
      options?.limit
    );
    return pages.map((page) => this.createDocumentFromPage(page));
  } catch (error) {
    console.error("Error:", error);
    return [];
  }
}

By making these changes, the load() function will preserve the HTML response instead of converting it to plain text ^[1].


jazelly commented 3 months ago

I'm pretty sure html-to-text can't support this: the data returned by Atlassian is XML, not HTML, and html-to-text can't parse it properly. I propose using xmldom for the conversion instead.

I think the Python version handles this correctly because it uses BeautifulSoup.
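To illustrate the idea without pulling in an XML parser, here is a rough, regex-based sketch. It assumes the storage format wraps code in an `ac:structured-macro` named `code`, with the language in an `ac:parameter` and the code itself in a CDATA section inside `ac:plain-text-body`. That would explain why only `sql` survives in the output above, but treat the layout as an assumption. `codeMacrosToPre` and `escapeHtml` are hypothetical names, and a real fix would more likely use a proper parser such as xmldom.

```typescript
// Escape characters that would otherwise be read as markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical pre-processing step: rewrite Confluence code macros into plain
// <pre><code> elements so an HTML-to-text converter can see the code.
function codeMacrosToPre(storageXml: string): string {
  // Assumed storage-format layout of a code macro (see the note above).
  const macroRe =
    /<ac:structured-macro\b[^>]*\bac:name="code"[^>]*>([\s\S]*?)<\/ac:structured-macro>/g;
  const langRe =
    /<ac:parameter\b[^>]*\bac:name="language"[^>]*>([^<]*)<\/ac:parameter>/;
  const bodyRe =
    /<ac:plain-text-body>\s*<!\[CDATA\[([\s\S]*?)\]\]>\s*<\/ac:plain-text-body>/;

  return storageXml.replace(macroRe, (whole: string, inner: string): string => {
    const body = bodyRe.exec(inner);
    if (!body) {
      return whole; // Not the shape we expect; leave the macro untouched.
    }
    const lang = langRe.exec(inner);
    const cls = lang ? ` class="language-${lang[1].trim()}"` : "";
    return `<pre><code${cls}>${escapeHtml(body[1])}</code></pre>`;
  });
}
```

Running this over `page.body.storage.value` before the text conversion should let the converter treat the code as an ordinary `<pre>` block. Regexes are brittle here (for example, a CDATA section split around `]]>` would be truncated), which is part of why a real XML parser seems like the better route.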
