huggingface / chat-ui

Open source codebase powering the HuggingChat app
https://huggingface.co/chat
Apache License 2.0

[v0.9.1] Formatting issues while rendering code #1337

Open adhishthite opened 1 month ago

adhishthite commented 1 month ago
[screenshot]

@nsarrazin Whenever I ask chat-ui to explain / generate code, the < does not get rendered correctly. Can you please take a look?

nsarrazin commented 1 month ago

If you still have access, could you send me the raw conversation that shows this behaviour? There's a download button next to user messages in the UI. [screenshot]

evalstate commented 1 week ago

OK. Think I can explain this one, and offer an improvement.

Code blocks in markdown can either be fenced (```html) or indented by 4 spaces.

The issue arises when the LLM responds with a code block that is both fenced AND indented.

In this case I think the correct behaviour is to show a code block, with the fences displayed as part of the code. VSCode and https://markdownlivepreview.com/ do this.
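
To make the fenced-and-indented case concrete, here is a minimal TypeScript sketch that flags fence lines indented by four or more spaces. Under the CommonMark rules a fence may be indented by at most three spaces; beyond that, a line that follows a blank line starts an indented code block and the backticks become literal text. The helper name findFenceLines is made up for this illustration and is not part of chat-ui or marked. It ignores tabs and list nesting, where the threshold shifts with the list item's content indent.

```typescript
// Hypothetical helper (not part of chat-ui or marked): list the fence lines
// in a markdown string and flag those indented too far to act as fences.
interface FenceLine {
	line: number; // 1-based line number in the input
	indent: number; // number of leading spaces
	literal: boolean; // true if the fence would be shown as literal text
}

function findFenceLines(markdown: string): FenceLine[] {
	const result: FenceLine[] = [];
	markdown.split("\n").forEach((text, index) => {
		// Leading spaces followed by at least three backticks or tildes.
		const match = /^( *)(`{3,}|~{3,})/.exec(text);
		if (match) {
			const indent = match[1].length;
			// Four or more spaces: top-level indented code, not a fence.
			result.push({ line: index + 1, indent, literal: indent >= 4 });
		}
	});
	return result;
}

const FENCE = "`".repeat(3);
const sample = [
	FENCE + "html", // plain fenced block
	"<foo />",
	FENCE,
	"",
	"    " + FENCE + "html", // fenced AND indented by 4 spaces
	"    <foo />",
	"    " + FENCE,
].join("\n");

console.log(findFenceLines(sample));
```

A check like this could be used to spot the problematic responses in a saved conversation before deciding how to render them.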

What is happening in Chat-UI seems to be:

While looking into this, I bumped the marked library to 13.0.3 and then to 14.0, to see whether this fix (https://github.com/markedjs/marked/pull/3264) would make a difference; it doesn't. The upgrade changes the interface a little, but the calling code is fairly easy to update.

In the meantime, adding this to ChatMessage turns the &lt; back into < and renders the fences, although the result doesn't benefit from the syntax highlighting and copy-to-clipboard button that <CodeBlock> provides.

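// Render every code token as a plain <pre><code> block, using the token's
// raw markdown so that the fences stay visible.
renderer.code = (code) => {
	return `<pre><code>${sanitizeMd(code.raw)}</code></pre>`;
};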

@nsarrazin - let me know if you want this as a PR. I think it improves the behaviour but isn't a proper fix. In the meantime, I'm going to tweak my system prompt to advise the LLM against indenting code blocks. Ultimately, I think this may be a defect in the marked library?

:( [screenshot]

:) with the additional render hook: [screenshot]

evalstate commented 1 week ago

I'm leaving the previous comment up, but after more testing I am not 100% confident that indents are the only cause. The output below uses <CodeBlock> from list entry 10 onward:

        "updates": [
            {
                "type": "status",
                "status": "started"
            },
            {
                "type": "title",
                "title": "💻 HTML tags"
            },
            {
                "type": "finalAnswer",
                "text": "Certainly! Here are some common HTML tags with a brief description:\n\n1. **`<html>`**: The root element of an HTML page. It encloses all the content of the page.\n\n   ```html\n   <html>\n   </html>\n   ```\n\n2. **`<head>`**: Contains meta-information about the HTML document, like title and links to scripts and stylesheets.\n\n   ```html\n   <head>\n   </head>\n   ```\n\n3. **`<title>`**: Defines the title of the document, shown in the browser's title bar or tab.\n\n   ```html\n   <title>Page Title</title>\n   ```\n\n4. **`<body>`**: Contains the content of the document, such as text, images, and hyperlinks.\n\n   ```html\n   <body>\n   </body>\n   ```\n\n5. **`<h1>` to `<h6>`**: Header tags, `<h1>` is the highest level and `<h6>` is the lowest. They define headings.\n\n   ```html\n   <h1>Main Heading</h1>\n   ```\n\n6. **`<p>`**: Represents a paragraph of text.\n\n   ```html\n   <p>This is a paragraph.</p>\n   ```\n\n7. **`<a>`**: The anchor tag, used to create hyperlinks.\n\n   ```html\n   <a href=\"https://www.example.com\">Visit Example</a>\n   ```\n\n8. **`<img>`**: Embeds an image in the document. It should include the `src` (source) attribute and often an `alt` (alternative text) attribute.\n\n   ```html\n   <img src=\"image.jpg\" alt=\"Description of image\">\n   ```\n\n9. **`<ul>` / `<ol>`**: Unordered (bulleted) and ordered (numbered) list containers, respectively.\n\n   ```html\n   <ul>\n       <li>List item 1</li>\n       <li>List item 2</li>\n   </ul>\n   ```\n\n   ```html\n   <ol>\n       <li>First item</li>\n       <li>Second item</li>\n   </ol>\n   ```\n\n10. **`<li>`**: Represents a list item, used within `<ul>` or `<ol>`.\n\n   ```html\n   <li>A list item</li>\n   ```\n\n11. **`<div>`**: A generic container for content, often used for styling or layout purposes.\n\n   ```html\n   <div>This is a division.</div>\n   ```\n\n12. 
**`<span>`**: A generic inline container, typically used to apply styles or scripts.\n\n   ```html\n   <span style=\"color:blue\">This is a blue text.</span>\n   ```\n\n13. **`<input>`**: Represents an input field in a form, where data can be entered.\n\n   ```html\n   <input type=\"text\" name=\"username\">\n   ```\n\n14. **`<button>`**: Represents a clickable button.\n\n   ```html\n   <button>Click me</button>\n   ```\n\nRemember, these are just foundational tags, and HTML supports many more elements you can learn about as you build more complex pages.",
                "interrupted": false,
                "usage": {
                    "input_tokens": 88,
                    "output_tokens": 691
                }
            }
        ],
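
One possible explanation for the change at entry 10, assuming marked follows the CommonMark list rules here: a block that follows a blank line belongs to a list item only if it is indented at least as far as the text after the item's marker. For "9. " that is 3 columns, so the 3-space-indented fences in the answer above are children of the list item. For "10. " it is 4 columns, so the same fences fall outside the item and become top-level fenced blocks. The TypeScript sketch below only does this arithmetic; the function names are hypothetical, and it does not reflect how marked is implemented.

```typescript
// Hypothetical helpers illustrating the CommonMark rule for ordered lists
// whose marker is followed by a single space; not taken from marked or chat-ui.

// Column at which a list item's content starts, e.g. "9." -> 3, "10." -> 4.
function contentIndent(marker: string): number {
	return marker.length + 1; // marker text plus one following space
}

// After a blank line, a line belongs to the item only if it is indented
// at least as far as the item's content.
function isChildOfItem(marker: string, lineIndent: number): boolean {
	return lineIndent >= contentIndent(marker);
}

const fenceIndent = 3; // the fences in the answer above are indented by 3 spaces
for (const marker of ["1.", "9.", "10.", "14."]) {
	const where = isChildOfItem(marker, fenceIndent)
		? "child of the list item"
		: "top-level block";
	console.log(`${marker} -> ${where}`);
}
```

If this reading is right, it would also fit the rendering difference: only the blocks that end up at the top level are turned into <CodeBlock>.
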
evalstate commented 1 week ago

Here is a snippet that shows the issue:

Code blocks are handled differently when they appear inside lists; asking the LLM via Chat-UI to repeat all or part of the block verbatim shows the behaviour.

The GFM spec recommends using a blank HTML comment to disambiguate indented blocks: https://github.github.com/gfm/#example-288
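
As a rough illustration of that recommendation, the TypeScript sketch below joins a list and a following indented code block with a blank HTML comment between them, so that the indented lines are no longer read as a continuation of the last list item. joinListAndIndentedCode and LIST_TERMINATOR are hypothetical names written for this example, not something chat-ui or marked provides.

```typescript
// Hypothetical helper: put a blank HTML comment between a list and an
// indented code block so that the list ends before the code starts.
const LIST_TERMINATOR = "<!-- -->";

function joinListAndIndentedCode(listLines: string[], codeLines: string[]): string {
	// Indent each code line by four spaces to form an indented code block.
	const indented = codeLines.map((line) => "    " + line);
	// Blank lines on both sides keep the comment a block of its own.
	return [...listLines, "", LIST_TERMINATOR, "", ...indented].join("\n");
}

const markdown = joinListAndIndentedCode(["-   foo"], ["<foo />", "    <bar />"]);
console.log(markdown);
```

This only helps when the markdown is generated or post-processed by something that knows where the list ends; an LLM response would have to include the comment itself.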


## Inside a List

- This is a test (normal fences)

  ```html
  <foo />
  ```

## Outside a List

This is a test (normal fences)

```html
<foo />
```

This is another test (indented block)

    <foo />
        <bar />

This is another test (indents and fences)

    ```
    <foo />
       <bar />
    ```

Test complete

evalstate commented 1 week ago

Final update on this for the moment: the issue also occurs when code blocks are children of lists. In that case the parse(token.raw) call renders the child code block, rather than the block being caught by the type === "code" clause here:

https://github.com/huggingface/chat-ui/blob/97b6feb8b9ed57148e76b11944ace966029ea108/src/lib/components/chat/ChatMessage.svelte#L267-L276

Can't see an obvious quick way to fix this.