nv-morpheus / Morpheus

Morpheus SDK
Apache License 2.0

[BUG]: vdb_upload pipeline should be stripping html tags from content #1666

Closed dagardner-nv closed 2 months ago

dagardner-nv commented 2 months ago

Version

24.06

Which installation method(s) does this occur on?

Source

Describe the bug.

I'm seeing HTML tags such as <p>, <strong>, and <h2>, along with class, id, and target attributes, in the summary field. This is contributing to issue #1650.

Minimum reproducible example

python examples/llm/main.py --log_level=warning vdb_upload pipeline --stop_after=4000 --enable_cache

dagardner-nv commented 2 months ago

It looks like the web scraper stage is stripping tags; however, the tags I'm seeing are coming in from the RSS feed.
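
For illustration, here is a minimal sketch of how tags could be stripped from an RSS summary string before it reaches the vector database. It uses only the Python standard library. `strip_html` and `_TextExtractor` are hypothetical names made up for this example, not existing Morpheus functions, and this is not necessarily how the pipeline should implement the fix.

```python
from html.parser import HTMLParser

# Tags that normally imply a visual break; a space is inserted where they
# occur so that adjacent paragraphs or headings do not get glued together.
_BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects text nodes only; tags and their attributes are discarded."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in _BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in _BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse the whitespace runs left behind where tags were removed.
        return " ".join("".join(self._chunks).split())


def strip_html(summary: str) -> str:
    """Return `summary` with HTML tags and attributes removed (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(summary)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    sample = (
        '<h2 class="title" id="x">Title</h2>'
        '<p>Some <strong>bold</strong> text with a '
        '<a href="https://example.com" target="_blank">link</a>.</p>'
    )
    print(strip_html(sample))
```

Applying something like this to the summary field as the feed entries are read, rather than only in the web scraper stage, would cover the case described above where the markup originates in the RSS feed.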