vectara / vectara-ingest

An open source framework to crawl data sources and ingest into Vectara
https://vectara.com
Apache License 2.0
113 stars 48 forks

`pos_regex` does not behave as expected for indexing Salesforce knowledge base pages #77

Closed mig281 closed 4 months ago

mig281 commented 5 months ago

You give the following example:

urls: ["https://www.hbs.edu"]
pos_regex: [".*hbs.edu.*"]

However, when I try the same as below, I get a grand total of 2 URLs:

urls: ["https://care.qumulo.com"]
pos_regex: [".*care.qumulo.com.*"]
mig281 commented 5 months ago

This other example does not enclose the value for urls in double quotation marks.

urls: [https://sf.gov]
pos_regex: [".*sf.gov.*"]

Could you please give a proper example for subdomains?

When I use the Regular Expressions 101 website, .*care.qumulo.com.* appears to match https://care.qumulo.com/foo-bar-baz. However, the following doesn't do anything (urls value without double quotation marks):

urls: [https://care.qumulo.com]
pos_regex: [".*care.qumulo.com.*"]
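For what it's worth, the pattern can be sanity-checked outside the crawler (a Python sketch, independent of vectara-ingest; note that the unescaped dots match any character, so the pattern is looser than it looks):

```python
import re

pos_regex = r".*care.qumulo.com.*"

# The pattern does match the subdomain URLs...
assert re.search(pos_regex, "https://care.qumulo.com/foo-bar-baz")

# ...but the unescaped dots match ANY character, so unrelated hosts can
# slip through too. Escaping the dots makes the intent explicit.
assert re.search(pos_regex, "https://scare-qumulo-com.example.org")
strict = r".*care\.qumulo\.com.*"
assert not re.search(strict, "https://scare-qumulo-com.example.org")
```

In other words, the regex itself matches these subdomain URLs fine, which suggests the 2-URL result comes from the crawler's filtering rather than the pattern.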
mig281 commented 5 months ago

Previously, the following worked—but it seemed to get way too many URLs:

urls: [https://care.qumulo.com]
url_regex: ["https://care.qumulo.*"]

Simply changing url_regex to pos_regex results in only 2 URLs being found again. 🤦

mig281 commented 5 months ago

Maybe I'm crazy, but the script shouldn't be detecting com/s/ as a file type, should it?

2024-04-05 01:25:08,726 - root - INFO - Starting crawl of type website...
2024-04-05 01:25:11,199 - root - INFO - Found 1 URLs on https://care.qumulo.com/s/
2024-04-05 01:25:11,199 - root - INFO - Collected 1 URLs to crawl and index
2024-04-05 01:25:11,199 - root - INFO - File types = ['com/s/']
2024-04-05 01:25:11,199 - root - INFO - Using 2 ray workers
2024-04-05 01:25:13,122 WARNING services.py:1889 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=9.70gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-04-05 01:25:13,188 INFO worker.py:1642 -- Started a local Ray instance.
2024-04-05 01:25:26,689 - root - INFO - Finished crawl of type website...
(PageCrawlWorker pid=605) INFO:crawlers.website_crawler:Indexing https://care.qumulo.com/s/ was successful

The log above is the output from the following config:

website_crawler:
  urls: [https://care.qumulo.com/s/]
  pos_regex: [".*care.qumulo.com\/s.*"]

I'm giving up for tonight. Send help.
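A guess at what produces the odd File types value (a minimal sketch, assuming the crawler derives a file type by taking everything after the last dot in the URL; the actual vectara-ingest code may differ):

```python
# Hypothetical reproduction of the misdetection: if a crawler derives a
# "file type" by taking everything after the last dot, a URL that has no
# file extension yields the TLD plus the path instead.
url = "https://care.qumulo.com/s/"
naive_type = url.split(".")[-1]
print(naive_type)  # -> com/s/
```

That would explain both `File types = ['com/s/']` and `File types = ['com/']`: extension-less URLs are being misread as having an "extension" that is really the TLD and path.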

ofermend commented 5 months ago

Thanks for reporting. I'm investigating and will get back to you.

ofermend commented 5 months ago

Okay, I think I found the issue and will send out a bugfix after testing. Roughly how many webpages do you expect care.qumulo.com to have?

Note: with website_crawler, the crawler finds all the pages as it scours the site, and is limited to max_depth (default = 3), which means 3 hops from the original page. If your website is more complex, you can always try a larger max_depth, though the crawl may take more time.
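The max_depth behavior described above can be sketched as a depth-limited breadth-first crawl (illustrative only; get_links and the toy link graph below are stand-ins, not vectara-ingest internals):

```python
from collections import deque

def crawl(start, get_links, max_depth=3):
    """Breadth-first crawl limited to max_depth hops from the start URL."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # don't follow links beyond the hop limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Toy link graph: each page links to one page a hop further away.
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
pages = crawl("a", lambda u: graph.get(u, []), max_depth=3)
print(sorted(pages))  # -> ['a', 'b', 'c', 'd']
```

With max_depth=3, page "e" (4 hops out) is never reached, which is why a larger max_depth can surface more pages on a deeply nested site.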

mig281 commented 5 months ago

@ofermend I expect to get 170~180 URLs from care.qumulo.com. The problem isn't only that I'm not getting the correct number of pages, however: restricting the crawl to the care.qumulo.com subdomain with neg_regex does not appear to work, and when I have tried what you suggest, I eventually get all the pages on qumulo.com, which is the opposite of what I wanted.

ofermend commented 5 months ago

Ok, I just merged in a fix. Can you please try to pull the latest branch and see if it fixes the issue for you? I've been using this config locally:

website_crawler:
  urls: [https://care.qumulo.com]
  pos_regex: ["https://care.qumulo.com/.*"]
  delay: 1
  max_depth: 3
  pages_source: crawl
  extraction: playwright

mig281 commented 5 months ago

@ofermend No go. I got the latest code and this is my config:

vectara:
  corpus_id: 4
  customer_id: <redacted>
  reindex: true

crawling:
  crawler_type: website

website_crawler:
  urls: [https://care.qumulo.com]
  pos_regex: ["https://care.qumulo.com/.*"]
  pages_source: crawl
  extraction: playwright
  delay: 1
  max_depth: 3
  ray_workers: 2

Here's the output:

$ docker logs -f vingest
2024-04-05 19:16:07,243 - root - INFO - Starting crawl of type website...
2024-04-05 19:16:10,100 - root - INFO - collected 2 URLs so far
2024-04-05 19:16:12,082 - root - INFO - Found 2 URLs on https://care.qumulo.com
2024-04-05 19:16:12,082 - root - INFO - Collected 1 URLs to crawl and index
2024-04-05 19:16:12,082 - root - INFO - File types = ['com/']
2024-04-05 19:16:12,083 - root - INFO - Using 2 ray workers
2024-04-05 19:16:14,101 WARNING services.py:1889 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=9.69gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-04-05 19:16:14,264 INFO worker.py:1642 -- Started a local Ray instance.
2024-04-05 19:16:26,804 - root - INFO - Finished crawl of type website...
(PageCrawlWorker pid=642) INFO:crawlers.website_crawler:Indexing https://care.qumulo.com/ was successful

Please note the weird File types = ['com/'] bit. It also bears mentioning that the original ingest command threw an error but ran anyway.

$ ./run.sh config/qumulo-care-v3.yaml default
unknown flag: --build-arg
See 'docker --help'.

[...a bunch of Docker help text]

6f4100439eef565187a2a2c47d2e2d52fee823e6f1ddf15659f6aebf6df96938
Success! Ingest job is running.
You can try 'docker logs -f vingest' to see the progress.
ofermend commented 5 months ago

Okay, two things:

  1. Just to make sure: you pulled the latest from GitHub, on the "main" branch?
  2. If the docker build failed, then of course nothing will really change, so I'm not surprised. Let's debug the "unknown flag: --build-arg" issue; I think once that works, you'll be running the latest code and we'll really see the change. Which version of Docker are you using? What does "docker version" return?
mig281 commented 5 months ago

@ofermend Here's the output from docker version:

$ docker version
/nix/store/rm1hz1lybxangc8sdl7xvzs5dcvigvf7-bash-4.4-p23/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
Client:
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        633a0ea838f10e000b7c6d6eed1623e6e988b5bc
 Built:             Sat Jun 13 11:21:02 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          24.0.5
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.3
  Git commit:       24.0.5-0ubuntu1~22.04.1
  Built:            Mon Aug 21 19:50:14 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.2
  GitCommit:        
 runc:
  Version:          1.1.7-0ubuntu1~22.04.2
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit: 
mig281 commented 5 months ago

@ofermend We did find another issue which might take precedence over this one: when our content on care.qumulo.com was migrated from Zendesk to Salesforce, the actual content of the pages stopped appearing in the page source itself. It all seems to be generated on the fly by JS; here's an example: https://care.qumulo.com/s/article/Getting-Started-with-Qumulo 🤦

I wonder whether I should be using a different crawler so that vectara-ingest can actually "see" the same content a browser would see...or whether this sort of functionality is available at all. 😬
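One quick way to check whether a page's content lives in the static HTML at all is a rough heuristic (a sketch, not part of vectara-ingest): strip scripts, styles, and tags, and see how much visible text is left.

```python
import re

def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: strip <script>/<style> blocks and remaining tags; if
    little visible text is left, the page likely builds its content
    with JavaScript and needs a real browser (e.g. playwright) to index."""
    body = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", body)
    return len(" ".join(text.split())) < min_text_chars

static_page = "<html><body><p>" + "Real article text. " * 20 + "</p></body></html>"
spa_shell = "<html><body><div id='app'></div><script>window.render()</script></body></html>"
print(looks_js_rendered(static_page))  # -> False
print(looks_js_rendered(spa_shell))    # -> True
```

If the Salesforce pages come back looking like the spa_shell case, a browser-based extraction mode (the configs in this thread already use extraction: playwright) is the right direction, since it renders the JS before the text is extracted.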

ofermend commented 5 months ago

Hey @mig281

mig281 commented 5 months ago

@ofermend I'm definitely up to date with Docker, and playwright has always been used, to my knowledge. Would you please shoot me an email at mkhmelnitsky AT qumulo so we can have a live debug session tomorrow?

ofermend commented 5 months ago

@mig281 - okay to close this issue now that we've addressed it?

ofermend commented 4 months ago

Closing for now, as this should be fixed.