Failed to index websites

tpmccallum commented 7 months ago

I have a local instance of OpenChat running on Ubuntu with an API key of a paid account (added $50). Some websites are indexed correctly, but larger websites always fail after taking half an hour or so to scan all of the pages. I am getting the following error in the logs:

 error: 'Payload error: JSON payload (65897203 bytes) is larger than allowed (limit: 33554432 bytes).'

I updated ./backend-server/app/Http/Listeners/StartRecursiveCrawler.php to increase the maximum number of pages it can index, so all good there. It just does not seem to be able to finish the task after crawling all of the pages.
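
For readers hitting the same error: the message reports a single JSON payload of 65897203 bytes against a limit of 33554432 bytes (32 MiB). A minimal sketch of one possible workaround follows, assuming the crawled pages can be sent to the indexing service in several smaller requests instead of one. The function name `batch_by_json_size` and the record shape are hypothetical and are not part of OpenChat; only the byte limit comes from the log line above.

```python
import json
from typing import Any, Iterable, Iterator

# Limit quoted in the error message above (32 MiB).
MAX_PAYLOAD_BYTES = 33_554_432


def batch_by_json_size(
    records: Iterable[dict[str, Any]],
    max_bytes: int = MAX_PAYLOAD_BYTES,
) -> Iterator[list[dict[str, Any]]]:
    """Yield lists of records whose default json.dumps encoding fits in max_bytes.

    Hypothetical helper: the record shape is an assumption, not OpenChat's format.
    """
    batch: list[dict[str, Any]] = []
    size = 2  # the enclosing "[" and "]" of the JSON array
    for record in records:
        encoded = len(json.dumps(record).encode("utf-8"))
        if encoded + 2 > max_bytes:
            raise ValueError("a single record exceeds the payload limit")
        # Every record after the first also costs a ", " separator.
        extra = encoded + (2 if batch else 0)
        if batch and size + extra > max_bytes:
            yield batch
            batch, size, extra = [], 2, encoded
        batch.append(record)
        size += extra
    if batch:
        yield batch
```

Each yielded batch could then be posted as its own request. Whether the receiving service accepts partial uploads is an assumption that would need checking against the actual backend.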

Any ideas would be greatly appreciated.

Also, how do I know what plan I am on? I added $50 of credit, but I do not know which plan I am on or what rate limits apply to me.

Thanks so much, Tim

codebanesr commented 7 months ago

You can try using Selenium Grid with Celery. We have an implementation in the OpenCopilot project.


tpmccallum commented 7 months ago

Thanks so much for the response, @codebanesr. Can you please provide some links or instructions on using Selenium Grid with Celery via the OpenCopilot project you are referring to? I am not even sure where to begin. Any documentation or extra info would be greatly appreciated.

Also, I just noticed that the rate limit docs say I will have to wait 7 days after paying the $50 before the increased rate limit takes effect. Is that correct? Please see the reference to the 7 days in the link below.

https://platform.openai.com/docs/guides/rate-limits?context=tier-two

codebanesr commented 7 months ago

You can find the web crawler code, which utilizes Selenium, here: OpenCopilot Web Crawler

To invoke the Celery worker: OpenCopilot Celery Worker

For details on the Selenium container, please check the following Docker Compose file: OpenCopilot Docker Compose

Please note, these components are already integrated into OpenCopilot.

tpmccallum commented 7 months ago

Hi @codebanesr, thanks for your help over on https://github.com/openchatai/OpenCopilot/issues/321. I have localhost:8000 running, but it is unclear to me how I would actually index websites by passing in their URLs. I looked at the Celery and Web Crawler files that you linked to above, but I am missing something. How do I scrape web content by passing in URLs? I apologize that I don't know how to use Selenium Grid with Celery as you suggested in the comments above. I really liked how OpenChat just lets me paste in a few URLs and then goes and indexes those pages. What am I missing when using OpenCopilot and Selenium Grid with Celery?
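
For readers with the same question, the general shape of "paste in a few URLs and index those pages" can be sketched as a bounded breadth-first crawl. This is an illustrative stand-in that uses only the Python standard library. It is not OpenChat's or OpenCopilot's crawler, which per this thread relies on Selenium and Celery. The names `crawl`, `fetch_html`, and `max_pages` are hypothetical, and the same-host restriction is an assumption made to keep the crawl bounded.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url: str) -> str:
    """Download one page as text (plain HTTP, no JavaScript rendering)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(
    seed_urls: Iterable[str],
    max_pages: int = 50,
    fetch: Callable[[str], str] = fetch_html,
) -> dict[str, str]:
    """Breadth-first crawl from the seeds, staying on the seeds' hosts."""
    seeds = [urldefrag(u).url for u in seed_urls]
    allowed_hosts = {urlparse(u).netloc for u in seeds}
    queue = deque(seeds)
    seen = set(seeds)
    pages: dict[str, str] = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to load
        pages[url] = html
        parser = _LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl(["https://example.com/"], max_pages=20)` would return a mapping from URL to raw HTML for up to 20 pages. Pages that need a browser to render would still require something like the Selenium setup mentioned earlier in the thread.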

codebanesr commented 7 months ago

In a day or two, OpenCopilot will be just as performant, if not better. This is where you can upload your documents in OpenCopilot. Please let me know if you need any clarification. https://i.postimg.cc/hGR0h80s/Screenshot-2023-11-30-at-3-47-04-AM.png

openchatai / OpenChat

Failed to index websites #208