FileFormRecPollingPDF - An error occurred - code: 200

abraster commented 1 month ago

We get an indexing error when uploading documents to the system.

Error: error

FileFormRecPollingPDF - An error occurred - code: 200 -
Resource punkt_tab not found. Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt_tab')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt_tab/english/
Searched in:
- '/home/nltk_data'
- '/usr/local/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'

Most of the books in Bulgarian cannot be indexed. We tried to index one book more than 5 times, and every time it failed with the same error. Yesterday we scaled the App Service plan up from S2 to P2 for one hour to test whether that would help, but it failed again with the same error.
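Editor's sketch: the error above lists the directories NLTK searched for `tokenizers/punkt_tab`. A quick way to confirm whether the resource exists in any of them is to check the filesystem directly. The helper name `find_resource_dirs` is hypothetical, and the directory list is simply copied from the error message (duplicates removed); it assumes NLTK accepts either an unpacked folder or a `.zip` package in those locations.

```python
import os

# Directories copied from the "Searched in" list of the error message (deduplicated).
SEARCH_DIRS = [
    "/home/nltk_data",
    "/usr/local/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/local/lib/nltk_data",
    "/usr/share/nltk_data",
    "/usr/lib/nltk_data",
]


def find_resource_dirs(resource="tokenizers/punkt_tab", search_dirs=SEARCH_DIRS):
    """Return the search directories that contain the resource.

    Hypothetical helper: it looks for an unpacked folder or a zipped
    package with the same name, without importing NLTK.
    """
    found = []
    for base in search_dirs:
        unpacked = os.path.join(base, *resource.split("/"))
        zipped = unpacked + ".zip"
        if os.path.isdir(unpacked) or os.path.isfile(zipped):
            found.append(base)
    return found
```

Running `find_resource_dirs()` inside the function container and getting an empty list would be consistent with the error above; a non-empty list would suggest the problem lies elsewhere.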

MYSdeployment commented 1 month ago

We also tried splitting one document into 5 parts, and the result was the same (parts 1 to 4 are 51 MB and part 5 is 13 MB):

And again we are receiving the same error.

MYSdeployment commented 1 month ago

Any updates regarding our case?

dayland commented 1 month ago

What version are you deploying?

There was a recent issue in Unstructured.io with NLTK version dependencies (https://github.com/Unstructured-IO/unstructured/issues/3511), but our latest release, v1.2, should be up to date with compatible versions of both.

If that is not working, I suggest trying punkt_tab instead of punkt in https://github.com/microsoft/PubSec-Info-Assistant/blob/1038af555477f43b5ebc095e609f2ee5b0c09a4d/functions/shared_code/utilities.py#L17 due to recent changes in NLTK (https://github.com/nltk/nltk/issues/3266#issuecomment-2284001819).

Please let us know if this resolves your error and we will update accordingly.

MYSdeployment commented 1 month ago

Hello,

We made a new deployment with the latest updates on v1.2, but we still have the issue with indexing some books.

Could you please advise how to use punkt_tab instead of punkt? Do we need to delete punkt, or does something else need to be done?

Thank you in advance!

MYSdeployment commented 4 weeks ago

Hello,

Do we have any updates?

Thank you.

bjakems commented 4 weeks ago

Please update the following line of code in this file, from "nltk.download('punkt')" to "nltk.download('punkt_tab')". After the update is made, please run "make deploy-functions" to deploy your updated function code to your function app. Please note: if you deployed with secure mode enabled, establish connectivity first (using a VPN, etc.) before deploying, as the infrastructure is network restricted.

MYSdeployment commented 4 weeks ago

Hello,

The update was made successfully:

Now our books have been in the queued state for more than 30 minutes.

I will check again the state tomorrow morning and provide you with feedback.

MYSdeployment commented 4 weeks ago

Hello,

All books are still in queued status.

I deleted them and am now trying to index them again, but when I upload 5 files, I see only 3 with status "uploaded" in the upload status tab.

The files have been like this for more than 30 minutes.

Please help us to fix this.

MYSdeployment commented 4 weeks ago

I waited one hour and then resubmitted, and the status of those three books has been "queued" for more than 2 hours:

bjakems commented 3 weeks ago

Please navigate to your Azure Function App in the Azure Portal. Ensure the Azure Function App is running and all the Azure Functions are deployed and enabled. If all of those assets look OK, please use the workbook to view the logs for any errors as defined here

MYSdeployment commented 3 weeks ago

Hello,

Please be informed that all Functions are enabled:

Could you please advise how to fix this issue?

dayland commented 3 weeks ago

I would recommend debugging the functions locally in VSCode to determine the root cause error. https://github.com/microsoft/PubSec-Info-Assistant/blob/main/docs/function_debug.md

MYSdeployment commented 3 weeks ago

Hello,

This is the result from debugging:

[2024-10-17T10:08:49.460Z] Executed 'Functions.FileFormRecSubmissionPDF' (Failed, Id=135a81d4-4cba-41b5-bb5c-358f84dab57f, Duration=9ms)
[2024-10-17T10:08:49.460Z] System.Private.CoreLib: Exception while executing function: Functions.FileFormRecSubmissionPDF. System.Private.CoreLib: Result: Failure
[2024-10-17T10:08:49.460Z] Exception: Exception: Failed to download 'punkt' package
[2024-10-17T10:08:49.460Z] Stack: File "/usr/lib/azure-functions-core-tools-4/workers/python/3.10/LINUX/X64/azure_functions_worker/dispatcher.py", line 479, in _handle__function_load_request
[2024-10-17T10:08:49.460Z] func = loader.load_function(
[2024-10-17T10:08:49.460Z] File "/usr/lib/azure-functions-core-tools-4/workers/python/3.10/LINUX/X64/azure_functions_worker/utils/wrappers.py", line 44, in call
[2024-10-17T10:08:49.460Z] return func(*args, **kwargs)
[2024-10-17T10:08:49.460Z] File "/usr/lib/azure-functions-core-tools-4/workers/python/3.10/LINUX/X64/azure_functions_worker/loader.py", line 214, in load_function
[2024-10-17T10:08:49.460Z] mod = importlib.import_module(fullmodname)
[2024-10-17T10:08:49.460Z] File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
[2024-10-17T10:08:49.460Z] return _bootstrap._gcd_import(name[level:], package, level)
[2024-10-17T10:08:49.460Z] File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[2024-10-17T10:08:49.460Z] File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
[2024-10-17T10:08:49.461Z] File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
[2024-10-17T10:08:49.461Z] File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
[2024-10-17T10:08:49.461Z] File "<frozen importlib._bootstrap_external>", line 883, in exec_module
[2024-10-17T10:08:49.461Z] File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[2024-10-17T10:08:49.461Z] File "/workspaces/PubSec-Info-Assistant/functions/FileFormRecSubmissionPDF/__init__.py", line 13, in <module>
[2024-10-17T10:08:49.461Z] from shared_code.utilities import Utilities
[2024-10-17T10:08:49.461Z] File "/workspaces/PubSec-Info-Assistant/functions/shared_code/utilities.py", line 31, in <module>
[2024-10-17T10:08:49.461Z] raise Exception("Failed to download 'punkt' package")
[2024-10-17T10:08:49.461Z] .

Could you please advise how to fix this?

Thank you in advance!

MYSdeployment commented 3 weeks ago

Hello,

Do we have any updates?

Thanks,

MYSdeployment commented 3 weeks ago

Hello,

Do we have any updates?

Thanks,

bjakems commented 3 weeks ago

Hello,

I am currently investigating this issue.
In the meantime, please ensure the punkt_tab package download is not being blocked by your networking layer. The code attempts to download the package from the public nltk GitHub repo.

Will you please provide further details to reproduce this issue:
- Document type: From your screenshot, it appears this is a PDF.
- Document content: Please don't provide the exact content, but is all the content in Bulgarian?

Thanks,
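Editor's sketch: one way to test whether the networking layer blocks the NLTK download is to request the downloader's index file from the same environment the function runs in. The URL below is, to my knowledge, the default index the NLTK downloader uses on GitHub; treat it as an assumption and adjust it if your NLTK version points elsewhere. The function name `can_reach` and its `opener` parameter are hypothetical, added so the check can be exercised without real network access.

```python
import urllib.error
import urllib.request

# Assumed default index of the NLTK downloader (hosted on GitHub).
NLTK_INDEX_URL = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml"


def can_reach(url=NLTK_INDEX_URL, timeout=10, opener=urllib.request.urlopen):
    """Return (ok, detail) describing whether `url` answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            return status == 200, f"HTTP {status}"
    except urllib.error.HTTPError as exc:
        # The server was reached but refused or could not serve the request.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, blocked egress, proxy problems, timeouts, and so on.
        return False, f"unreachable: {exc}"
```

Calling `can_reach()` from the Function App console (or from the dev container when debugging locally) and getting `(False, "unreachable: ...")` would point at a firewall, proxy, or private-network restriction rather than at NLTK itself.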

bjakems commented 3 weeks ago

The punkt_tab tokenizer doesn't support the Bulgarian language; please see here for supported languages.

For Bulgarian language support, you will need to train a tokenizer for your language or switch to another tokenizer that supports Bulgarian.
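Editor's sketch: training a Punkt model for Bulgarian would go through NLTK's `PunktTrainer` class and needs a sizeable Bulgarian text corpus. As a simpler stand-in for "another tokenizer", the example below is a naive rule-based sentence splitter, not Punkt and not what the project ships. The function name, the regular expression, and the short abbreviation list are all illustrative assumptions; real Bulgarian text would need a longer list.

```python
import re

# Split after ., !, ? or … when followed by whitespace and an uppercase
# Latin or Cyrillic letter (or an opening quote or bracket).
_BOUNDARY = re.compile(r'(?<=[.!?…])\s+(?=[A-ZА-Я„"«(])')

# Illustrative abbreviations that should not end a sentence.
ABBREVIATIONS = frozenset({"г.", "гр.", "ул.", "др.", "проф.", "стр."})


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Naive sentence splitter for Bulgarian or English text."""
    sentences = []
    for piece in _BOUNDARY.split(text.strip()):
        if not piece:
            continue
        # Re-join a piece when the previous one ended with an abbreviation.
        if sentences and sentences[-1].split()[-1].lower() in abbreviations:
            sentences[-1] = sentences[-1] + " " + piece
        else:
            sentences.append(piece)
    return sentences
```

This trades accuracy for zero dependencies: it will mis-split on abbreviations missing from the list and on sentences that start with a digit or a lowercase letter.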

MYSdeployment commented 3 weeks ago

Hello,

We also have problems indexing books that are fully in English.

Could you please advise how to ensure the punkt_tab package download is not being blocked by the networking layer?

Could you please advise which tokenizer we can use for Bulgarian, or how to train a tokenizer?

Thank you in advance!

MYSdeployment commented 3 weeks ago

This is an example of an English book which cannot be indexed:

MYSdeployment commented 3 weeks ago

Hello,

Can we schedule a meeting to troubleshoot this and fix it?

Thanks,

MYSdeployment commented 2 weeks ago

Hello,

Aydin advised me to switch back to punkt. When I tried to deploy, I received the error below:

5a93c7575f95efa7] module.functions.azurerm_linux_function_app.function_app: Still modifying... [id=/subscriptions/69945115-588c-49d7-b8fa-...icrosoft.Web/sites/infoasst-func-jlxtq, 30s elapsed]
module.functions.azurerm_linux_function_app.function_app: Modifications complete after 36s [id=/subscriptions/69945115-588c-49d7-b8fa-d6df86685294/resourceGroups/infoasst-digitallibrary2/providers/Microsoft.Web/sites/infoasst-func-jlxtq]
╷
│ Error: updating App Service Plan (Subscription: "69945115-588c-49d7-b8fa-d6df86685294"
│ Resource Group Name: "infoasst-digitallibrary2"
│ Server Farm Name: "infoasst-enrichmentasp-jlxtq"): performing CreateOrUpdate: unexpected status 409 (409 Conflict) with response: {"Code":"Conflict","Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group.","Target":null,"Details":[{"Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group."},{"Code":"Conflict"},{"ErrorEntity":{"ExtendedCode":"03029","MessageTemplate":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group.","Parameters":[],"Code":"Conflict","Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group."}}],"Innererror":null}
│
│ with module.enrichmentApp.azurerm_service_plan.appServicePlan,
│ on core/host/enrichmentapp/enrichmentapp.tf line 2, in resource "azurerm_service_plan" "appServicePlan":
│ 2: resource "azurerm_service_plan" "appServicePlan" {
│
│ updating App Service Plan (Subscription: "69945115-588c-49d7-b8fa-d6df86685294"
│ Resource Group Name: "infoasst-digitallibrary2"
│ Server Farm Name: "infoasst-enrichmentasp-jlxtq"): performing CreateOrUpdate: unexpected status 409 (409 Conflict) with response: {"Code":"Conflict","Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group.","Target":null,"Details":[{"Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group."},{"Code":"Conflict"},{"ErrorEntity":{"ExtendedCode":"03029","MessageTemplate":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group.","Parameters":[],"Code":"Conflict","Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group."}}],"Innererror":null}
╵
make: *** [Makefile:22: infrastructure] Error 1

Could you please assist?

Thank you.

bjakems commented 2 weeks ago

Please try restarting your enrichment App Service. It seems there is a capacity issue. If that does not resolve the issue, please submit a Microsoft Azure support ticket.
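Editor's sketch: the 409 response above asks the caller to retry later. If the deployment is scripted, a generic retry with exponential backoff is one way to do that without manual babysitting. Everything here is hypothetical (the function name, the default delays, and the injectable `sleep`); `operation` stands for whatever callable re-runs your deployment step, for example a `subprocess.run(..., check=True)` call.

```python
import time


def retry_with_backoff(operation, attempts=5, base_delay=30.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` tries are used up.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    tries and re-raises the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:  # in real use, catch only the errors worth retrying
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Retrying only helps with transient capacity shortages; if the 409 persists across several attempts, the restart or support-ticket route above is the better path.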

MYSdeployment commented 1 week ago

Hello,

Currently we are using this tokenizer:

And en-US as default language:

We are facing the below mentioned errors:

Also, please be informed that after changing from punkt to punkt_tab, we tried to upload a new book (a PDF of around 300 pages, in English), but the book has stayed in this status for more than one hour:

Could you please advise how to resolve them?

bjakems commented 1 week ago

Please pull the latest from main and deploy the code updates. We introduced a hotfix that has both punkt and punkt_tab. Furthermore, if an item is stuck in pending processing, please delete it using the UI.

If the issue still exists after the above, please reply back with as many details as possible to reproduce the error (do not include sensitive information or your file).
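Editor's sketch: the hotfix is described as having both punkt and punkt_tab. The snippet below shows one way such a check could look; it is not the project's actual code. `ensure_tokenizer_data` and its `find` / `download` parameters are hypothetical names; by default they resolve to `nltk.data.find` and `nltk.download`, and they are injectable so the logic can be tested without NLTK or network access.

```python
def ensure_tokenizer_data(names=("punkt", "punkt_tab"), find=None, download=None):
    """Make sure each tokenizer package is available, downloading if missing.

    Returns the list of package names that could not be obtained.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested without NLTK

        find = find or nltk.data.find
        download = download or nltk.download

    failed = []
    for name in names:
        try:
            # Raises LookupError when the package is not installed locally.
            find(f"tokenizers/{name}")
        except LookupError:
            # The downloader returns a falsy value when the download fails.
            if not download(name, quiet=True):
                failed.append(name)
    return failed
```

Reporting the packages that failed, instead of raising at import time as in the stack trace earlier in the thread, lets the caller decide whether a missing `punkt` matters when `punkt_tab` is present.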

microsoft / PubSec-Info-Assistant

FileFormRecPollingPDF - An error occurred - code: 200 #871