Open abraster opened 1 month ago
We also tried splitting one document into 5 parts, and the result was the same (parts 1 to 4 are 51 MB and part 5 is 13 MB):
And again we are receiving the same error.
Any updates regarding our case?
What version are you deploying?
There was a recent issue in Unstructured.io with NLTK version dependencies (https://github.com/Unstructured-IO/unstructured/issues/3511), but our latest release, v1.2, should be up to date with compatible versions of both.
If that is not working, I suggest trying punkt_tab instead of punkt in https://github.com/microsoft/PubSec-Info-Assistant/blob/1038af555477f43b5ebc095e609f2ee5b0c09a4d/functions/shared_code/utilities.py#L17, due to recent changes in NLTK (https://github.com/nltk/nltk/issues/3266#issuecomment-2284001819).
Please let us know if this resolves your error and we will update accordingly.
Hello,
We made a new deployment with the latest updates on v1.2, but we still have the issue with indexing some books.
Could you please advise how to use punkt_tab instead of punkt? Do we need to delete punkt, or does something else need to be done?
Thank you in advance!
Hello,
Do we have any updates?
Thank you.
Please update the following line of code in this file, from "nltk.download('punkt')" to "nltk.download('punkt_tab')". After making the update, run "make deploy-functions" to deploy the updated function code to your function app. Please note: if you deployed with secure mode enabled, establish connectivity first (for example, using a VPN) before deploying, as the infrastructure is network restricted.
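A minimal sketch of the change described above, assuming the goal is to stop early when the download does not succeed. `nltk.download` returns True or False; the helper name `ensure_nltk_packages` and its injectable `download` parameter are hypothetical, introduced only so the logic can be exercised without network access. This is not the repository's actual code.

```python
from typing import Callable, Iterable, List


def ensure_nltk_packages(
    download: Callable[[str], bool],
    packages: Iterable[str] = ("punkt_tab",),
) -> List[str]:
    """Download each package and return the names that succeeded.

    `download` is expected to behave like `nltk.download`: it takes a
    package name and returns True on success, False on failure.
    Raises RuntimeError if any requested package could not be fetched.
    """
    succeeded, failed = [], []
    for name in packages:
        (succeeded if download(name) else failed).append(name)
    if failed:
        raise RuntimeError(
            f"Failed to download NLTK package(s): {', '.join(failed)}"
        )
    return succeeded


# Intended use inside the function code (requires nltk and network access):
#   import nltk
#   ensure_nltk_packages(nltk.download)
```

Passing `("punkt_tab", "punkt")` as `packages` would request both resources, which may help if different NLTK versions are in play.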
Hello,
The update was made successfully:
Our books have now been in the queued state for more than 30 minutes.
I will check the state again tomorrow morning and provide you with feedback.
Hello,
All books are still in queued status.
I deleted them and am now trying to index them again, but when I upload 5 files, the upload status tab shows only 3 with status uploaded.
The files have been in this state for more than 30 minutes.
Please help us to fix this.
I waited one hour and then resubmitted, and the status of those three books has been queued for more than 2 hours:
Please navigate to your Azure Function App in the Azure Portal. Ensure the Azure Function App is running and all the Azure Functions are deployed and enabled. If all of those assets look OK, please use the workbook to view the logs for any errors as defined here
Hello,
Please be informed that all Functions are enabled:
Could you please advise how to fix this issue?
I would recommend debugging the functions locally in VSCode to determine the root cause error. https://github.com/microsoft/PubSec-Info-Assistant/blob/main/docs/function_debug.md
Hello,
This is the result from debugging:
[2024-10-17T10:08:49.460Z] Executed 'Functions.FileFormRecSubmissionPDF' (Failed, Id=135a81d4-4cba-41b5-bb5c-358f84dab57f, Duration=9ms)
[2024-10-17T10:08:49.460Z] System.Private.CoreLib: Exception while executing function: Functions.FileFormRecSubmissionPDF. System.Private.CoreLib: Result: Failure
[2024-10-17T10:08:49.460Z] Exception: Exception: Failed to download 'punkt' package
[2024-10-17T10:08:49.460Z] Stack: File "/usr/lib/azure-functions-core-tools-4/workers/python/3.10/LINUX/X64/azure_functions_worker/dispatcher.py", line 479, in _handle__function_load_request
[2024-10-17T10:08:49.460Z] func = loader.load_function(
[2024-10-17T10:08:49.460Z] File "/usr/lib/azure-functions-core-tools-4/workers/python/3.10/LINUX/X64/azure_functions_worker/utils/wrappers.py", line 44, in call
[2024-10-17T10:08:49.460Z] return func(*args, **kwargs)
[2024-10-17T10:08:49.460Z] File "/usr/lib/azure-functions-core-tools-4/workers/python/3.10/LINUX/X64/azure_functions_worker/loader.py", line 214, in load_function
[2024-10-17T10:08:49.460Z] mod = importlib.import_module(fullmodname)
[2024-10-17T10:08:49.460Z] File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
[2024-10-17T10:08:49.460Z] return _bootstrap._gcd_import(name[level:], package, level)
[2024-10-17T10:08:49.460Z] File "
Could you please advise how to fix this?
Thank you in advance!
Hello,
Do we have any updates?
Thanks,
Hello,
Do we have any updates?
Thanks,
Hello,
I am currently investigating this issue.
In the meantime, please ensure the punkt_tab package download is not being blocked by your networking layer. The code attempts to download the package from the public nltk GitHub repo.
Will you please provide further details to reproduce this issue:
Document type: From your screenshot, it appears this is a PDF.
Document content: Please don't provide the exact content, but is all the content in Bulgarian?
Thanks,
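One way to check whether the download is being blocked is to request the NLTK package index from the same environment the function runs in. The sketch below assumes the downloader reads its index from the `raw.githubusercontent.com` URL shown (believed to be NLTK's default, but worth confirming against your installed version); `can_reach` and its `opener` parameter are hypothetical names used for illustration.

```python
import urllib.request
from typing import Callable

# Assumption: NLTK's downloader fetches its package index from this URL by default.
NLTK_INDEX_URL = (
    "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml"
)


def can_reach(
    url: str = NLTK_INDEX_URL,
    timeout: float = 10.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Return True if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses,
        # so DNS failures, firewall resets, and proxy denials land here.
        return False
```

If `can_reach()` returns False from the function app's network but True from an unrestricted machine, the networking layer is the likely cause.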
The punkt_tab tokenizer doesn't support the Bulgarian language; please see here for supported languages.
For Bulgarian language support, you will need to train a tokenizer for your language or switch to another tokenizer that supports Bulgarian.
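As a rough illustration of the "switch to another tokenizer" option, the sketch below is a naive rule-based sentence splitter that handles both Latin and Cyrillic text. It is a stand-in, not NLTK's Punkt algorithm and not part of this project; it assumes the tokenizer is only needed for sentence splitting, and the name `split_sentences` is hypothetical. It will split incorrectly after abbreviations, which a trained tokenizer would handle better.

```python
import re
from typing import List

# Split after sentence-ending punctuation when it is followed by whitespace
# and an uppercase Latin or Cyrillic letter, or an opening quote/bracket.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?…])\s+(?=[A-ZА-Я"„«(])')


def split_sentences(text: str) -> List[str]:
    """Return the sentences in `text`, with surrounding whitespace removed."""
    return [
        part.strip()
        for part in _SENTENCE_BOUNDARY.split(text)
        if part.strip()
    ]
```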
Hello,
We also have a problem indexing books that are entirely in English.
Could you please advise how to ensure the punkt_tab package download is not being blocked by the networking layer?
Could you please advise which tokenizer we can use for Bulgarian, or how to train a tokenizer?
Thank you in advance!
This is an example of an English book that cannot be indexed:
Hello,
Can we schedule a meeting to troubleshoot this and fix it?
Thanks,
Hello,
As Aydin advised, I reverted to punkt, but when I tried to deploy I received the following error:
5a93c7575f95efa7]
module.functions.azurerm_linux_function_app.function_app: Still modifying... [id=/subscriptions/69945115-588c-49d7-b8fa-...icrosoft.Web/sites/infoasst-func-jlxtq, 30s elapsed]
module.functions.azurerm_linux_function_app.function_app: Modifications complete after 36s [id=/subscriptions/69945115-588c-49d7-b8fa-d6df86685294/resourceGroups/infoasst-digitallibrary2/providers/Microsoft.Web/sites/infoasst-func-jlxtq]
╷
│ Error: updating App Service Plan (Subscription: "69945115-588c-49d7-b8fa-d6df86685294"
│ Resource Group Name: "infoasst-digitallibrary2"
│ Server Farm Name: "infoasst-enrichmentasp-jlxtq"): performing CreateOrUpdate: unexpected status 409 (409 Conflict) with response: {"Code":"Conflict","Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group.","Target":null,"Details":[{"Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group."},{"Code":"Conflict"},{"ErrorEntity":{"ExtendedCode":"03029","MessageTemplate":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group.","Parameters":[],"Code":"Conflict","Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group."}}],"Innererror":null}
│
│ with module.enrichmentApp.azurerm_service_plan.appServicePlan,
│ on core/host/enrichmentapp/enrichmentapp.tf line 2, in resource "azurerm_service_plan" "appServicePlan":
│ 2: resource "azurerm_service_plan" "appServicePlan" {
│
│ updating App Service Plan (Subscription: "69945115-588c-49d7-b8fa-d6df86685294"
│ Resource Group Name: "infoasst-digitallibrary2"
│ Server Farm Name: "infoasst-enrichmentasp-jlxtq"): performing CreateOrUpdate: unexpected status 409 (409 Conflict) with response: {"Code":"Conflict","Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group.","Target":null,"Details":[{"Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group."},{"Code":"Conflict"},{"ErrorEntity":{"ExtendedCode":"03029","MessageTemplate":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group.","Parameters":[],"Code":"Conflict","Message":"No available instances to satisfy this request. App Service is attempting to increase capacity. Please retry your request later. If urgent, this can be mitigated by deploying this to a new resource group."}}],"Innererror":null}
╵
make: *** [Makefile:22: infrastructure] Error 1
Could you please assist?
Thank you.
Please try restarting your enrichment App Service. It seems there is a capacity issue. If that does not resolve the issue, please submit a Microsoft Azure support ticket.
Hello,
Currently we are using this tokenizer:
And en-US as default language:
We are facing the following errors:
Also, please note that after changing from punkt to punkt_tab, we tried to upload a new book (a PDF of around 300 pages, in English), but it has stayed in this status for more than one hour:
Could you please advise how to resolve them?
Please pull latest on main and deploy the code updates. We introduced a hotfix that has both punkt and punkt_tab. Furthermore, if there's an item that is stuck in pending processing, please delete this using the UI.
If the issue still exists after the above, please reply back with as many details as possible to reproduce the error (do not include sensitive information or your file).
We get an indexing error when uploading documents to the system.
Error:
FileFormRecPollingPDF - An error occurred - code: 200 - **
Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt_tab')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt_tab/english/
Searched in:
- '/home/nltk_data'
- '/usr/local/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**
Most of the books in Bulgarian cannot be indexed. We tried to index one book more than 5 times, and it fails every time with the same error. Yesterday we scaled the App Service plan up from S2 to P2 for 1 hour to test whether that would help, but it failed again with the same error.
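To narrow down a "Resource punkt_tab not found" error like the one above, it can help to check which of the searched directories actually contain the resource on the machine running the function. The sketch below uses only the standard library; `locate_resource` is a hypothetical helper, and it only looks for unpacked directories, whereas NLTK can also read zipped packages.

```python
import os
from typing import Dict, Iterable

# Directories from the "Searched in" list of the error message, de-duplicated.
SEARCH_PATHS = (
    "/home/nltk_data",
    "/usr/local/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/local/lib/nltk_data",
    "/usr/share/nltk_data",
    "/usr/lib/nltk_data",
)

# Relative location NLTK reported it was trying to load.
RESOURCE = os.path.join("tokenizers", "punkt_tab", "english")


def locate_resource(
    search_paths: Iterable[str] = SEARCH_PATHS,
    resource: str = RESOURCE,
) -> Dict[str, bool]:
    """Map each search path to whether it contains the resource directory."""
    return {
        base: os.path.isdir(os.path.join(base, resource))
        for base in search_paths
    }
```

If every value is False, the package was never downloaded to a location NLTK searches. Calling `nltk.download('punkt_tab', download_dir='/home/nltk_data')` targets one of the listed directories explicitly, assuming that path is writable in your deployment.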