kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Significant amounts of timeouts while using threading on Grobid Docker Service #939

Open matthieu-perso opened 2 years ago

matthieu-perso commented 2 years ago

Configuration

Problem

What could be the reason the service times out so quickly? Are there any workarounds if I want all requests to complete?

Code (for the local instance; the cloud one is identical except for the url and token)

import concurrent.futures
import glob
import time

import requests

start_time = time.time()

def requesting(url, index):
    '''Sends one PDF to the GROBID service and returns the TEI result with its index.'''
    cloud_token = ""
    headers = {
        'Authorization': f"bearer {cloud_token}"}

    # Open the PDF in a context manager so the file handle is closed after the request.
    with open(url, 'rb') as pdf_file:
        files = {
            'input': pdf_file}
        response = requests.post('http://localhost:8070/api/processFulltextDocument', files=files, headers=headers)
    return response.text, index

def main():
    filelist = glob.glob('./download/unpacked/**/*.pdf', recursive=True)

    # Submit one request per PDF to a pool of 5 worker threads.
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as thread_pool:
        futures = []
        for index, url in enumerate(filelist):
            futures.append(thread_pool.submit(requesting, url, index))

    # Write each TEI response to its own file as the futures complete.
    for future in concurrent.futures.as_completed(futures):
        data, index = future.result()
        with open(f'thread_{index}.xml', 'w') as f:
            f.write(data)

if __name__ == '__main__':
    main()
    print("--- %s seconds ---" % (time.time() - start_time))
kermitt2 commented 2 years ago

Hello @MatthieuMoullecDev!

Thank you for your interest in Grobid and for reporting the issue.

You can use the Grobid Python client, which is very well tested and has been able to scale to 12M PDFs. Without managing server availability (503 responses) you will certainly get these timeouts, but the Python client manages them for you.
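For reference, a minimal sketch of driving the Python client, assuming the grobid_client_python package with its usual config.json; the exact parameter names (for example n for concurrency) may differ between client versions, so check the client README:

from grobid_client.grobid_client import GrobidClient

# config.json points the client at the Grobid server (e.g. http://localhost:8070)
client = GrobidClient(config_path="./config.json")

# Process a whole directory of PDFs; the client handles 503 responses by waiting and retrying
# instead of failing the request.
client.process("processFulltextDocument",
               "./download/unpacked",   # input directory of PDFs
               output="./tei_output",   # directory where the TEI XML files are written
               n=5)                     # client-side concurrency, keep aligned with the server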

Then the main adaptation to avoid timeouts is in the server settings. You can have a look at the FAQ entry on the topic here. Two important aspects, based on your description, are the amount of RAM and the number of threads. The thread settings in the client and in the Grobid server need to be aligned with the number of threads actually available on the server.
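As an illustration, a hedged sketch of the server-side knobs when running the Docker image; the configuration key, mount path, image name and tag shown here are indicative and depend on the Grobid version, so check the configuration documentation for your release:

# grobid.yaml (excerpt): cap server-side concurrency to the threads really available
grobid:
  concurrency: 8        # maximum number of parallel requests the server will accept

# Run the Docker image with more memory and the adjusted config mounted in
docker run --rm -p 8070:8070 --memory=8g \
    -v ./grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro \
    lfoppiano/grobid:0.7.2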

matthieu-perso commented 2 years ago

Hey Patrice,

Thanks for your quick and helpful reply!

I saw the Python client but was struggling with an error that I managed to debug (write-up here). I will give it a go.

Thanks for the link to the production FAQ; I will follow these guidelines and go from there.