unoconv / unoserver

MIT License

Free up CPU resources #64

Closed Djaiff closed 1 year ago

Djaiff commented 1 year ago

Hi, firstly I'd like to thank you for this amazing module. I'm trying to use it in a program that converts any kind of document whose extension matches the regex r'(?i)^\.(odt|odp|ods|doc|ppt|xls)(x)?$'. My program works, but after a while it becomes slower and slower. So I'm trying to set a timeout to prevent overly long conversions, but it doesn't seem to work. Could you tell me whether there is a way to detect a conversion that takes too long and, in that case, reset the server to prevent it from consuming all CPU resources?

Thanks in advance.

Jeff


This is my code:

import os
import shutil
import signal
import subprocess
from pathlib import Path

try:
  pdf_target = Path(out_dir, '{}.pdf'.format(src_file.stem))
  cmd = [shutil.which('unoconvert'), '--convert-to', 'pdf',
         src_file, pdf_target]

  # useful documentation about timeouts:
  # https://alexandra-zaharia.github.io/posts/kill-subprocess-and-its-children-on-timeout-python/
  # start_new_session=True puts the child in its own process group,
  # so the whole group can be signalled at once later
  p = subprocess.Popen(cmd, start_new_session=True,
                       stdout=subprocess.PIPE,
                       stderr=subprocess.PIPE)
  exit_code = p.wait(timeout=self.TIMEOUT)
  self.LOGGER.debug('Exit code: {}'.format(exit_code))
  if pdf_target.is_file():
      return pdf_target
  else:
      self.LOGGER.warning('Error occurred while trying '
                          'to convert file {} into '
                          'PDF format. '
                          'Raw file will be uploaded to '
                          'the pdf directory.'
                          .format(src_file))

except (TimeoutError, subprocess.TimeoutExpired):
  self.LOGGER.warning('TimeoutExpired while trying to convert '
                      'file {} into PDF format. '
                      'Raw file will be uploaded to '
                      'the pdf directory.'
                      .format(src_file))
  # ask the whole process group to terminate...
  os.killpg(os.getpgid(p.pid), signal.SIGTERM)
  try:
      p.wait(timeout=5)
  except subprocess.TimeoutExpired:
      # ...and brute-kill it if it does not comply
      os.killpg(os.getpgid(p.pid), signal.SIGKILL)
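
For reference, the kill-on-timeout pattern above can be factored into a small standalone helper. This is a POSIX-only sketch: `run_with_timeout` is a hypothetical name, and `sleep` stands in for `unoconvert` so the example is self-contained.

```python
import os
import signal
import subprocess

def run_with_timeout(cmd, timeout):
    """Run cmd in its own process group; on timeout, terminate the whole
    group (SIGTERM first, then SIGKILL after a short grace period).
    Returns the exit code, or None if the command timed out."""
    p = subprocess.Popen(cmd, start_new_session=True,
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    try:
        return p.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        pgid = os.getpgid(p.pid)
        os.killpg(pgid, signal.SIGTERM)
        try:
            p.wait(timeout=5)       # grace period for a clean exit
        except subprocess.TimeoutExpired:
            os.killpg(pgid, signal.SIGKILL)
            p.wait()                # reap the child to avoid a zombie
        return None

# 'sleep' stands in for 'unoconvert' here:
print(run_with_timeout(['sleep', '0.1'], timeout=5))  # 0 (finished in time)
print(run_with_timeout(['sleep', '60'], timeout=1))   # None (timed out)
```

Note that this only kills the client-side `unoconvert` process group; the LibreOffice instance behind unoserver keeps running and may still need to be restarted separately if it is the part that is stuck.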
regebro commented 1 year ago

We usually push new versions of our app every few days, and then the servers are restarted. That might be why we haven't noticed any such issues. It does seem like LibreOffice never releases memory, so if your server gets low on memory, that could be a cause.

There are problems with some documents never finishing, but that doesn't seem to be the case here?

Djaiff commented 1 year ago

Many thanks for your quick answer! Hmm, it may indeed be a case of a conversion that never finishes. But in that case, how can I force a timeout? I'm going to make very intensive use of the server (a few thousand conversions per day), and I have to process a global catch-up of the past two years of data, which will probably represent terabytes of files. Could you tell me the minimum resources I should reserve for the server? For now I'm using Docker containers, with the limit fixed at 2 CPUs / 4 GB RAM. Is that enough in your opinion?
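
For what it's worth, the limits mentioned above map onto standard `docker run` resource flags (a configuration sketch; the image name is a placeholder):

```shell
# Cap the container at 2 CPUs and 4 GB of RAM. Setting --memory-swap equal
# to --memory disables swap, so a LibreOffice that leaks memory is OOM-killed
# rather than left thrashing. 'my-unoserver-image' is a placeholder.
docker run --cpus=2 --memory=4g --memory-swap=4g my-unoserver-image
```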

regebro commented 1 year ago

I don't know how much is needed at all, and I also have no idea how to monitor Docker containers. We run it on virtual machines, so if it used up all the memory we would see that on our monitoring.

If you can find an example document that gets stuck that would be great, then I could do some tests.

Djaiff commented 1 year ago

Hi @regebro. I'm very sorry for answering with such a delay. So far I have been unable to extract any document that failed to convert, but I'll keep track of this discussion and will send you examples in the future that may help you improve the converter. For now, as you suggested in another thread, I put a timeout (60 seconds) on the conversion and manually restart the process in my container if it gets stuck. Regards, Jeff
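
The timeout-plus-restart workaround described above can be sketched as a small supervisor. All names here are hypothetical, and `sleep` commands stand in for unoserver/unoconvert so the sketch is self-contained and runnable:

```python
import subprocess

class ServerSupervisor:
    """Restart a long-running server process whenever a conversion times
    out. In real use, server_cmd would start unoserver and convert_cmd
    would run unoconvert."""

    def __init__(self, server_cmd):
        self.server_cmd = server_cmd
        self.proc = None
        self.restarts = 0
        self.start()

    def start(self):
        self.proc = subprocess.Popen(self.server_cmd)

    def restart(self):
        if self.proc.poll() is None:   # still running: stop it first
            self.proc.terminate()
            self.proc.wait()
        self.restarts += 1
        self.start()

    def convert(self, convert_cmd, timeout=60):
        """Run one conversion; on timeout, kill it and restart the server."""
        try:
            subprocess.run(convert_cmd, timeout=timeout)
            return True
        except subprocess.TimeoutExpired:
            # subprocess.run() already killed the conversion process
            self.restart()
            return False

# Demo with stand-in commands:
sup = ServerSupervisor(['sleep', '300'])         # "server"
ok = sup.convert(['sleep', '0.1'], timeout=5)    # fast "conversion"
slow = sup.convert(['sleep', '30'], timeout=1)   # stuck "conversion"
print(ok, slow, sup.restarts)                    # True False 1
sup.proc.terminate()                             # clean up the demo server
```

A container orchestrator's health check can play the same role: let the conversion wrapper exit on repeated timeouts and have the container runtime restart it.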