sylikc / pyexiftool

PyExifTool (active PyPI project) - A Python library to communicate with an instance of Phil Harvey's ExifTool command-line application. Runs one process with special -stay_open flag, and pipes data to/from. Much more efficient than running a subprocess for each command!

ValueError: filedescriptor out of range in select() #97

Open vpv-csc opened 1 week ago

vpv-csc commented 1 week ago

We are doing digital preservation. In some cases we are scraping metadata from thousands of image files in the same python process. As far as I understand, pyexiftool handles multiple files in the -stay_open mode. We are seeing the ValueError: filedescriptor out of range in select() error a lot in production.

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 812, in run
    self._ver = self._parse_ver()
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 1199, in _parse_ver
    return self.execute("-ver").strip()
  File "/usr/lib/python3.9/site-packages/exiftool/helper.py", line 132, in execute
    result: Union[str, bytes] = super().execute(*str_bytes_params, **kwargs)
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 1009, in execute
    raw_stdout = _read_fd_endswith(fdout, seq_ready.encode(self._encoding), self._block_size)
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 142, in _read_fd_endswith
    inputready, outputready, exceptready = select.select([fd], [], [])
ValueError: filedescriptor out of range in select()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/check-sip-digital-objects-3", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3.9/site-packages/ipt/scripts/check_sip_digital_objects.py", line 39, in main
    report = validation_report(
  File "/usr/lib/python3.9/site-packages/ipt/scripts/check_sip_digital_objects.py", line 448, in validation_report
    for result in validation(mets_path=mets_path, catalog_path=catalog_path):
  File "/usr/lib/python3.9/site-packages/ipt/scripts/check_sip_digital_objects.py", line 356, in validation
    yield _validate(metadata_info)
  File "/usr/lib/python3.9/site-packages/ipt/scripts/check_sip_digital_objects.py", line 328, in _validate
    scraper_result, streams, grade = check_well_formed(
  File "/usr/lib/python3.9/site-packages/ipt/scripts/check_sip_digital_objects.py", line 174, in check_well_formed
    (mime, version) = scraper.detect_filetype()
  File "/usr/lib/python3.9/site-packages/file_scraper/scraper.py", line 242, in detect_filetype
    self._identify()
  File "/usr/lib/python3.9/site-packages/file_scraper/scraper.py", line 77, in _identify
    self._update_filetype(exiftool_detector)
  File "/usr/lib/python3.9/site-packages/file_scraper/scraper.py", line 89, in _update_filetype
    tool.detect()
  File "/usr/lib/python3.9/site-packages/file_scraper/detectors.py", line 339, in detect
    with exiftool.ExifToolHelper() as et:
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 317, in __enter__
    self.run()
  File "/usr/lib/python3.9/site-packages/exiftool/helper.py", line 150, in run
    super().run()
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 816, in run
    raise ExifToolVersionError(f"Error retrieving Exiftool info.  Is your Exiftool version ('exiftool -ver') >= required version ('{constants.EXIFTOOL_MINIMUM_VERSION}')?")
exiftool.exceptions.ExifToolVersionError: Error retrieving Exiftool info.  Is your Exiftool version ('exiftool -ver') >= required version ('12.15')?

If you happen to be interested in the check-sip-digital-objects(-3) command seen in the backtrace, that's here: https://github.com/Digital-Preservation-Finland/dpres-ipt. Our scraping tool is here: https://github.com/Digital-Preservation-Finland/file-scraper/

We are running version 0.5.5 that we packaged ourselves. It seems that 0.5.6 does not change anything related to this issue. Exiftool is 12.70.

Someone has reported this same issue here earlier: https://exiftool.org/forum/index.php?topic=11067.0

man 2 select says

WARNING: select() can monitor only file descriptors numbers that are less than FD_SETSIZE (1024)—an unreasonably low limit for many modern applications—and this limitation will not change. All modern applications should instead use poll(2) or epoll(7), which do not suffer this limitation.
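
To make the limit quoted above concrete, here is a minimal sketch (not part of pyexiftool) that asks Python's select.select() whether it will accept a given descriptor number. The helper name select_can_monitor is made up for illustration. The sketch assumes CPython on a POSIX system, where out-of-range numbers are rejected with the ValueError seen in the traceback before select(2) is ever called.

```python
import os
import select


def select_can_monitor(fd):
    """Return True if select.select() accepts this descriptor number."""
    try:
        # Timeout 0 means "check once and return", so this never blocks.
        select.select([fd], [], [], 0)
    except ValueError:
        # Raised for negative numbers and for numbers at or above FD_SETSIZE,
        # before the select(2) system call is made.
        return False
    except OSError:
        # The number is in range but no such descriptor is open (EBADF).
        return True
    return True


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    print(read_end, select_can_monitor(read_end))
    # 1023 and 1024 straddle the FD_SETSIZE value quoted from the man page.
    for number in (1023, 1024, 4096):
        print(number, select_can_monitor(number))
    os.close(read_end)
    os.close(write_end)
```

On a typical Linux build with FD_SETSIZE at 1024, the numbers 1024 and 4096 should be rejected while 1023 is still accepted. So the error depends on the descriptor's number, not on how many files ExifTool itself is asked to process.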

sylikc commented 1 week ago

I see, it's the same issue that someone else had reported... but without a traceback showing how it got there, it wasn't something I was ready to look at.

So with the man 2 select does that imply you're running a lot of processes at the same time? Like using many file descriptors across multiple processes on the same machine?

I will consider looking into poll and epoll... I see the Python select module supports them, but it's not a straightforward drop-in replacement... and we'd probably have to run some benchmarks to see if the whole call chain gets slower. select.select is pretty fast AFAIK
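
For illustration, a poll-based wait does not have to be much longer than the select call. The sketch below is only an assumption about how such a read loop could look. It is not pyexiftool's actual _read_fd_endswith and not necessarily what any fix would use, and read_until and its parameters are hypothetical names. select.poll is not available on Windows, so a real change would still need a separate code path there.

```python
import os
import select


def read_until(fd, sentinel, block_size=4096):
    """Read from fd until the collected bytes end with sentinel, or until EOF."""
    poller = select.poll()  # POSIX only; takes fd numbers, not a fixed-size bitmask
    poller.register(fd, select.POLLIN)
    output = b""
    while not output.endswith(sentinel):
        # Block until the descriptor is readable, hung up, or in error.
        poller.poll()
        chunk = os.read(fd, block_size)
        if not chunk:
            # Empty read means the writer closed its end of the pipe.
            break
        output += chunk
    return output


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    os.write(write_end, b"12.70\n{ready}")
    print(read_until(read_end, b"{ready}"))
    os.close(read_end)
    os.close(write_end)
```

Because poll is handed a list of descriptor numbers instead of filling an fd_set, the same loop should keep working when the pipe happens to get a number of 1024 or higher. Whether it is slower than select.select for a single descriptor would still have to be measured.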

vpv-csc commented 5 days ago

> So with the man 2 select does that imply you're running a lot of processes at the same time? Like using many file descriptors across multiple processes on the same machine?

Yes. We often have thousands of image files in a single Submission Information Package (it's either a zip or a tar file with metadata in an XML file).

I was not able to reproduce this on my workstation. I created a directory with 4096 JPEG files and ran pyexiftool in a loop, one file at a time (i.e. not giving it a list of filenames), the way our file-scraper uses it, but that did not trigger the issue. It might be that you have to be (un)lucky enough to get an FD >= 1024 for this to happen.

Our ability to test in production is somewhat limited but we'll see what we can do. I think we could count FDs and list the largest FD numbers while pyexiftool is running.
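
Counting descriptors and listing the largest numbers could be sketched roughly as follows. This assumes Linux, where /proc/<pid>/fd has one entry per open descriptor. The function names open_fd_numbers and fd_summary are invented for this example.

```python
import os


def open_fd_numbers(pid="self"):
    """Return the sorted descriptor numbers listed in /proc/<pid>/fd (Linux only)."""
    # When pid is "self", the listing also contains the short-lived descriptor
    # that os.listdir() uses to read the directory.
    return sorted(int(name) for name in os.listdir(f"/proc/{pid}/fd"))


def fd_summary(pid="self", top=5):
    """Summarise how many descriptors are open and which numbers are largest."""
    fds = open_fd_numbers(pid)
    return {"count": len(fds), "largest": fds[-top:]}


if __name__ == "__main__":
    print(fd_summary())
```

Passing the PID of another process works only with permission to read its /proc entry. A largest number at or above 1024 would be consistent with the select() error.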

jukuisma commented 5 days ago

@vpv-csc if you can reproduce this please try running:

$ strace -fo strace.out python3 test.py

And check that all opened FDs are properly closed. I can't reproduce this locally and all FDs are closed properly.
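
As a complement to strace, a leak can also be looked for from inside the Python process by comparing the set of open descriptors before and after a piece of code runs. This is a rough sketch assuming Linux and a single-threaded program. current_fds and fd_leak_check are hypothetical helper names, not part of pyexiftool or file-scraper.

```python
import os
from contextlib import contextmanager


def current_fds():
    """Return the set of descriptors currently open in this process (Linux only)."""
    fds = set()
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        try:
            os.fstat(fd)
        except OSError:
            # Already closed again: the descriptor os.listdir() used itself.
            continue
        fds.add(fd)
    return fds


@contextmanager
def fd_leak_check():
    """Yield a set that is filled with descriptors still open after the block."""
    leaked = set()
    before = current_fds()
    try:
        yield leaked
    finally:
        leaked.update(current_fds() - before)


if __name__ == "__main__":
    with fd_leak_check() as leaked:
        handle = open(os.devnull)  # deliberately left open inside the block
    print(sorted(leaked))
    handle.close()
```

Wrapping the per-file scraping call in fd_leak_check() and logging any non-empty result would show whether descriptor numbers creep upward over a long run.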

jukuisma commented 5 days ago

Still can't reproduce the bug in our specific use case, but I can reproduce it with:

$ ulimit -n 8192
$ mkdir files
$ for i in {0..4096}; do echo $i > files/$i; done
$ python3
>>> import exiftool
>>> exiftool.ExifToolHelper().get_metadata("files/1")
[{'SourceFile': 'files/1', 'ExifTool:ExifToolVersion': 12.7, 'File:FileName': 1, 'File:Directory': 'files', 'File:FileSize': 2, 'File:FileModifyDate': '2024:09:19 10:36:18+00:00', 'File:FileAccessDate': '2024:09:23 10:46:32+00:00', 'File:FileInodeChangeDate': '2024:09:19 10:36:18+00:00', 'File:FilePermissions': 100644, 'File:FileType': 'TXT', 'File:FileTypeExtension': 'TXT', 'File:MIMEType': 'text/plain', 'File:MIMEEncoding': 'us-ascii', 'File:Newlines': '\n', 'File:LineCount': 1, 'File:WordCount': 1}]
>>> files = []
>>> for i in range(4096):
...     files.append(open(f"files/{i}"))
... 
>>> exiftool.ExifToolHelper().get_metadata("files/1")
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 812, in run
    self._ver = self._parse_ver()
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 1199, in _parse_ver
    return self.execute("-ver").strip()
  File "/usr/lib/python3.9/site-packages/exiftool/helper.py", line 132, in execute
    result: Union[str, bytes] = super().execute(*str_bytes_params, **kwargs)
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 1009, in execute
    raw_stdout = _read_fd_endswith(fdout, seq_ready.encode(self._encoding), self._block_size)
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 142, in _read_fd_endswith
    inputready, outputready, exceptready = select.select([fd], [], [])
ValueError: filedescriptor out of range in select()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.9/site-packages/exiftool/helper.py", line 293, in get_metadata
    return self.get_tags(files, None, params=params)
  File "/usr/lib/python3.9/site-packages/exiftool/helper.py", line 378, in get_tags
    ret = self.execute_json(*exec_params)
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 1127, in execute_json
    result = self.execute("-j", *params)  # stdout
  File "/usr/lib/python3.9/site-packages/exiftool/helper.py", line 120, in execute
    self.run()
  File "/usr/lib/python3.9/site-packages/exiftool/helper.py", line 150, in run
    super().run()
  File "/usr/lib/python3.9/site-packages/exiftool/exiftool.py", line 816, in run
    raise ExifToolVersionError(f"Error retrieving Exiftool info.  Is your Exiftool version ('exiftool -ver') >= required version ('{constants.EXIFTOOL_MINIMUM_VERSION}')?")
exiftool.exceptions.ExifToolVersionError: Error retrieving Exiftool info.  Is your Exiftool version ('exiftool -ver') >= required version ('12.15')?

jukuisma commented 3 days ago

Fixed here: https://github.com/sylikc/pyexiftool/pull/98.