ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-combine file counter #149

Closed tboenig closed 5 years ago

tboenig commented 5 years ago

Hi, I got the following error message: hocr-combine: error: argument files: can't open 'hocr_out/9341474.html': [Errno 24] Too many open files: 'hocr_out/9341474.html'

There are 1308 hOCR files to be merged with hocr-combine. The file 9341474.html is the 1022nd hOCR file. As a workaround, I first merged 1022 files and then added the rest.

Since it is not unusual to have more than 1000 files, I would suggest increasing the counter.

kba commented 5 years ago

"Too many open files" means you exceeded the OS limit on open file descriptors.

Try raising the limit with ulimit -n in the shell before running hocr-combine (ulimit -u limits the number of user processes, not open files).

However, the script should not open 1000 files without closing them; this smells like a bug.
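For reference, the limit can also be inspected from Python. The sketch below uses the standard `resource` module (Unix only); the helper names `open_file_limits` and `raise_soft_limit` are made up for illustration and are not part of hocr-tools. A common default soft limit on Linux is 1024, which, with stdin, stdout and stderr already open, would be consistent with the failure at the 1022nd file, though that is an assumption about the reporter's system.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(wanted):
    """Raise the soft limit towards `wanted`, never above the hard limit.

    Returns the soft limit that is in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    if soft == resource.RLIM_INFINITY or wanted <= soft:
        return soft  # nothing to do, the current limit is already high enough
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return wanted


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", soft, "hard limit:", hard)
```

Raising the limit only hides the symptom; a script that keeps every input open would still fail on a large enough batch.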

stweil commented 5 years ago

It looks like an inherent problem of argparse.FileType, which opens all filenames from the command line. Only hocr-combine allows any number of filenames, so the other scripts don't have that problem.
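A minimal sketch of one way around this, assuming the arguments are taken as plain strings and each file is opened only while it is being read. This is not the actual hocr-combine code; `parse_args` and `iter_contents` are illustrative names.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="combine several files (sketch)")
    # Plain strings instead of type=argparse.FileType('r'):
    # nothing is opened while the command line is parsed.
    parser.add_argument("filenames", nargs="+", help="files to combine")
    return parser.parse_args(argv)


def iter_contents(filenames):
    """Yield (name, text) pairs, keeping at most one file open at a time."""
    for name in filenames:
        with open(name, encoding="utf-8") as handle:
            yield name, handle.read()


if __name__ == "__main__":
    args = parse_args()
    for name, text in iter_contents(args.filenames):
        print(name, len(text))
```

With this shape the number of simultaneously open descriptors stays constant, no matter how many file names are passed.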

stweil commented 5 years ago

@tboenig, please try the latest code. It should work now with large numbers of files, too.

tboenig commented 5 years ago

Thanks @stweil for debugging. At first hocr-combine worked, but it does not work in my shell scripts. Here is my environment:

  1. Order file, which lists the files to combine:

    hocr_out/8834993.html
    hocr_out/8834994.html
  2. Shell command: hocr-tools/hocr-combine `tr '\n' ' ' < order.txt` > output.txt.xml

And here is the error message:

Traceback (most recent call last):
  File "hocr-combine", line 18, in <module>
    doc = html.parse(args.filenames[0])
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 931, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:85131)
  File "src/lxml/parser.pxi", line 1782, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:124005)
  File "src/lxml/parser.pxi", line 1808, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:124374)
  File "src/lxml/parser.pxi", line 1712, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:123169)
  File "src/lxml/parser.pxi", line 1115, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:117533)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:110510)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:112276)
  File "src/lxml/parser.pxi", line 611, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:111078)
": failed to load external entity "hocr_out/8834993.html

8834993.html is the last file in the order file list.

stweil commented 5 years ago

That looks like a problem with the shell script. Try cat order.txt | xargs hocr-tools/hocr-combine > output.txt.xml.

kba commented 5 years ago

If you have (and @tboenig has) Windows newlines, this also works:

hocr-combine $(sed 's/\r//g' order.txt)

NB: No double quotes around the subshell call and no filenames with spaces in them.

stweil commented 5 years ago

hocr-combine could also be enhanced to accept a list of files from stdin. Then any number of files could be combined using hocr-combine < order.txt >output.txt.xml.
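A rough sketch of what reading the list from stdin could look like. hocr-combine does not necessarily do this; `read_file_list` is a hypothetical helper, and it assumes one file name per line with no significant leading or trailing whitespace.

```python
import sys


def read_file_list(stream):
    """Return the non-empty lines of `stream` with surrounding whitespace removed."""
    names = []
    for line in stream:
        name = line.strip()  # drops "\n" and also a stray "\r" from CRLF files
        if name:
            names.append(name)
    return names


if __name__ == "__main__":
    # Example: python this_sketch.py < order.txt
    for name in read_file_list(sys.stdin):
        print(name)
```

Because the names never pass through the shell's word splitting, file names containing spaces would survive as well.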

kba commented 5 years ago

Maybe a -i/--input-file parameter like wget(1) has:


--input-file=file
    Read URLs from a local or external file.
    If - is specified as file, URLs are read from the standard input.
    (Use ./- to read from a file literally named -.)

stweil commented 5 years ago

An order.txt with CRLF line endings explains the problem and the strange-looking debug message: the filename which was not found was hocr_out/8834993.html + CR, therefore the closing " is printed at the beginning of the line (that's the result of the invisible "carriage return"). Of course no such file exists.
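The effect is easy to reproduce. The sketch below imitates the message with a simplified format string (not lxml's actual code) and uses `repr` to make the invisible character visible; both function names are illustrative.

```python
def entity_error(name):
    """Build a message in the style of the one in the traceback (simplified)."""
    return 'failed to load external entity "%s"' % name


def strip_line_ending(line):
    """Remove trailing CR and LF characters from one line of a list file."""
    return line.rstrip("\r\n")


if __name__ == "__main__":
    raw = "hocr_out/8834993.html\r\n"      # one line of a CRLF file
    name = raw.rstrip("\n")                # removing only the LF leaves the CR behind
    # On a terminal the CR moves the cursor back to column 0, so the closing
    # quote is drawn over the start of the line.
    print(entity_error(name))
    # repr() shows the carriage return explicitly as \r.
    print(repr(name))
    print(repr(strip_line_ending(raw)))
```

Checking a suspicious name with `repr` (or the file with `od -c`) is a quick way to confirm whether a CR is present.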

tboenig commented 5 years ago

Hi @kba and @stweil, thanks for everything. Here is my solution: cat order.txt | sed 's/\r//' | xargs hocr-tools/hocr-combine > output.xml

@wrznr helped me with this. Thank you also.

The assumption that the problem lies in Windows CRLF line endings does not seem to apply in my case: the order file is produced under Linux and hocr-combine is also run under Linux.

wrznr commented 5 years ago

The need to add the sed expression actually indicates the presence of CR in the file order.txt. It would indeed be nice if hocr-combine could process argument file lists. But as far as I can see, there is no bug here.