Closed tboenig closed 5 years ago
Too many open files means you have exceeded the operating system's limit on open file descriptors.
Try raising the limit with ulimit -n in the shell before running hocr-combine (ulimit -u limits the number of user processes, not open files).
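To see which limit applies, the per-process descriptor limit can also be queried from Python. This is a minimal sketch for Unix-like systems only; the function name open_file_limits is made up for illustration. A soft limit of 1024, a common default, would be consistent with the failure at the 1022nd file reported in this thread, since stdin, stdout and stderr already occupy three descriptors.

```python
import resource  # Unix-only standard library module


def open_file_limits():
    """Return (soft, hard) limits on open file descriptors for this process.

    RLIMIT_NOFILE is the limit that `ulimit -n` reports in the shell.
    """
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print("soft limit:", soft, "hard limit:", hard)
```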
However, the script should not open 1000 files without closing them; that smells like a bug.
It looks like an inherent problem of argparse.FileType, which opens all filenames given on the command line while parsing the arguments. Only hocr-combine allows any number of filenames, so the other scripts don't have that problem.
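A minimal sketch of one way to avoid this, not necessarily the fix that went into hocr-combine: accept the filenames as plain strings and open each file only while it is being processed. The names build_parser and read_sizes_lazily are hypothetical.

```python
import argparse


def build_parser():
    """Parser that keeps filenames as plain strings (nothing is opened yet)."""
    parser = argparse.ArgumentParser(description="lazy file handling sketch")
    # With type=argparse.FileType('r') every file would be opened here,
    # during parse_args(), and stay open until explicitly closed.
    parser.add_argument("filenames", nargs="+", help="input files")
    return parser


def read_sizes_lazily(filenames):
    """Open one file at a time, so only one extra descriptor is in use."""
    sizes = []
    for name in filenames:
        with open(name, "rb") as handle:  # closed at the end of the block
            sizes.append(len(handle.read()))
    return sizes
```

With this shape the number of input files is bounded by the command-line length limit rather than by the descriptor limit.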
@tboenig, please try the latest code. It should work now with large numbers of files, too.
thanks @stweil for debugging.
At first combine worked, but it doesn't work in my shell scripts.
Here is my environment:
The order file (order.txt), which lists the files to combine. Contents of the order file:
hocr_out/8834993.html
hocr_out/8834994.html
Shell command: hocr-tools/hocr-combine `tr '\n' ' ' < order.txt` > output.txt.xml
And here the error message:
Traceback (most recent call last):
File "hocr-combine", line 18, in <module>
doc = html.parse(args.filenames[0])
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 931, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:85131)
File "src/lxml/parser.pxi", line 1782, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:124005)
File "src/lxml/parser.pxi", line 1808, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:124374)
File "src/lxml/parser.pxi", line 1712, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:123169)
File "src/lxml/parser.pxi", line 1115, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:117533)
File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:110510)
File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:112276)
File "src/lxml/parser.pxi", line 611, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:111078)
": failed to load external entity "hocr_out/8834993.html
8834993.html is the last file in the order file list
That looks like a problem with the shell script. Try:
cat order.txt | xargs hocr-tools/hocr-combine > output.txt.xml
If you have (and @tboenig has) Windows newlines, this works also:
hocr-combine $(sed 's/\r//g' order.txt)
NB: No double quotes around the subshell call and no filenames with spaces in them.
hocr-combine could also be enhanced to accept a list of files from stdin. Then any number of files could be combined using:
hocr-combine < order.txt > output.txt.xml
Maybe a -i/--input-file parameter like the one wget(1) has:
--input-file=file
Read URLs from a local or external file.
If - is specified as file, URLs are read from the standard input.
(Use ./- to read from a file literally named -.)
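Reading the list from a stream could look roughly like the following. This is only a sketch of the idea; hocr-combine does not have such an option according to this thread, and the function name read_file_list is hypothetical. Stripping a trailing CR also makes the reader tolerant of CRLF line endings.

```python
def read_file_list(stream):
    """Return the filenames listed one per line in `stream`.

    Trailing CR/LF characters are stripped and blank lines are skipped.
    """
    names = []
    for line in stream:
        name = line.rstrip("\r\n")
        if name:
            names.append(name)
    return names

# Possible use in a script: filenames = read_file_list(sys.stdin)
```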
An order.txt with CRLF line endings explains the problem and the strange-looking error message: the filename that was not found was hocr_out/8834993.html + CR, so the closing " is printed at the beginning of the line (that's the effect of the invisible "carriage return"). Of course no such file exists.
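The effect can be reproduced in a few lines. This sketch only imitates the shape of the message from the traceback; the variable names are illustrative.

```python
# Filename as it arrives from a line that ends in CRLF after only LF is removed.
name = "hocr_out/8834993.html\r"

# On a terminal the CR moves the cursor back to the start of the line,
# so the closing quote is drawn over the first character of the message.
message = 'failed to load external entity "%s"' % name

# repr() makes the otherwise invisible character visible.
print(repr(name))

# Stripping CR and LF yields the filename that actually exists.
cleaned = name.rstrip("\r\n")
print(repr(cleaned))
```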
Hi @kba and @stweil,
thanks for everything.
Here is my solution:
cat order.txt | sed 's/\r//' | xargs hocr-tools/hocr-combine > output.xml
@wrznr helped me with this. Thank you also.
The assumption that the problem lies in Windows CRLF line endings does not apply in my case. The order file was produced under Linux and the combine step is also run under Linux.
The need to add the sed expression actually indicates the presence of CR in the file order.txt. It would indeed be nice if hocr-combine could process argument file lists. But as far as I can see, there is no bug here.
Hi, I got the following error message:
hocr-combine: error: argument files: can't open 'hocr_out/9341474.html': [Errno 24] Too many open files: 'hocr_out/9341474.html'
There are 1308 hOCR files to be merged with combine. The file 9341474.html is the 1022nd hOCR file. As a workaround for now, I merged the first 1022 files and then added the rest.
Since it is not unusual to have more than 1000 files, I would suggest raising this limit.