Open szhorvat opened 2 years ago
With Bison/Flex based parsers, one approach is as follows:
YY_USER_INIT
from Flex to check for interruption.context
to allow distinguishing between true errors and interruptions. This flag is initialized to false
and set to true
from YY_USER_INIT
when interruption occurs. Then it is read again in the top level code when error handling is done.Maybe we should speed them up? I saw GraphDB read one character at a time, which is very slow. Reading in a megabyte and processing it should probably not take a minute.
Regarding GraphDB: There's no point in improving it. This format is used by a single database, which is not even online anymore. It is not an exchange format.
Regarding the others: Normal files that people would use are read reasonably quickly. However, it is possible to craft a malicious corrupted file that will take a long time to process. This could be used as part of a DoS attack on some online system that uses igraph to interpret graph formats. Most parsers are based on Bison/Flex, so it is not our code that does the IO, and there is no issue with reading characters one by one.
I opened this issue more of as a note to record how to go about this change, since I've already looked into it. It's not urgent.
@GroteGnoom It seems that this isomorphism testing database is actually well and alive. I just push a commit that adds the link to the docs. Feel free to experiment and see if the performance is good enough. See b04d847f4b06da8bed9bd3d809eb6407b67bc4a7
Reading characters one by one should not be a problem; the underlying IO subsystem is buffered, so when you ask for the first character, it fills a buffer with more than one, and then subsequent calls simply return the already-read characters from the buffer until it becomes empty (at which point there will be another real IO call).
Nevertheless, adding interruption support seems like a good idea. @szhorvat did you mean YY_USER_ACTION
to check for interruptions?
Yes, it seems like the best option.
I looked into this in some details. If either of you want to do it soon, we should talk first and I'll let you know what I found out so far. Otherwise I'll do this myself at some later time.
We can talk about this on Thursday on the dev meeting then.
The cleanest solution is probably using the push parser interface instead of the pull parser one. It's unclear whether this is available with the ancient Bison 2.3 that comes with macOS. I do not think we should continue supporting this version, there's very little value in it.
Another reason for adding interruptability is that in some system, each time interruption is allowed, some parts of the host system can run. On Windows, this keeps R's GUI responsive. In Mathematica, this allows the "Dynamic" functionality to continue working. I'm guessing it might be needed for proper multithreading in Python as well?
Push parsers were introduced in Bison 2.3b in 2008. macOS has Bison 2.3 from 2006.
@ntamas Can you please do this for the GraphML reader? I'll assign both of us to this, I'll deal with Bison, and I'd like to leave GraphML to you. As far as I can tell, it's fairly easy to add interruption support in loop calling xmlParseChunk()
, one should be careful and pay attention to resource allocation/deallocation.
I'm guessing it might be needed for proper multithreading in Python as well?
There's no proper multithreading in CPython at all, at least not if you are running Python code in multiple threads. The interpreter is single-threaded, it has a Global Interpreter Lock, and you must hold the lock whenever you work with any PyObject*
, so the best that you can achieve is that two or more threads take turns in accessing the Python interpreter. If you want a thread running some igraph calculation to run truly concurrently with other threads, you would need to release the interpreter lock right before calling an igraph function and then re-acquire it after returning from the igraph function, but before turning the returned igraph_error_t
into a Python exception. This would be a massive change for the current hand-written C glue code in the Python layer, and we would also need to be extra careful to re-acquire the lock when the attribute handler functions are called, or when a callback is called by igraph that might in turn call into Python.
There would be a tiny improvement, though: if we assume that the Global Interpreter Lock (GIL) is held by igraph when igraph code is running, we could release and immediately re-acquire it in the interruption handler, which would give other threads a chance to get their turn in accessing the Python interpreter. This would be nice, but considering that the standard solution in Python to do parallel calculations is to run multiple processes, I don't think it's a priority there.
Can you please do this for the GraphML reader?
will do.
GraphML reader is now interruptible since 81531cbb9
Status update:
The non-hacky way to implement this is to ask Bison to generate a push parser interface. This means that we are repeatedly calling the parser in a loop, which provides an opportunity to check for interruption. The obstacles with using a push parser interface are:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Foreign format readers should be interruptible. The experience with OSS-fuzz shows that with most non-trivial formats, it is possible to have a very short input file (under a megabyte) that takes over a minute to process. Therefore readers should be interruptible.
Bison/Flex based:
libxml2 based:
Custom: