Foreign format readers should be interruptible

szhorvat commented 2 years ago

Foreign format readers should be interruptible. The experience with OSS-fuzz shows that with most non-trivial formats, it is possible to have a very short input file (under a megabyte) that takes over a minute to process. Therefore readers should be interruptible.

Bison/Flex based:

[ ] Pajek — priority, worst offender
[ ] GML — priority
[ ] DL
[ ] NCOL
[ ] LGL

libxml2 based:

[x] GraphML (81531cbb9)

Custom:

[x] DIMACS
[x] edgelist
[x] GraphDB — low priority, no evidence that it can lock up

szhorvat commented 2 years ago

With Bison/Flex based parsers, one approach is as follows:

Use YY_USER_INIT from Flex to check for interruption.
If an interruption is detected, return an error token that Bison has no rule to handle (needs to be defined if it isn't).
Add a boolean flag to the context to allow distinguishing between true errors and interruptions. This flag is initialized to false and set to true from YY_USER_INIT when interruption occurs. Then it is read again in the top level code when error handling is done.

GroteGnoom commented 2 years ago

Maybe we should speed them up? I saw GraphDB read one character at a time, which is very slow. Reading in a megabyte and processing it should probably not take a minute.

szhorvat commented 2 years ago

Regarding GraphDB: There's no point in improving it. This format is used by a single database, which is not even online anymore. It is not an exchange format.

Regarding the others: Normal files that people would use are read reasonably quickly. However, it is possible to craft a malicious corrupted file that will take a long time to process. This could be used as part of a DoS attack on some online system that uses igraph to interpret graph formats. Most parsers are based on Bison/Flex, so it is not our code that does the IO, and there is no issue with reading characters one by one.

I opened this issue more of as a note to record how to go about this change, since I've already looked into it. It's not urgent.

szhorvat commented 2 years ago

@GroteGnoom It seems that this isomorphism testing database is actually well and alive. I just push a commit that adds the link to the docs. Feel free to experiment and see if the performance is good enough. See b04d847f4b06da8bed9bd3d809eb6407b67bc4a7

ntamas commented 2 years ago

Reading characters one by one should not be a problem; the underlying IO subsystem is buffered, so when you ask for the first character, it fills a buffer with more than one, and then subsequent calls simply return the already-read characters from the buffer until it becomes empty (at which point there will be another real IO call).

Nevertheless, adding interruption support seems like a good idea. @szhorvat did you mean YY_USER_ACTION to check for interruptions?

szhorvat commented 2 years ago

Yes, it seems like the best option.

I looked into this in some details. If either of you want to do it soon, we should talk first and I'll let you know what I found out so far. Otherwise I'll do this myself at some later time.

ntamas commented 2 years ago

We can talk about this on Thursday on the dev meeting then.

szhorvat commented 2 years ago

The cleanest solution is probably using the push parser interface instead of the pull parser one. It's unclear whether this is available with the ancient Bison 2.3 that comes with macOS. I do not think we should continue supporting this version, there's very little value in it.

Another reason for adding interruptability is that in some system, each time interruption is allowed, some parts of the host system can run. On Windows, this keeps R's GUI responsive. In Mathematica, this allows the "Dynamic" functionality to continue working. I'm guessing it might be needed for proper multithreading in Python as well?

szhorvat commented 2 years ago

Push parsers were introduced in Bison 2.3b in 2008. macOS has Bison 2.3 from 2006.

szhorvat commented 2 years ago

@ntamas Can you please do this for the GraphML reader? I'll assign both of us to this, I'll deal with Bison, and I'd like to leave GraphML to you. As far as I can tell, it's fairly easy to add interruption support in loop calling xmlParseChunk(), one should be careful and pay attention to resource allocation/deallocation.

ntamas commented 2 years ago

I'm guessing it might be needed for proper multithreading in Python as well?

There's no proper multithreading in CPython at all, at least not if you are running Python code in multiple threads. The interpreter is single-threaded, it has a Global Interpreter Lock, and you must hold the lock whenever you work with any PyObject*, so the best that you can achieve is that two or more threads take turns in accessing the Python interpreter. If you want a thread running some igraph calculation to run truly concurrently with other threads, you would need to release the interpreter lock right before calling an igraph function and then re-acquire it after returning from the igraph function, but before turning the returned igraph_error_t into a Python exception. This would be a massive change for the current hand-written C glue code in the Python layer, and we would also need to be extra careful to re-acquire the lock when the attribute handler functions are called, or when a callback is called by igraph that might in turn call into Python.

There would be a tiny improvement, though: if we assume that the Global Interpreter Lock (GIL) is held by igraph when igraph code is running, we could release and immediately re-acquire it in the interruption handler, which would give other threads a chance to get their turn in accessing the Python interpreter. This would be nice, but considering that the standard solution in Python to do parallel calculations is to run multiple processes, I don't think it's a priority there.

Can you please do this for the GraphML reader?

will do.

ntamas commented 2 years ago

GraphML reader is now interruptible since 81531cbb9

szhorvat commented 1 year ago

Status update:

The non-hacky way to implement this is to ask Bison to generate a push parser interface. This means that we are repeatedly calling the parser in a loop, which provides an opportunity to check for interruption. The obstacles with using a push parser interface are:

It requires a more recent Bison than what is included in macOS, so we would need to bump the version requirement. I believe Bison 2.4 supports it.
According to the Bison documentation, it reduces performance. This concerns me so I'm putting this on hold for the moment.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

igraph / igraph

Foreign format readers should be interruptible #2200