ocsigen / lwt

OCaml promises and concurrent I/O
https://ocsigen.org/lwt
MIT License

The Lwt library raises CamlinternalLazy.Undefined #691

Closed FardaleM closed 5 years ago

FardaleM commented 5 years ago

I wrote a piece of code to run computations on multiple machines. A node receives a computation and runs it with a call to Lwt.async. Some nodes crash with this error:

Fatal error: exception CamlinternalLazy.Undefined
Raised at file "map.ml", line 135, characters 10-25
Called from file "src/core/lwt.ml", line 796, characters 20-60

The code can be found here. I don't know if this is related, but the computation running on the node writes a lot of output to stderr.

raphael-proust commented 5 years ago

Are you sure Lwt.async is what you want? The documentation of async specifically states “[Lwt.async] is misleadingly named. Itself, it has nothing to do with asynchronous execution.”
Are you looking for Lwt_preemptive.detach or something of the sort?

Anyway, that doesn't address the exception… Line 796 is about recovering “promise-local” values. These values would not travel across your cluster's machines. Does the code you execute on your cluster use “promise-local” values?
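
For readers unfamiliar with the term, here is a minimal sketch of a promise-local value, assuming the lwt package is installed. Lwt.new_key, Lwt.with_value and Lwt.get are the Lwt functions involved; the key name job_id and the helper current_job are hypothetical and only serve the illustration.

```ocaml
(* Requires the lwt package. [job_id] and [current_job] are
   hypothetical names used only for this illustration. *)
let job_id : string Lwt.key = Lwt.new_key ()

(* Read the promise-local value, with a fallback when none is set. *)
let current_job () =
  match Lwt.get job_id with
  | Some id -> id
  | None -> "<unset>"

(* Run [f] with the key bound to [id]; the binding is only visible
   inside [f] and in the callbacks it registers. *)
let with_job id f = Lwt.with_value job_id (Some id) f

let () =
  print_endline (current_job ());
  with_job "job-1" (fun () -> print_endline (current_job ()));
  print_endline (current_job ())
```

The value lives in the memory of one process and is attached to the callbacks created while it is set, which is why it cannot follow a computation to another machine.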

FardaleM commented 5 years ago

Yes, I know. I added Lwt.async as a quick patch for another problem. I need to look at it.

What is a "promise-local" value? What runs on the cluster is the following: each node runs a server with Lwt_io.establish_server_with_client_address that receives computations. A computation is a bash script that will run on the machine. Once the script has been received, the computation is run in the Lwt.async call. The script is started with Lwt_process.exec, and the program reads stdout and stderr and opens a connection to the server to send them back to the master node.

In my current setup, each node can run 3 jobs concurrently. From my tests, this exception is raised when the bash script writes a lot of output to stderr and also when the 3 jobs start at the same time. (The script runs for 25 min with a timeout.)

aantron commented 5 years ago

Tell us what version of Lwt is installed when this is happening.

If you want us to help you further, please give detailed instructions for how to run whatever test you are running, so we can run it ourselves.

FardaleM commented 5 years ago

I run experiments on 7 nodes + 1 master. This bug appears quite often when I run some computations. The following computation runs for 20 min. I have not had time to find a smaller and easier example yet. To reproduce this, you need to compile this on each node; it will be the main program of the computation. You need to download and extract this on each node; these are the inputs of the program. You also need to get the files batch_steiner.sh, config_master, and config_node from there, then adapt the paths in batch_steiner.sh to your setup and the IP in config_master.

With all this, you can start a master server with this command:

OCAMLRUNPARAM=b ocluster master config_master -vvvv

a node with this command:

OCAMLRUNPARAM=b ocluster node config_node -vvv

and submit a job to the master with this command:

ocluster client -i 100 batch_steiner.sh strongpassword 127.0.0.1

You can use ocluster --help to see a short man page.

FardaleM commented 5 years ago

I have two new errors. The program was the same, but with TL in batch_steiner.sh set to 10, which means the program runs for only 10 s.

Fatal error: exception CamlinternalLazy.Undefined
Raised at file "map.ml", line 135, characters 10-25
Called from file "src/core/lwt.ml", line 796, characters 20-60

and

Fatal error: exception Unix.Unix_error(Unix.EMFILE, "pipe", "")
Raised at file "src/core/lwt_pqueue.ml", line 61, characters 15-30
Called from file "src/core/lwt_pqueue.ml", line 69, characters 12-24

raphael-proust commented 5 years ago

That second one is a system error: EMFILE means the process has too many file descriptors open, I think. The most likely cause is that you don't close the file descriptors when the program crashes/ends. It's probably unrelated to Lwt.
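
As a rough illustration of closing a descriptor on every exit path, here is a minimal sketch using only the OCaml standard library (Fun.protect needs OCaml 4.08 or later). It uses plain channels rather than Lwt, and the helper first_line and the temporary file are hypothetical; in Lwt code, Lwt.finalize plays a similar role for promises.

```ocaml
(* [first_line] is a hypothetical helper for this illustration.
   The channel is closed whether [input_line] returns or raises,
   so the underlying file descriptor is not leaked. *)
let first_line path =
  let ic = open_in path in
  Fun.protect
    ~finally:(fun () -> close_in_noerr ic)
    (fun () -> input_line ic)

let () =
  (* Create a small temporary file to read back. *)
  let path = Filename.temp_file "fd_demo" ".txt" in
  let oc = open_out path in
  Fun.protect
    ~finally:(fun () -> close_out_noerr oc)
    (fun () -> output_string oc "hello\nworld\n");
  print_endline (first_line path);
  Sys.remove path
```

A long-running node that opens pipes or sockets per job without such a guard can reach the per-process descriptor limit even though each individual job is short.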

aantron commented 5 years ago

Closing this for now, please re-open if you have more info.

The CamlinternalLazy.Undefined error is strange, because what is raised at that line in map.ml (part of the compiler standard library) is Not_found. This makes me wonder if the exception printer is printing the exception correctly. I wonder if you are marshalling values or exceptions in a strange way.

However, that Not_found should be caught by Lwt, on line 801 in lwt.ml. So, this suggests that the problem is not in the exception printer, but the representation of the exception is wrong for some reason, so that even pattern-matching on it does not work.
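
To make the reasoning above concrete, here is a minimal standard-library sketch of the pattern being described: a Map lookup that raises Not_found on a missing key, with the caller catching it. This is not the actual code from lwt.ml; Int_map and find_or_none are hypothetical names, and Map.Make (Int) assumes OCaml 4.08 or later.

```ocaml
module Int_map = Map.Make (Int)

(* Map.find raises Not_found when the key is absent; the handler
   turns that into None. If the raised value did not match the
   Not_found pattern, it would escape this handler. *)
let find_or_none key map =
  try Some (Int_map.find key map) with
  | Not_found -> None

let () =
  let storage = Int_map.(empty |> add 1 "one") in
  assert (find_or_none 1 storage = Some "one");
  assert (find_or_none 2 storage = None)
```

An exception escaping such a handler is what one would expect to see if the value raised no longer matched Not_found, which is the situation the comment above suspects.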

@raphael-proust correctly commented on EMFILE. You can see documentation for it in the man pages for pipe(2) on your system.

aantron commented 5 years ago

(however, I think fds are closed automatically by the OS when a process is terminated, so that's not likely to be the underlying issue, at least not in a simple way. the other end of pipes may be holding on to descriptors of dead peers, though)

FardaleM commented 5 years ago

> Closing this for now, please re-open if you have more info.

OK, I will try to reduce the code to a smaller example. I was not able to reproduce the error with the node and master on the same computer.

> The CamlinternalLazy.Undefined error is strange, because what is raised at that line in map.ml (part of the compiler standard library) is Not_found. This makes me wonder if the exception printer is printing the exception correctly. I wonder if you are marshalling values or exceptions in a strange way.

I do not use any marshalling. I use JSON to pass values between the nodes and the master.