sccn / liblsl

C++ lsl library for multi-modal time-synched data transmission over the local network

`pull_chunk(std::vector<std::vector<T>>)` doesn't take `max_chunk_len` into account #163

Open dmedine opened 2 years ago

dmedine commented 2 years ago

In the C++ API, this overload reduces to pulling 1 sample at a time until there are no more samples available: https://github.com/sccn/liblsl/blob/master/include/lsl_cpp.h#L1213-L1222
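For reference, the overload boils down to something like this (a paraphrase of the linked lines written as a free function against the public API, not a verbatim copy):

```cpp
#include "lsl_cpp.h"
#include <vector>

// Paraphrase of the linked overload: keep pulling single samples
// (non-blocking) until the inlet's queue is empty.
template <class T>
void pull_chunk_paraphrase(lsl::stream_inlet &inlet, std::vector<std::vector<T>> &chunk,
	std::vector<double> &timestamps) {
	chunk.clear();
	timestamps.clear();
	std::vector<T> sample;
	while (double ts = inlet.pull_sample(sample, 0.0)) {
		chunk.push_back(sample);
		timestamps.push_back(ts);
	}
}
```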

This is a problem when debugging: if you are stepping through code that pulls from an inlet via this overload, you will inevitably end up in a position where you have many, many samples in the queue, and then all of a sudden you are pulling not, say, 10 samples at a time, but tens of thousands. This can cause annoying errors for consumers of the data, and possibly allocation errors in the vector itself.

It would be simple enough to use max_chunk_len to limit the number of samples that can be pulled in the definition of the overload, except in the case where max_chunk_len is 0; then the inlet would need to get that information from the outlet. I haven't had the time to look into it, and I don't know if that is possible as of now. I guess the outlet would have to put that information into the info object that it shares with the inlet.
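A sketch of that idea, applied to the loop above (max_samples is a hypothetical variable; it would come from max_chunk_len, or from the outlet when max_chunk_len is 0):

```cpp
// Hypothetical change: stop once max_samples is reached instead of
// draining the queue completely.
while (chunk.size() < max_samples) {
	double ts = inlet.pull_sample(sample, 0.0);
	if (ts == 0.0) break; // no more samples available right now
	chunk.push_back(sample);
	timestamps.push_back(ts);
}
```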

dmedine commented 2 years ago

I perused the source and am still not clear how (or even if) maxChunkLen is being taken into account. However, I ran some tests and found the following.

  1. When calling pull_chunk(vector<vector<T>>, ...) on an inlet, maxChunkLen has no effect. This makes perfect sense, since this overload reduces to pulling sample by sample until no more samples are left to pull. Memory for the sample container is handled dynamically (on the heap) via std::vector.
  2. When calling pull_chunk_multiplexed(T*, ...), maxChunkLen is somewhat respected, but not always. I ran the C++ program below and let the samples buffer up by breaking on the `int foo = 1;` line. If maxChunkLen were working properly, sz should never be more than 40. But sometimes it is.

Basically, AFAICT, in the C++ API at least, setting maxChunkLen on an inlet does nothing. It also appears that at no point in the connection handshake does the outlet tell the inlet what its specified chunk size is.

What should be the case, according to the documentation, is that a client (inlet) can specify a max chunk length, and if it is 0, the server (outlet) will set it for the client. Neither appears to be true.

#include "lsl_cpp.h"
#include <vector>
const int maxChunkLen = 104;
int main()
{
    lsl::stream_inlet inlet(lsl::resolve_stream("type", "EEG")[0], 360, maxChunkLen);
    std::vector<std::vector<float>> data;
    float* multiplexed_data = (float*)malloc(maxChunkLen * sizeof(float));
    double* timestamps = (double*)malloc(maxChunkLen/8 * sizeof(double));
    while (1)
    {

        //if (double ts = inlet.pull_chunk(data))
        if (size_t sz = inlet.pull_chunk_multiplexed(multiplexed_data, timestamps, maxChunkLen, maxChunkLen/8))
            int foo = 1;
    }
}
cboulay commented 2 years ago

I think it's a really unfortunate parameter name. maxChunkLen has nothing to do with the chunk size in pull_chunk*. Instead, maxChunkLen specifies the maximum chunk size for outlet->inlet transport, i.e., how many samples can get buffered up before they are forced across the network.

The only reason you might want to change maxChunkLen is to increase it (up to a point) for more efficient network transport, or to decrease it to reduce latency at the cost of higher CPU usage.
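To make the distinction concrete, here is where the two knobs live in the C++ API (a minimal sketch; the stream parameters are arbitrary):

```cpp
#include "lsl_cpp.h"

int main() {
	// Outlet side: chunk_size sets the transport granularity in samples
	// (0 = each push is sent as-is).
	lsl::stream_info info("demo", "EEG", 8, 100.0, lsl::cf_float32);
	lsl::stream_outlet outlet(info, /*chunk_size=*/32, /*max_buffered=*/360);

	// Inlet side: max_chunklen is documented to override the sender's
	// chunk_size for this connection (0 = use the sender's setting).
	// It does not cap how many samples pull_chunk returns.
	lsl::stream_inlet inlet(lsl::resolve_stream("type", "EEG")[0],
		/*max_buflen=*/360, /*max_chunklen=*/32);
}
```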

However, it doesn't even do that properly. See this other similar issue I ran into:

https://github.com/sccn/liblsl/issues/96

dmedine commented 2 years ago

Ah, you are right. I misunderstood the inline documentation. Then maybe this needs to go on a wish list. It would be a lot of fiddly work to make max_chunk_len/maxChunkLen actually behave that way for an inlet.

A simple improvement would be to add an appropriately named optional parameter, just on the inlet side, that does what its name advertises, e.g. maxChunkLengthOnPull. However, I think that would be rather confusing given the name of the other parameter.

In any case, I don't think this parameter actually does anything. From reading the source code, I also suspect that setting max_buflen on the inlet side has no effect, but I haven't tested it.

cboulay commented 2 years ago

BTW, when you use pull_chunk_multiplexed, after skipping a few hops you eventually end up here: https://github.com/sccn/liblsl/blob/e16db0524f52ee74288f313d63f361e2499fda26/src/stream_inlet_impl.h#L205-L211

It will pull as many samples as are available, up to max_samples (the size of the data buffer passed as the 3rd argument divided by the number of channels).
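Applied to the test program above (whose maxChunkLen/8 timestamp buffer implies an 8-channel stream), that cap works out as follows:

```cpp
// The 3rd argument to pull_chunk_multiplexed is the data buffer size in
// *elements*, not samples; the per-call cap is elements / channels.
std::size_t buffer_elements = 104; // maxChunkLen in the test program
std::size_t n_channels = 8;        // assumed channel count of the EEG stream
std::size_t max_samples = buffer_elements / n_channels; // = 13 samples per call
```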

And I totally agree that there's something not working here, and it doesn't help that the naming and docstrings are confusing. I've had the exact same confusion at least a couple times.

cboulay commented 2 years ago

I can wave my hands and say "asio is doing some advanced packetization to optimize the transport", and that's great, but we should be able to override it. Otherwise the max_chunklen parameter has no purpose.

dmedine commented 2 years ago

> BTW, when you use pull_chunk_multiplexed, after skipping a few hops you eventually end up here:

But that is only because you have to say how big your buffers are (C-style) in pull_chunk_multiplexed. With pull_chunk(vector<vector<T>>, ...) you can't.

This all came about because I want to add support for List<T> in my C# wrapper. So far, I do exactly what the C++ API does, but now I can't debug some of my client code because of this issue. That being said, for completeness' sake, I should probably add PullChunkMultiplexed(T[], ...) to my wrapper as well.

dmedine commented 2 years ago

"asio is doing some advanced packetization to optimize the transport"

Hundred percent, but even so, it shouldn't matter, right? If we wish to pull more samples than ASIO is willing to hand us, we wait for more; if we ask for fewer, the extra should just sit in the queue waiting to be pulled. That's the way it works in my head, anyway.

cboulay commented 2 years ago

Hmm... one thing we could do in pull_chunk(vector<vector<T>>, ...) is check whether the outer vector has a reserved size via its capacity(); if it's non-zero, we can assume it has been pre-allocated and that this is the maximum size the user wants returned.

Would that meet your use case?
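As a sketch, the check inside the overload could look something like this (hypothetical, not current liblsl code):

```cpp
#include <limits>

// Hypothetical: treat a pre-reserved outer vector as the caller's cap;
// an unreserved vector keeps today's drain-everything behavior.
std::size_t max_samples = chunk.capacity()
	? chunk.capacity()
	: std::numeric_limits<std::size_t>::max();
// ...then stop the pull_sample loop once chunk.size() == max_samples.
```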

Note that vector<vector<T>> is not a great container for this. The channels within a single frame are contiguous, but the last sample in frame t is not contiguous with the first sample in frame t+1, because of the extra bookkeeping in the inner vector. I don't know C# types very well. Will List<T> still work?

cboulay commented 2 years ago

Actually, it's not so straightforward. Let's say a user is reusing the same vector and they don't reserve. On their first pull_chunk the vector will allocate memory to store however many samples are available. On all subsequent pulls, they will be limited to however many samples were available on the first pull, even if this was a small number. It seems dangerous.

We could add a new flag to stream_inlet recording whether the first pull attempt had non-zero capacity; if so, subsequent pulls could use capacity as the de facto max_samples. But now we're getting into crazy town.

It's probably just better to add an optional `size_t max_samples = 0` parameter.
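That is, something along these lines (a hypothetical signature, not current liblsl API):

```cpp
// Hypothetical extension of the existing overload; max_samples = 0 would
// keep today's behavior of draining the whole queue.
template <class T>
void pull_chunk(std::vector<std::vector<T>> &chunk, std::vector<double> &timestamps,
	std::size_t max_samples = 0);
```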

dmedine commented 2 years ago

The C# wrapper doesn't wrap the C++ API, but rather the C implementation. So the API uses List<T> instead of vector, which follows roughly the same rules.

Also, in the C++ API (and my C# wrapper by the same token), I don't think the non-contiguity between frames is a big performance problem. I believe (could be wrong) that at the implementation level, the non-contiguous memory from frame to frame is just pointers to pointers, and that's about as fast as you can get. pull_chunk_multiplexed is there to avoid allocating memory, but I would be surprised if it really speeds things up much. Each frame's storage is allocated on the heap, so under 'good' conditions it should be pretty darn fast to grab that space.

Also, vector is good for other reasons, e.g. familiarity, simplicity, etc. ...but I digress.

dmedine commented 2 years ago

> It's probably just better to add an optional `size_t max_samples = 0` parameter.

Fully agree. Giving the client container the power to determine this is not a good idea.

dmedine commented 2 years ago

This is the dumb way I did it in my v.0 C# wrapper: https://github.com/Diademics-Pty-Ltd/LibLSL/blob/main/src/lib/StreamInlet.cs#L170-L195

This is wrong, though, since the variable means something else (that doesn't appear to work...): https://github.com/Diademics-Pty-Ltd/LibLSL/blob/main/src/lib/StreamInlet.cs#L38

cboulay commented 2 years ago

RE speed and contiguous samples: one place where this is really useful is when using numpy. If you preallocate a C-contiguous array of shape (max_samples, channels) with the correct dtype, you can pass that memory directly to the DLL, and when the call returns your array is all set and ready to use in Python. Well, you have to remember to take `valid_view = array[:n_returned]`. Without the numpy memory, an extra list-of-lists manipulation is required, which in Python can be quite expensive. I don't have the numbers, but I profiled the two approaches and the savings were substantial. I use the numpy memory approach with pylsl whenever I can.

dmedine commented 2 years ago

That could very well be, but the Python API, if I understand it correctly, isn't calling the C++ API but the C API directly. The vector business exists only in the C++ layer (even though underneath it fills the vectors with calls to the C API). I would be interested to run a performance test using the C++ API and see who wins, or rather by how much. I am sure the pre-allocated method will be faster.