ubarsc / rios

A raster processing layer on top of GDAL
https://www.rioshome.org
GNU General Public License v3.0

readBufferPopTimeout error #104

Closed juan-guerschman closed 1 month ago

juan-guerschman commented 1 month ago

Hey, relatively new to rios here.

I am running a process, with all files read locally on my laptop.

I set

concurrency = applier.ConcurrencyStyle(readBufferPopTimeout=100)
controls.setConcurrencyStyle(concurrency)

but I still get the following error:

Error in compute worker 6
Traceback (most recent call last):
  File "C:\Users\Juan\anaconda3\envs\base-rs-new\lib\site-packages\rios\computemanager.py", line 192, in worker
    (blockDefn, inputs) = inBlockBuffer.popNextBlock()
  File "C:\Users\Juan\anaconda3\envs\base-rs-new\lib\site-packages\rios\structures.py", line 625, in popNextBlock
    raise rioserrors.TimeoutError(msg)
rios.rioserrors.TimeoutError: BlockBuffer timeout. Number of blocks already popped: 6
    Try increasing readBufferPopTimeout (current value = 10)

What could be the problem, or how could I fix it?
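For background on what this error means: RIOS compute workers pop blocks from a bounded read buffer and give up if nothing arrives within readBufferPopTimeout seconds. The mechanism is analogous to a timed pop on Python's stdlib queue (a sketch of the idea only, not RIOS's actual BlockBuffer implementation):

```python
import queue
import threading

# A bounded buffer: read workers put blocks in, compute workers pop them out.
buf = queue.Queue(maxsize=4)

def compute_worker(pop_timeout):
    try:
        # Analogous to BlockBuffer.popNextBlock(): wait up to pop_timeout
        # seconds for a block to arrive, then give up.
        return buf.get(timeout=pop_timeout)
    except queue.Empty:
        return "timeout"

# Nothing is feeding the buffer, so a short timeout expires.
print(compute_worker(pop_timeout=0.1))   # -> timeout

# With a reader that delivers within the timeout, the pop succeeds.
threading.Timer(0.05, lambda: buf.put("block-0")).start()
print(compute_worker(pop_timeout=1.0))   # -> block-0
```

So the error above says the compute workers waited 10 seconds for the readers to deliver a block and nothing came, which usually means the readers are too slow, too few, or not running at all.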

neilflood commented 1 month ago

I don't really understand how this could be correct. If the concurrency style is created with only that timeout value, and nothing else, it would not be doing any concurrency, and would not be using that part of the code (I think). Is that really all you gave to ConcurrencyStyle?

Are you passing that controls object to the apply() function? Does the program work without setting any concurrency style?

gillins commented 1 month ago

@juan-guerschman can you try running outside of Jupyter?

juan-guerschman commented 1 month ago

> I don't really understand how this could be correct. If the concurrency style is created with only that timeout value, and nothing else, it would not be doing any concurrency, and would not be using that part of the code (I think). Is that really all you gave to ConcurrencyStyle?
>
> Are you passing that controls object to the apply() function? Does the program work without setting any concurrency style?

Yes, sorry, it's all new to me. That's the only setting I set up for concurrency, and yes, I pass it to the applier:

# Apply the function to add rasters
applier.apply(ml_soc, infiles, outfiles, otherArgs=otherargs, controls=controls)

What should I try first: setting up concurrency correctly (I'm not sure what else to do), or running outside of Jupyter?

juan-guerschman commented 1 month ago

Sorry, actually I do something else, copied from code I've seen elsewhere:

#controls.progress = cuiprogress.CUIProgressBar()
controls.setNumThreads(8)
controls.setJobManagerType('multiprocessing')
controls.windowxsize = 512
controls.windowysize = 512

I get this warning:

 WARNING: setNumThreads and setJobManagerType are now deprecated (v2.0.0). Please use setConcurrencyStyle instead. Emulating jobManagerType 'multiprocessing'

but I assume the deprecated calls are being translated and handled properly?

neilflood commented 1 month ago

Ah, OK, at least this now makes sense. Because of the use of the deprecated setNumThreads & setJobManagerType, it is ignoring the extra ConcurrencyStyle setting. You can't be doing both, so switch to the new one.

Delete the controls.setNumThreads and controls.setJobManagerType lines, and replace them with the following:

concurrency = applier.ConcurrencyStyle(
        numReadWorkers=4, 
        numComputeWorkers=4, 
        computeWorkerKind=applier.CW_THREADS)
controls.setConcurrencyStyle(concurrency)

This will do a reasonable job of concurrency, and you will be able to play with the number of read and compute workers to get the optimal performance.

When you used the old approach, it made a guess at some parameters, but not a very good guess, which then caused the timeout problem, which you were unable to address because it was ignoring the new concurrency style object.

Let me know how that goes.

gillins commented 1 month ago

@juan-guerschman what are you actually doing? The numReadWorkers=4 is mainly for increasing throughput when reading from S3, which has very high latency but scales well with increased traffic. Your laptop probably has a fixed maximum throughput, after which things will break down, so you might find this actually slows things down more than it speeds them up...

Also, your function needs to be quite CPU intensive before numComputeWorkers=4 helps. If it is a simple multiplication/addition you might find this doesn't help much either, unless you are doing something really intensive like a convolution or ML etc. You'll also need to check that the functions you are calling release the GIL (https://realpython.com/python-gil/) before you get good multithreading. Most of numpy and scipy now do, but YMMV.

Basically, what I'm saying is that you should try running it without any sort of concurrency and see how long it takes first. If that is a reasonable timeframe then maybe you don't need to do anything else...
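The GIL point above is easy to demonstrate with nothing rios-specific. In this sketch, time.sleep stands in for a call that releases the GIL (as most heavy numpy/scipy operations do while running C code), and a pure-Python loop stands in for one that holds it:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_like(_):
    # time.sleep releases the GIL, so the four calls genuinely overlap.
    time.sleep(0.2)

def cpu_bound(_):
    # Pure-Python bytecode holds the GIL, so the threads mostly take turns.
    total = 0
    for i in range(2_000_000):
        total += i
    return total

for func, label in [(io_like, "GIL released"), (cpu_bound, "GIL held")]:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(func, range(4)))
    print(f"{label}: {time.perf_counter() - start:.2f}s with 4 threads")
```

The GIL-released case finishes in roughly the time of one call; the GIL-held case gets little or no benefit from the extra threads.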

neilflood commented 1 month ago

Yes, I agree with all that @gillins says. I had assumed you were using an existing program, but if this is a new program, you should first run it without any concurrency, then, if necessary, slowly work out how to use concurrency to speed it up. Using concurrency is not trivial, so if you want to use that, please read the documentation at https://www.rioshome.org/en/latest/concurrency.html

juan-guerschman commented 1 month ago

OK, thanks. This is reading a bunch of rasters (all local), stacking them up, and running a RandomForest model developed with sklearn. I copied an old code from Peter Sc with a very similar use case that used multithreading. That's also why I did

controls.setNumThreads(8)
controls.setJobManagerType('multiprocessing')

Trying with no concurrency would simply be all the same, but without calling controls.setConcurrencyStyle(concurrency)?

neilflood commented 1 month ago

Just don't call setConcurrencyStyle at all. The default is no concurrency.

Thanks :-)

gillins commented 1 month ago

Ah yes, sklearn is a good example of something that might conceivably improve with multiple compute workers... Still not sure about multiple read workers. Anyway, see how long it takes without anything. It might be worth printing the report (https://www.rioshome.org/en/latest/concurrency.html#quick-start) so we have something to guide us...

juan-guerschman commented 1 month ago

a small test area, with concurrency:

Wall clock elapsed time: 9.6 seconds

Timer                      Total (sec)
--------------------------------
reading                     21.9
userfunction                24.0
writing                      0.0
closing                      1.6
insert_readbuffer            0.0
pop_readbuffer               2.9
insert_computebuffer         0.0
pop_computebuffer            7.2

without:

Wall clock elapsed time: 17.9 seconds

Timer                      Total (sec)
--------------------------------
reading                      4.7
userfunction                11.8
writing                      0.0
closing                      1.0

I'll use concurrency and send the report from my big area.

neilflood commented 1 month ago

Thanks @juan-guerschman That looks like a great result. In the concurrency run, there is very little time spent waiting for the various buffers, and a good balance between time in reading and userfunction. Brilliant.

juan-guerschman commented 1 month ago

and this is the same process on a larger area:

Wall clock elapsed time: 3192.6 seconds

Timer                      Total (sec)
--------------------------------
reading                   8625.4
userfunction              9075.7
writing                     29.2
closing                    139.3
insert_readbuffer         3575.1
pop_readbuffer            2653.5
insert_computebuffer         1.1
pop_computebuffer         3022.3

neilflood commented 1 month ago

Thanks.

I can't really tell anything definite from just these numbers by themselves. As discussed in the doc page, there are so many interactions that the best approach to finding the optimal concurrency parameters is to play with different combinations and see which runs the fastest.

From your initial runs without concurrency, it does appear that the compute time is probably more important than the reading time, so I would suggest focusing on that.

Given that the files are on a local disk, the operating system will be doing lots of good work with disk cache and so on, and will itself require some CPU time to do its work, so don't have the total number of workers greater than the number of CPU cores on your machine. Exceeding the core count can be useful when there is a lot of latency in the input device, but probably not in this situation.

Try varying the balance between read and compute workers. Given that the compute time seems large, you may benefit by having more compute workers than read workers, but you will still need some read workers to be able to keep up with the compute. So, try a few different combinations around where you have started, and see what helps.
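One way to organise that experiment is a small grid search over worker counts. This is a generic harness, not RIOS code: run_once here is a hypothetical placeholder that just sleeps so the sketch is runnable on its own; in a real script it would build a ConcurrencyStyle from each combination and call applier.apply with it:

```python
import itertools
import time

def run_once(num_read_workers, num_compute_workers):
    # Placeholder standing in for a full processing run with the given
    # worker counts; substitute your real applier.apply call here.
    time.sleep(0.01 * (num_read_workers + num_compute_workers))

# Time each combination of read/compute worker counts.
results = {}
for nread, ncompute in itertools.product([1, 2, 4], [2, 4, 6]):
    start = time.perf_counter()
    run_once(nread, ncompute)
    results[(nread, ncompute)] = time.perf_counter() - start

best = min(results, key=results.get)
print(f"fastest combination: {best[0]} read / {best[1]} compute workers")
```

Because each full run can take a long time, it is worth doing this search on a small test area first and only then confirming the winning combination on the large one.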

It is always good to have a baseline to compare with, so also do a full run with no concurrency, so you know how long that takes. If you do a run which takes longer than that, then you are heading in the wrong direction.

I will close this Issue off now, as I think you are well on the way to understanding how to work with RIOS's new concurrency features. Up to you now to explore what works for this particular problem, on this particular hardware.

juan-guerschman commented 1 month ago

awesome, thanks Neil