iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

Re-tune the number of splits in the `uproot.dask` call #63

Closed gordonwatts closed 4 months ago

gordonwatts commented 4 months ago

Currently we split every file into 20 segments. This likely generates too much task overhead. Try different values on a 1 TB dataset to see what we should be using. Err on the side of low numbers.
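A minimal harness for this kind of sweep might look like the sketch below. The wiring of `open_dataset` to something like `uproot.dask(files, steps_per_file=steps)` is an assumption (check the parameter name against your uproot version); the harness itself just times a count for each candidate split setting.

```python
import time
from typing import Callable, Iterable


def sweep_splits(open_dataset: Callable[[int], object],
                 count: Callable[[object], int],
                 candidates: Iterable[int]) -> dict:
    """Time `count` over the same dataset for several split settings.

    `open_dataset(steps)` should wrap the actual open, e.g. something like
    `uproot.dask(files, steps_per_file=steps)` (hypothetical wiring), and
    `count(arrays)` should trigger the full compute, e.g. a total event count.
    """
    timings = {}
    for steps in candidates:
        arrays = open_dataset(steps)
        start = time.perf_counter()
        count(arrays)
        timings[steps] = time.perf_counter() - start
    return timings
```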

gordonwatts commented 4 months ago

Running on the 1 TB MC dataset. Here is dask without pre-allocated workers:


0000.0333 - INFO - Using release 22.2.107 for type information.
0000.0338 - INFO - Building ServiceX query
0000.0339 - INFO - Using dataset mc20_13TeV.364157.Sherpa_221_NNPDF30NNLO_Wmunu_MAXHTPTV0_70_CFilterBVeto.deriv.DAOD_PHYSLITE.e5340_s3681_r13145_p6026.
0000.3296 - INFO - Starting ServiceX query
0002.7202 - INFO - Running servicex query for cc92b8c9-546d-4b1c-8eff-44167988c85d took 0:00:00.289982 (no files downloaded)
0002.7297 - INFO - Finished ServiceX query
0002.7304 - INFO - Using `uproot.dask` to open files
0002.9302 - INFO - Generating the dask compute graph for 34 fields
0002.9306 - INFO - Field event_number is not a scalar field. Skipping count.
0002.9309 - INFO - Field run_number is not a scalar field. Skipping count.
0003.0303 - INFO - Number of tasks in the dask graph: 1020
0003.0304 - INFO - Computing the total count
0038.2793 - INFO - Done: result = 1,370,000
gordonwatts commented 4 months ago

And if we pre-allocate 100 workers:


0000.0377 - INFO - Using release 22.2.107 for type information.
0000.0381 - INFO - Building ServiceX query
0000.0382 - INFO - Using dataset mc20_13TeV.364157.Sherpa_221_NNPDF30NNLO_Wmunu_MAXHTPTV0_70_CFilterBVeto.deriv.DAOD_PHYSLITE.e5340_s3681_r13145_p6026.
0000.3251 - INFO - Starting ServiceX query
0002.9942 - INFO - Running servicex query for cc92b8c9-546d-4b1c-8eff-44167988c85d took 0:00:00.310064 (no files downloaded)
0003.0031 - INFO - Finished ServiceX query
0003.0040 - INFO - Using `uproot.dask` to open files
0003.1622 - INFO - Generating the dask compute graph for 34 fields
0003.1628 - INFO - Field event_number is not a scalar field. Skipping count.
0003.1631 - INFO - Field run_number is not a scalar field. Skipping count.
0003.2632 - INFO - Number of tasks in the dask graph: 1020
0003.2633 - INFO - Computing the total count
0020.9112 - INFO - Done: result = 1,370,000

So: not a huge improvement, but it looks a lot better behaved!
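Pre-allocating here amounts to blocking until the scheduler actually has the workers before submitting the graph. A minimal sketch with `dask.distributed` (the scheduler address and the 300 s timeout are placeholders; `Client.wait_for_workers` is the real API):

```python
def ensure_workers(client, n_workers, timeout=300):
    """Block until the scheduler reports at least `n_workers` workers.

    `client` is a dask.distributed Client. The timeout default is an
    arbitrary choice for this sketch.
    """
    client.wait_for_workers(n_workers=n_workers, timeout=timeout)
    return len(client.scheduler_info()["workers"])


# Usage (addresses are placeholders):
# from dask.distributed import Client
# client = Client("tcp://172.16.107.222:8786")
# ensure_workers(client, 100)
```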

gordonwatts commented 4 months ago

Question: why are there only 9 uproot tasks? We have a lot of files here. Is it grabbing only 10 files??

gordonwatts commented 4 months ago

Answer: I forgot to pass `--num-files 0` to run over all files. Will need to add an info message in there!
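A sketch of what that info message could look like, assuming the `--num-files 0 = all files` convention above (the helper name and log wording are hypothetical, not from the repo):

```python
import logging

logger = logging.getLogger(__name__)


def select_files(files, num_files):
    """Apply a --num-files style limit, where 0 means 'use every file'."""
    if num_files == 0:
        logger.info("Running over all %d files", len(files))
        return list(files)
    logger.info("Limiting to first %d of %d files", num_files, len(files))
    return list(files)[:num_files]
```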

gordonwatts commented 4 months ago

Here is a run over the full dataset (about 1000 files) with 100 workers already set up and waiting:

0000.0478 - INFO - Using release 22.2.107 for type information.
0000.0490 - INFO - Building ServiceX query
0000.0492 - INFO - Using dataset mc20_13TeV.364157.Sherpa_221_NNPDF30NNLO_Wmunu_MAXHTPTV0_70_CFilterBVeto.deriv.DAOD_PHYSLITE.e5340_s3681_r13145_p6026.
0001.6245 - INFO - Starting ServiceX query
0009.6245 - INFO - Running servicex query for 9aab9d71-e07d-4760-b013-b9a5decfa1d9 took 0:00:01.278860 (no files downloaded)
0009.6400 - INFO - Finished ServiceX query
0009.6529 - INFO - Using `uproot.dask` to open files
0010.3201 - INFO - Generating the dask compute graph for 34 fields
0010.3216 - INFO - Field event_number is not a scalar field. Skipping count.
0010.3226 - INFO - Field run_number is not a scalar field. Skipping count.
0010.7340 - INFO - Number of tasks in the dask graph: 134078
0010.7342 - INFO - Computing the total count
0203.9237 - INFO - Done: result = 134,858,000

And the performance report for that run, with the 100 workers pre-allocated:


Duration: 190.95 s
Tasks Information
number of tasks: 11304
compute time: 3hr 49m
disk-read time: 482.74 s
disk-write time: 88.93 ms
transfer time: 18m 9s

Scheduler Information
Address: tcp://172.16.107.222:8786
Workers: 100
Threads: 100
Memory: 372.53 GiB
Dask Version: 2024.4.1
Dask.Distributed Version: 2024.4.1
gordonwatts commented 4 months ago

With two splits per file things get a bit better:

0000.0479 - INFO - Using release 22.2.107 for type information.
0000.0487 - INFO - Building ServiceX query
0000.0488 - INFO - Using dataset mc20_13TeV.364157.Sherpa_221_NNPDF30NNLO_Wmunu_MAXHTPTV0_70_CFilterBVeto.deriv.DAOD_PHYSLITE.e5340_s3681_r13145_p6026.
0001.6985 - INFO - Starting ServiceX query
0009.7907 - INFO - Running servicex query for 9aab9d71-e07d-4760-b013-b9a5decfa1d9 took 0:00:01.448535 (no files downloaded)
0009.8488 - INFO - Finished ServiceX query
0009.8631 - INFO - Using `uproot.dask` to open files
0010.6286 - INFO - Generating the dask compute graph for 34 fields
0010.6301 - INFO - Field event_number is not a scalar field. Skipping count.
0010.6311 - INFO - Field run_number is not a scalar field. Skipping count.
0011.0448 - INFO - Number of tasks in the dask graph: 268142
0011.0449 - INFO - Computing the total count
0179.4656 - INFO - Done: result = 134,858,000
gordonwatts commented 4 months ago

And with n=3 it gets worse, so let's leave it at 2 for now.

0000.0460 - INFO - Using release 22.2.107 for type information.
0000.0468 - INFO - Building ServiceX query
0000.0470 - INFO - Using dataset mc20_13TeV.364157.Sherpa_221_NNPDF30NNLO_Wmunu_MAXHTPTV0_70_CFilterBVeto.deriv.DAOD_PHYSLITE.e5340_s3681_r13145_p6026.
0001.6140 - INFO - Starting ServiceX query
0009.2382 - INFO - Running servicex query for 9aab9d71-e07d-4760-b013-b9a5decfa1d9 took 0:00:01.323622 (no files downloaded)
0009.2547 - INFO - Finished ServiceX query
0009.2676 - INFO - Using `uproot.dask` to open files
0010.0131 - INFO - Generating the dask compute graph for 34 fields
0010.0146 - INFO - Field event_number is not a scalar field. Skipping count.
0010.0155 - INFO - Field run_number is not a scalar field. Skipping count.
0010.4416 - INFO - Number of tasks in the dask graph: 402206
0010.4418 - INFO - Computing the total count
0183.4070 - INFO - Done: result = 134,858,000
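Since the log prefixes are elapsed seconds, the compute phase of the three full-dataset runs ("Computing the total count" through "Done") can be compared directly. A quick sketch, with the timestamps copied from the logs above:

```python
def throughput(n_events, start_s, done_s):
    """Events per second, from the elapsed-seconds prefixes in the logs."""
    return n_events / (done_s - start_s)


# (splits, compute start, done) from the n=1, n=2, n=3 runs above
for n, start, done in [(1, 10.7342, 203.9237),
                       (2, 11.0449, 179.4656),
                       (3, 10.4418, 183.4070)]:
    rate = throughput(134_858_000, start, done)
    print(f"n={n}: {rate / 1e6:.2f} M events/s")
```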
gordonwatts commented 4 months ago

Ok - final settings:

| Approach  | n  |
| --------- | -- |
| local     | 20 |
| scheduler | 3  |
| none      | 1  |

This will work for now, I suspect.
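One way to wire those settings in as defaults (a sketch only; the dict keys mirror the "Approach" column above, and the names are hypothetical):

```python
# Tuned steps_per_file per run approach, per the table above.
DEFAULT_STEPS_PER_FILE = {
    "local": 20,      # local cluster: many small chunks keep cores busy
    "scheduler": 3,   # remote scheduler: fewer, larger chunks
    "none": 1,        # no dask scheduler: splitting buys nothing
}


def steps_for(approach):
    """Look up the tuned steps_per_file for a given run approach."""
    return DEFAULT_STEPS_PER_FILE[approach]
```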