iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

50 TB test #68

Closed gordonwatts closed 4 months ago

gordonwatts commented 4 months ago

Ready for a week 5 data test

See if we can understand #52 (river pods performing worse) - or whether it is still true.

A major new thing for running against the raw dataset: we are reading out a large amount of the data.

gordonwatts commented 4 months ago

Ran out of space in S3 - there is only 15 TB available, and our 50 TB test needs 10 TB of it; the 200 TB test will need 40 TB!

And we have a data rate limit of around 40 Gbps.
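
As a back-of-envelope check of those numbers (a sketch only: the ~20% output fraction is inferred from "50 TB test needs 10 TB", and nothing below is project code):

```python
# Back-of-envelope estimate of output volume and read time for these tests.
# The 20% output fraction is inferred from "our 50 TB test needs 10 TB";
# treat it, and this helper, as illustrative assumptions.

def plan(input_tb: float, output_fraction: float = 0.20,
         s3_capacity_tb: float = 15.0, rate_gbps: float = 40.0) -> None:
    output_tb = input_tb * output_fraction
    fits = "fits" if output_tb <= s3_capacity_tb else "does NOT fit"
    # 1 TB = 8000 Gb, so hours = (TB * 8000) / (Gbps * 3600 s/h)
    read_hours = input_tb * 8000.0 / (rate_gbps * 3600.0)
    print(f"{input_tb:.0f} TB in -> {output_tb:.0f} TB out "
          f"({fits} in {s3_capacity_tb:.0f} TB of S3); "
          f"reading at {rate_gbps:.0f} Gbps takes ~{read_hours:.1f} h")

for size_tb in (50.0, 200.0):
    plan(size_tb)
```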

image

And finally, around 800 workers we start getting networking errors.

gordonwatts commented 4 months ago

A complete run without errors:

image

Several things:

gordonwatts commented 4 months ago

Longer term: pod scaling needs to be re-done.

gordonwatts commented 4 months ago

Now that the data is being skimmed hard, we are seeing much better performance.

image

That is with 1000 pods (100 Gbps).

gordonwatts commented 4 months ago

The servicex app is under stress when this is happening:

image

We have 5 cores allocated to do the work, and it is basically using them all.

gordonwatts commented 4 months ago

Switched to using two pods for the app.

gordonwatts commented 4 months ago

Done:

image

image

gordonwatts commented 4 months ago

Final size in S3: 502.2 GB. Number of files in S3: 64.7K.

In short - it dropped very few files this time (if any)!
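
For reference, a minimal sketch of how one could tally the size and object count of the output bucket (boto3 against a generic S3-compatible endpoint is an assumption here, and the endpoint and bucket names are placeholders, not the ones used in this test):

```python
# Sketch: total size and object count for an S3 output bucket.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.org")
paginator = s3.get_paginator("list_objects_v2")

total_bytes = 0
n_objects = 0
for page in paginator.paginate(Bucket="my-output-bucket"):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
        n_objects += 1

print(f"{n_objects} objects, {total_bytes / 1e9:.1f} GB")
```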

gordonwatts commented 4 months ago

Second test, with a similar number of transformers, but... now we have 2 servicex pods and things look better:

image

gordonwatts commented 4 months ago

And the database is now running on NVMe:

image

gordonwatts commented 4 months ago

Using 1000 from AF and 500 from river, until the end when river went up to 1000:

image

So, 130 Gbps consistent!!

image

gordonwatts commented 4 months ago

Going to try an even higher cut. 100 GeV now.
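
For context, a sketch of what a tightened skim cut can look like, assuming the func_adl PHYSLITE query interface (FuncADLQueryPHYSLITE from func_adl_servicex_xaodr22); the jet collection and output columns are illustrative, and only the 100 GeV threshold comes from this comment:

```python
# Illustrative func_adl PHYSLITE query with a 100 GeV jet-pT cut. The choice
# of collection and returned columns is an assumption for this sketch; it is
# not the query actually used in the test.
from func_adl_servicex_xaodr22 import FuncADLQueryPHYSLITE

query = (
    FuncADLQueryPHYSLITE()
    .Select(lambda e: e.Jets())
    .Select(lambda jets: {
        # xAOD pt() is in MeV, so divide by 1000 to compare in GeV
        "jet_pt": jets.Where(lambda j: j.pt() / 1000.0 > 100.0)
                      .Select(lambda j: j.pt() / 1000.0),
    })
)
```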

gordonwatts commented 4 months ago

image

Had 1500 in AF and 1000 in river by the end. Didn't see the 130 Gbps rate from before.

gordonwatts commented 4 months ago

The next question is - how to proceed with SX testing. The following is my opinion, but others should feel free to jump in!

Things that we should have before running another large scale test:

- Scripts to run on multiple datasets at once.
- Retry implemented in the backend side-car that pushes data to S3 (we are seeing too many failures getting the data files in), and making sure any failures are logged (for now at least to the log collector, but eventually as a transformation failure). A generic sketch of this kind of retry follows these lists.
- Front-end retry when querying the transform after submitting it to get the request id - or fix the 5 second timeout. Many transforms are lost because the servicex app seems to be "frozen" dealing with 64K files. This could also be a bug fix in the servicex_app for when the datafiles are already cached.
- The DASK processing error on empty files is not properly dealt with (at least, I think that is the error I see on highly skimmed files).

That list includes things in the IDAP script, the servicex_frontend, and the core code of servicex itself. :confused:

Once we have those things fixed, I'd be ready to run new large scale SX tests.

Other things that would be nice to have fixed:

- Retry on transformer startup - the transformer gets up and running before the side-car does, and when it tries to contact the side-car it fails, which leads to a restart, which leads to 10-20 seconds of cycle time, which is hours of cycle time when running 2000 pods.
- A command line addition to the servicex command that returns the size and number of files in a bucket for a request id.
- A way to "save" a cache key (hash) / request-id so that you can "remember" a query as you move from one place to another. This could be command line; another option is to do this in the servicex_app.
- Understand how the xaod transformer is compressing output files, and if it isn't using ZSTD, convert to using that.
- Add code that knows the number of events in each sample to produce a "Hz" measurement.
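
On the S3 retry point above, a generic sketch of the kind of retry-with-backoff being asked for (boto3, the endpoint, and all names here are illustrative assumptions; this is not the actual ServiceX side-car code):

```python
# Sketch: retry an S3 upload with exponential backoff instead of dropping the
# file silently. Endpoint, bucket, and function names are placeholders.
import time

import boto3
from botocore.exceptions import BotoCoreError, ClientError

s3 = boto3.client("s3", endpoint_url="https://s3.example.org")


def upload_with_retry(path: str, bucket: str, key: str, attempts: int = 5) -> None:
    for attempt in range(1, attempts + 1):
        try:
            s3.upload_file(path, bucket, key)
            return
        except (BotoCoreError, ClientError):
            if attempt == attempts:
                # Surface the failure so it is logged (and eventually reported
                # as a transformation failure) rather than silently dropped.
                raise
            time.sleep(2 ** attempt)  # back off: 2, 4, 8, ... seconds
```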

gordonwatts commented 4 months ago

Finally, some things @Ilija Vukotic and I learned about the system (Ilija, please add more conclusions!)

  1. If you ask SX to do a straight copy of the data you read in, it doesn't really do well. In short - SX was designed to skim and thin and write things out. Do that.
  2. Compressing the output data takes a significant amount of time. It was what was keeping us at a maximum of 45 Gbps; removing it took us to 130 Gbps. (A small ZSTD-writing sketch follows this list.)
  3. Postgres with the DB on NVMe seemed to be able to handle the load. We needed two pods running the servicex_app to keep up.
  4. There are bugs in dealing with very large datasets that make it impossible to run the full test currently.
  5. If your skim efficiency is low enough, it looks like you don't need to read whole baskets of the detailed data - and that reduces the read rate, which improves things overall.
  6. Improvements and modifications to the DNS, and a reduction in the data we were pushing to S3, mean that river was now able to run seemingly without errors.

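Related to the compression point in item 2 (and the ZSTD item in the earlier wish list), a minimal sketch of writing ROOT output with ZSTD compression via uproot; the file name, tree name, compression level, and toy data are all assumptions, not the transformer's actual output:

```python
# Sketch: write a ROOT file with ZSTD compression using uproot.
# File/tree names, the compression level, and the toy data are illustrative.
import awkward as ak
import uproot

events = ak.Array({"jet_pt": [[125.3, 101.7], [], [240.1]]})  # toy jagged data

with uproot.recreate("skim_output.root", compression=uproot.ZSTD(3)) as fout:
    fout["mytree"] = {"jet_pt": events["jet_pt"]}
```
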
gordonwatts commented 4 months ago

This test has now "finished"... we need to make some changes before running another large test.