iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

50 TB test #68

Closed gordonwatts closed 4 months ago

gordonwatts commented 4 months ago

Ready for a week 5 data test

See if we can understand #52 (river pods performing worse) - or whether it is still true.

A major new thing for running against the raw dataset: we are reading out a large amount of the data.

gordonwatts commented 4 months ago

Ran out of space in S3 - there is only 15 TB available, and our 50 TB test needs 10 TB of it; the 200 TB test will need 40 TB!

And we have a data rate limit of around 40 Gbps.
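
As a back-of-envelope check of those numbers (a sketch only: the ~20% output fraction is inferred from "50 TB test needs 10 TB", and nothing below is project code):

```python
# Back-of-envelope estimate of output volume and read time for these tests.
# The 20% output fraction is inferred from "our 50 TB test needs 10 TB";
# treat it, and this helper, as illustrative assumptions.

def plan(input_tb: float, output_fraction: float = 0.20,
         s3_capacity_tb: float = 15.0, rate_gbps: float = 40.0) -> None:
    output_tb = input_tb * output_fraction
    fits = "fits" if output_tb <= s3_capacity_tb else "does NOT fit"
    # 1 TB = 8000 Gb, so hours = (TB * 8000) / (Gbps * 3600 s/h)
    read_hours = input_tb * 8000.0 / (rate_gbps * 3600.0)
    print(f"{input_tb:.0f} TB in -> {output_tb:.0f} TB out "
          f"({fits} in {s3_capacity_tb:.0f} TB of S3); "
          f"reading at {rate_gbps:.0f} Gbps takes ~{read_hours:.1f} h")

for size_tb in (50.0, 200.0):
    plan(size_tb)
```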

image

And finally, around 800 workers we start getting networking errors.

gordonwatts commented 4 months ago

A complete run without errors:

image

Several things:

gordonwatts commented 4 months ago

Longer term: pod scaling needs to be re-done.

gordonwatts commented 4 months ago

Now that the data is being skimmed hard, we are seeing much better performance.

image

That is with 1000 pods (100 Gbps).

gordonwatts commented 4 months ago

The servicex app is under stress when this is happening:

image

We have 5 cores allocated to do the work, and it is basically using them all.

gordonwatts commented 4 months ago

Switched to using two pods for the app.

gordonwatts commented 4 months ago

Done:

image

image

gordonwatts commented 4 months ago

Final size in S3: 502.2 GB. Number of files in S3: 64.7K.

In short - it dropped very few files this time (if any)!
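
For reference, a minimal sketch of how one could tally the size and object count of the output bucket (boto3 against a generic S3-compatible endpoint is an assumption here, and the endpoint and bucket names are placeholders, not the ones used in this test):

```python
# Sketch: total size and object count for an S3 output bucket.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.org")
paginator = s3.get_paginator("list_objects_v2")

total_bytes = 0
n_objects = 0
for page in paginator.paginate(Bucket="my-output-bucket"):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
        n_objects += 1

print(f"{n_objects} objects, {total_bytes / 1e9:.1f} GB")
```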

gordonwatts commented 4 months ago

Second test, with a similar number of transformers, but... now we have 2 servicex pods and things look better:

image

gordonwatts commented 4 months ago

And the database is now running on NVMe:

image

gordonwatts commented 4 months ago

Using 1000 from AF and 500 from river, until the end when river went up to 1000:

image

So, 130 Gbps consistent!!

image

gordonwatts commented 4 months ago

Going to try an even higher cut. 100 GeV now.
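
For context, a sketch of what a tightened skim cut can look like, assuming the func_adl PHYSLITE query interface (FuncADLQueryPHYSLITE from func_adl_servicex_xaodr22); the jet collection and output columns are illustrative, and only the 100 GeV threshold comes from this comment:

```python
# Illustrative func_adl PHYSLITE query with a 100 GeV jet-pT cut. The choice
# of collection and returned columns is an assumption for this sketch; it is
# not the query actually used in the test.
from func_adl_servicex_xaodr22 import FuncADLQueryPHYSLITE

query = (
    FuncADLQueryPHYSLITE()
    .Select(lambda e: e.Jets())
    .Select(lambda jets: {
        # xAOD pt() is in MeV, so divide by 1000 to compare in GeV
        "jet_pt": jets.Where(lambda j: j.pt() / 1000.0 > 100.0)
                      .Select(lambda j: j.pt() / 1000.0),
    })
)
```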

gordonwatts commented 4 months ago

image

Had 1500 in AF and 1000 in river by the end. Didn't see the 130 Gbps rate from before.

gordonwatts commented 4 months ago

The next question is - how to proceed with SX testing. The following is my opinion, but others should feel free to jump in!

Things that we should have before running another large scale test:

- Scripts to run on multiple datasets at once.
- Retry implemented in the backend side-car that pushes data to S3 (we are seeing too many failures getting the data files in), and making sure any failures are logged (for now at least to the log collector, but eventually as a transformation failure). A generic sketch of this kind of retry follows these lists.
- Front-end retry when querying the transform after submitting it to get the request id - or fix the 5 second timeout. Many transforms are lost because the servicex app seems to be "frozen" dealing with 64K files. This could also be a bug fix in the servicex_app for when the datafiles are already cached.
- The DASK processing error on empty files is not properly dealt with (at least, I think that is the error I see on highly skimmed files).

That list includes things in the IDAP script, the servicex_frontend, and the core code of servicex itself. :confused:

Once we have those things fixed, I'd be ready to run new large scale SX tests.

Other things that would be nice to have fixed:

- Retry on transformer startup - the transformer gets up and running before the side-car does, and when it tries to contact the side-car it fails, which leads to a restart, which leads to 10-20 seconds of cycle time, which is hours of cycle time when running 2000 pods.
- A command line addition to the servicex command that returns the size and number of files in a bucket for a request id.
- A way to "save" a cache key (hash) / request-id so that you can "remember" a query as you move from one place to another. This could be command line; another option is to do this in the servicex_app.
- Understand how the xaod transformer is compressing output files, and if it isn't using ZSTD, convert to using that.
- Add code that knows the number of events in each sample to produce a "Hz" measurement.
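
On the S3 retry point above, a generic sketch of the kind of retry-with-backoff being asked for (boto3, the endpoint, and all names here are illustrative assumptions; this is not the actual ServiceX side-car code):

```python
# Sketch: retry an S3 upload with exponential backoff instead of dropping the
# file silently. Endpoint, bucket, and function names are placeholders.
import time

import boto3
from botocore.exceptions import BotoCoreError, ClientError

s3 = boto3.client("s3", endpoint_url="https://s3.example.org")


def upload_with_retry(path: str, bucket: str, key: str, attempts: int = 5) -> None:
    for attempt in range(1, attempts + 1):
        try:
            s3.upload_file(path, bucket, key)
            return
        except (BotoCoreError, ClientError):
            if attempt == attempts:
                # Surface the failure so it is logged (and eventually reported
                # as a transformation failure) rather than silently dropped.
                raise
            time.sleep(2 ** attempt)  # back off: 2, 4, 8, ... seconds
```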

gordonwatts commented 4 months ago

Finally, some things @Ilija Vukotic and I learned about the system (Ilija, please add more conclusions!)

  1. If you ask SX to do a straight copy of the data you read in, it doesn't really do well. In short - SX was designed to skim and thin and write things out. Do that.
  2. Compressing the output data takes a significant amount of time. It was what was keeping us at a maximum of 45 Gbps; removing it took us to 130 Gbps. (A small ZSTD-writing sketch follows this list.)
  3. Postgres with the DB on NVMe seemed to be able to handle the load. We needed two pods running the servicex_app to keep up.
  4. There are bugs in dealing with very large datasets that make it impossible to run the full test currently.
  5. If your skim efficiency is low enough, it looks like you don't need to read whole baskets of the detailed data - and that reduces the read rate, which improves things overall.
  6. Improvements and modifications to the DNS, and a reduction in the data we were pushing to S3, mean that river was now able to run seemingly without errors.

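Related to the compression point in item 2 (and the ZSTD item in the earlier wish list), a minimal sketch of writing ROOT output with ZSTD compression via uproot; the file name, tree name, compression level, and toy data are all assumptions, not the transformer's actual output:

```python
# Sketch: write a ROOT file with ZSTD compression using uproot.
# File/tree names, the compression level, and the toy data are illustrative.
import awkward as ak
import uproot

events = ak.Array({"jet_pt": [[125.3, 101.7], [], [240.1]]})  # toy jagged data

with uproot.recreate("skim_output.root", compression=uproot.ZSTD(3)) as fout:
    fout["mytree"] = {"jet_pt": events["jet_pt"]}
```
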
gordonwatts commented 4 months ago

This test has now "finished"... we need to make some changes before running another large test.