Ran out of space in S3 - there is only 15 TB, and our 50 TB test needs 10 TB of it - the 200 TB test will need 40 TB!
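For reference, the 40 TB figure is just the skim fraction observed in the 50 TB test extrapolated to the larger input - a quick back-of-the-envelope using only the numbers above:

```python
# Back-of-the-envelope extrapolation using the numbers quoted above.
input_tb, output_tb = 50, 10          # the 50 TB test wrote 10 TB to S3
skim_fraction = output_tb / input_tb  # ~20% of the input survives the skim

needed_tb = 200 * skim_fraction       # 40 TB for the 200 TB test
print(f"200 TB test needs ~{needed_tb:.0f} TB of S3 (only 15 TB available)")
```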
And we have a data rate limit of around 40 Gbps.
And finally, at around 800 workers we start getting networking errors.
A complete run without errors:
Several things:
Longer term: pod scaling needs to be redone.
Now that the data is being skimmed hard, we are seeing much better performance.
That is with 1000 pods (100 Gbps).
The servicex app is under stress when this is happening:
We have 5 cores allocated to do the work, and it is basically using them all.
Switched to using two pods for the app.
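In practice this was just a replica-count change on the app's deployment. A minimal sketch with the Kubernetes Python client - the deployment name and namespace here are assumptions, not the real chart values, and a helm values change would work just as well:

```python
# Scale the ServiceX app deployment to two replicas.
# "servicex-app" / "servicex" are assumed names, not taken from the real chart.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
apps.patch_namespaced_deployment(
    name="servicex-app",
    namespace="servicex",
    body={"spec": {"replicas": 2}},
)
```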
Done:
Final size in S3: 502.2 GB
Number of files in S3: 64.7K
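Those two numbers can be measured straight from the bucket; a minimal boto3 sketch, where the endpoint URL and the bucket-per-request-id convention are assumptions:

```python
# Sum object sizes and count objects for one request's output bucket.
# The endpoint URL and bucket naming convention are assumptions here.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.org")  # hypothetical endpoint
paginator = s3.get_paginator("list_objects_v2")

total_bytes, n_files = 0, 0
for page in paginator.paginate(Bucket="request-id-bucket"):  # assumed name
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
        n_files += 1

print(f"{total_bytes / 1e9:.1f} GB in {n_files} files")
```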
In short - it dropped very few files this time (if any)!
Second test, with a similar number of transformers, but... now we have 2 servicex pods and things look better:
And the database is now running on NVMe:
Using 1000 workers from AF and 500 from river, until the end, when river went up to 1000:
So, 130 Gbps consistent!!
Going to try an even higher cut. 100 GeV now.
Had 1500 in AF and 1000 in river by the end. Didn't see the 130 Gbps rate from before.
The next question is - how to proceed with SX testing. The following is my opinion, but others should feel free to jump in!

Things that we should have before running another large-scale test:

- Scripts to run on multiple datasets at once.
- Retry implemented in the backend side-car that pushes data to S3 (we are seeing too many failures to get the data files in), and making sure any failures are logged (for now at least to the log collector, but eventually as a transformation failure). See the sketch after this list.
- Front-end retry when querying the transform after submitting it to get the request id - or fix the 5 second timeout - many transforms are lost because the servicex app seems to be "frozen" dealing with 64K files. This could also be a bug fix in the servicex_app for when the datafiles are already cached.
- The DASK processing error on empty files is not properly dealt with (at least, I think that is the error I see on highly skimmed files).

That list includes things in the IDAP script, the servicex_frontend, and in the core code of servicex itself. :confused: Once we have those things fixed, I'd be ready to run new large-scale SX tests.

Other things that would be nice to have fixed:

- Retry on transformer startup - the transformer gets up and running before the side-car does, and when it tries to contact the side-car it fails, which leads to a restart, which leads to 10-20 seconds of cycle time, which is hours of cycle time when running 2000 pods.
- A command-line addition to the servicex command that returns the size and number of files in a bucket for a request id.
- A way to "save" a cache key (hash)/request-id so that you can "remember" a query as you move from one place to another. This could be command line. Other option: do this in the servicex_app.
- Understand the way the xaod transformer compresses output files, and if it isn't using ZSTD, convert to using that.
- Add code that knows the number of events in each sample to produce a "Hz" measurement.
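For the first retry item, here is a minimal sketch of the retry-with-exponential-backoff behaviour the side-car could use when pushing files to S3 - the client setup, bucket/key names, and logging wiring are assumptions for illustration, not the actual side-car code:

```python
# A sketch of retry-with-backoff for the side-car's S3 uploads.
# Endpoint, bucket, and key names below are assumptions for illustration.
import logging
import time

import boto3
from boto3.exceptions import S3UploadFailedError
from botocore.exceptions import BotoCoreError, ClientError

log = logging.getLogger("sidecar.s3")


def upload_with_retry(s3_client, local_path, bucket, key,
                      max_attempts=5, base_delay=1.0):
    """Upload one file, retrying with exponential backoff; log every failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            s3_client.upload_file(local_path, bucket, key)
            return
        except (BotoCoreError, ClientError, S3UploadFailedError) as err:
            # Log to the collector now; eventually this should be surfaced
            # as a transformation failure rather than silently dropped.
            log.warning("S3 upload of %s failed (attempt %d/%d): %s",
                        key, attempt, max_attempts, err)
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


s3 = boto3.client("s3", endpoint_url="https://s3.example.org")  # hypothetical endpoint
upload_with_retry(s3, "out.root", "transform-output", "request-id/out.root")
```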
Finally, some things @Ilija Vukotic and I learned about the system (Ilija, please add more conclusions!)
This test has now "finished"... we need to make some changes before running another large test.
Ready for a week 5 data test
See if we can understand #52 (river pods worse) - or if it is still true.
A major new thing about running against the raw dataset: we read out a large amount of the data.