iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

Vector read errors through XCache #96

Open alexander-held opened 2 months ago

alexander-held commented 2 months ago

We see some reproducible errors that look like

OSError: File did not vector_read properly: [ERROR] Server responded with an error: [3000] Read vector is invalid

when reading specific branches. They seem to specifically happen when reading through XCache and disappear otherwise. Reproducer below, requires x509 credentials.

import uproot

treename = "CollectionTree"

# without XCache, this works
fname = "root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/mc20_13TeV/f5/99/DAOD_PHYSLITE.37230013._001196.pool.root.1"

with uproot.open({fname: treename}) as f:
    f["AnalysisJetsAuxDyn.SumPtChargedPFOPt500"].array()

# this one breaks
fname = "root://192.170.240.148//root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/mc20_13TeV/f5/99/DAOD_PHYSLITE.37230013._001196.pool.root.1"

with uproot.open({fname: treename}) as f:
    f["AnalysisJetsAuxDyn.SumPtChargedPFOPt500"].array()
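To make the comparison between the two access paths explicit, here is a minimal sketch that builds the XCache URL from the direct one and tries a branch read on each. The helper names `via_xcache` and `try_read` are hypothetical, and the sketch assumes the cache is addressed as `root://<cache host>//<origin URL>`, as in the reproducer above.

```python
def via_xcache(cache_host, origin_url):
    """Prefix an origin root:// URL with an XCache host, as in the reproducer."""
    return f"root://{cache_host}//{origin_url}"


def try_read(fname, treename, branch):
    """Return None if the branch reads cleanly, otherwise the OSError message."""
    import uproot  # imported lazily so the URL helper works without uproot

    try:
        with uproot.open({fname: treename}) as tree:
            tree[branch].array()
    except OSError as err:
        return str(err)
    return None


direct = (
    "root://fax.mwt2.org:1094//pnfs/uchicago.edu/atlaslocalgroupdisk/rucio/"
    "mc20_13TeV/f5/99/DAOD_PHYSLITE.37230013._001196.pool.root.1"
)
cached = via_xcache("192.170.240.148", direct)
```

With valid x509 credentials, calling `try_read(url, "CollectionTree", "AnalysisJetsAuxDyn.SumPtChargedPFOPt500")` for `direct` and then `cached` would be expected to show the error message only for the second URL.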
matthewfeickert commented 2 months ago

> Reproducer below, requires x509 credentials.

This comment is just to add additional context/instructions (though probably anyone reading this already knows the following)

To get your x509 proxy, probably the easiest way is to use the ATLAS VOMS proxy on a machine that has setupATLAS available through CVMFS

$ setupATLAS
$ lsetup rucio
$ voms-proxy-init -voms atlas

The proxy file path is stored in the X509_USER_PROXY environment variable

$ echo $X509_USER_PROXY
/tmp/x509up_u23203

so the proxy is relocatable: inside a mounted container (like AB-dev), the variables X509_USER_PROXY and X509_CERT_DIR can be set to use it.

A simple way to make these findable by your user in the mounted container (on, say, the UChicago AF) is to copy the credentials to your user area

# On SSH server (e.g. UChicago AF)
$ env | grep X509 &> /tmp/x509_env_variables.txt
$ cp /tmp/x509_env_variables.txt ~/
$ cp "${X509_USER_PROXY}" ~/
# When done, remember to delete ~/x509*

and then copy them into the container and set all the environment variables

# In the mounted container
$ cp ~/x509* /tmp/
$ while IFS= read -r line; do export "${line}"; done < /tmp/x509_env_variables.txt
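If it is more convenient to set the variables from within a Python session than in the shell, the same file can be loaded into the process environment. This is only a sketch: `load_env_file` is a hypothetical helper, and it assumes the file holds one `NAME=value` pair per line, which is what `env | grep X509` writes.

```python
import os


def load_env_file(path, environ=None):
    """Load NAME=value lines from `path` into `environ` (default: os.environ)."""
    if environ is None:
        environ = os.environ
    loaded = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if "=" not in line:
                continue  # skip blank or malformed lines
            # Split on the first "=" only, so values may contain "=" themselves.
            name, _, value = line.partition("=")
            environ[name] = value
            loaded[name] = value
    return loaded
```

Calling `load_env_file("/tmp/x509_env_variables.txt")` before opening any `root://` URL should leave `X509_USER_PROXY` and `X509_CERT_DIR` set for the rest of the session.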
traceback of python /tmp/issue-96.py:

```pytb
Traceback (most recent call last):
  File "/tmp/issue-96.py", line 15, in <module>
    f["AnalysisJetsAuxDyn.SumPtChargedPFOPt500"].array()
  File "/venv/lib/python3.9/site-packages/uproot/behaviors/TBranch.py", line 1819, in array
    _ranges_or_baskets_to_arrays(
  File "/venv/lib/python3.9/site-packages/uproot/behaviors/TBranch.py", line 3105, in _ranges_or_baskets_to_arrays
    uproot.source.futures.delayed_raise(*obj)
  File "/venv/lib/python3.9/site-packages/uproot/source/futures.py", line 38, in delayed_raise
    raise exception_value.with_traceback(traceback)
  File "/venv/lib/python3.9/site-packages/uproot/behaviors/TBranch.py", line 3026, in chunk_to_basket
    basket = uproot.models.TBasket.Model_TBasket.read(
  File "/venv/lib/python3.9/site-packages/uproot/model.py", line 854, in read
    self.read_members(chunk, cursor, context, file)
  File "/venv/lib/python3.9/site-packages/uproot/models/TBasket.py", line 227, in read_members
    ) = cursor.fields(chunk, _tbasket_format1, context)
  File "/venv/lib/python3.9/site-packages/uproot/source/cursor.py", line 201, in fields
    return format.unpack(chunk.get(start, stop, self, context))
  File "/venv/lib/python3.9/site-packages/uproot/source/chunk.py", line 446, in get
    self.wait(insist=stop)
  File "/venv/lib/python3.9/site-packages/uproot/source/chunk.py", line 388, in wait
    self._raw_data = numpy.frombuffer(self._future.result(), dtype=self._dtype)
  File "/venv/lib/python3.9/site-packages/uproot/source/fsspec.py", line 28, in result
    return self._parent.result(timeout=timeout)[self._part_index]
  File "/usr/AnalysisBaseExternals/25.2.2/InstallArea/x86_64-el9-gcc13-opt/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/AnalysisBaseExternals/25.2.2/InstallArea/x86_64-el9-gcc13-opt/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/venv/lib/python3.9/site-packages/fsspec_xrootd/xrootd.py", line 641, in _cat_ranges
    results = await _run_coros_in_chunks(coros, batch_size=batch_size, nofiles=True)
  File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 268, in _run_coros_in_chunks
    result, k = await done.pop()
  File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 245, in _run_coro
    return await asyncio.wait_for(coro, timeout=timeout), i
  File "/usr/AnalysisBaseExternals/25.2.2/InstallArea/x86_64-el9-gcc13-opt/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
    return await fut
  File "/venv/lib/python3.9/site-packages/fsspec_xrootd/xrootd.py", line 601, in _cat_vector_read
    raise OSError(f"File did not vector_read properly: {status.message}")
OSError: File did not vector_read properly: [ERROR] Server responded with an error: [3000] Read vector is invalid
```
alexander-held commented 2 months ago

This is resolved by the latest uproot RC, as the vector read is replaced by a new default with read coalescing. It might still be interesting / useful to understand the cause, but I'd say this at least lowers the priority of the issue.
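
For readers unfamiliar with the term, the general idea of read coalescing can be sketched as follows: nearby byte ranges are merged into fewer, larger contiguous reads, and the originally requested pieces are sliced back out afterwards. This is a generic illustration, not uproot's actual implementation; the function names and the `max_gap` parameter are hypothetical.

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge (start, stop) byte ranges whose gap is at most `max_gap` bytes."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def split_back(requested, merged, blobs):
    """Recover the bytes of each requested range from the merged reads."""
    out = []
    for start, stop in requested:
        for (m_start, m_stop), blob in zip(merged, blobs):
            if m_start <= start and stop <= m_stop:
                out.append(blob[start - m_start : stop - m_start])
                break
    return out


requested = [(0, 10), (12, 20), (100, 110)]
merged = coalesce_ranges(requested, max_gap=4)
print(merged)  # [(0, 20), (100, 110)]
```

Each merged range is an ordinary contiguous read, so a client working this way would not need to send a read vector to the server at all.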

ponyisi commented 3 weeks ago

I'm actually kind of surprised here because I routinely had vector reads fail in the opposite way (no problem with xcache, fail with direct read from dCache). But as you say the latest uproot should address this.