rucio / rucio

Rucio - Scientific Data Management
http://rucio.cern.ch
Apache License 2.0

DIRAC-on-Rucio #1808

Closed: gabrielefronze closed this issue 2 years ago

gabrielefronze commented 5 years ago

The LIGO and Virgo communities are adopting Rucio to manage the storage elements of their infrastructures. While LIGO has access to proprietary HPC clusters, Virgo does not, and instead relies on a set of academic computing centers. The characteristics of the latter, as well as some previous choices, strongly point to a wide adoption of DIRAC in Virgo, while interoperability with LIGO requires Rucio as the storage manager/orchestrator.

After some discussion we (maybe) found a nice solution. Instead of developing a DIRAC plugin that interfaces it to Rucio in a POSIX-like manner, we think the best option is to create a "DIRAC mode" for the Rucio catalog. This follows a comment made by many people: DIRAC was born to create a uniform interface to different Grid implementations, but ended up managing both Computing Elements (CEs) and Storage Elements (SEs). That choice was made to make DIRAC aware of the geographical position of data, in order to minimize data transfers. Since DIRAC is used by a rather small community, not many organizations are likely to invest in developing an integration. Rucio, on the other hand, is much more appealing, so finding a way to keep enough topological information in Rucio's external catalog to keep DIRAC happy and efficient might be the way to go.

In addition, we discovered that DIRAC can be given a custom catalog, and some Virgo people have already performed tests in that direction, creating an LFC catalog dump and running a DIRAC instance on that data. In fact, such a solution might scale up to decoupling DIRAC's storage function from its computing function, which would benefit the DIRAC product itself and could bring in some of its developers to assist in the process. It is worth mentioning that DIRAC jobs can be any kind of executable, from an sh script to an executable available cluster-wide (e.g. firefox...). Since reading Rucio-managed files should be supported out of the box by DIRAC, a plugin is only needed to register the output files of jobs on Rucio. However, since DIRAC can run basic scripts, the registration of output files on Rucio might at first be handled by by-hand calls to the Rucio API from within the DIRAC job. Think of a compiled C++ executable that produces a `myfile.root` file when run on a shell as `myexe myfile.root`: wrapping it in something like `myexe myfile.root && <rucio_API_call> myfile.root` should do the basic trick.
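To make the wrapping idea concrete, here is a minimal sketch of such a job payload, written in Python instead of shell; `myexe`, `VIRGO_SE` and the `user.gfronze` scope are placeholders, and it assumes the `rucio` client is installed on the worker node:

```python
#!/usr/bin/env python
"""Hypothetical wrapper for a DIRAC job payload: run the executable,
then register its output on Rucio by hand. The executable, RSE and
scope names below are placeholders."""
import subprocess
import sys

OUTPUT = "myfile.root"

# Run the actual payload; propagate a failure instead of registering junk.
payload = subprocess.run(["./myexe", OUTPUT])
if payload.returncode != 0:
    sys.exit(payload.returncode)

# By-hand registration: `rucio upload` copies the file to the target RSE
# and registers the replica in the Rucio catalog in one step.
subprocess.run(
    ["rucio", "upload", "--rse", "VIRGO_SE", "--scope", "user.gfronze", OUTPUT],
    check=True,
)
```

Since `rucio upload` both transfers and registers the file, this covers the output-registration step without any DIRAC-side plugin.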

bbockelm commented 5 years ago

Just a small note as I've been working with LIGO to utilize OSG and EGI-based computing centers for multiple years now. While "proprietary HPC clusters" characterizes at least one LIGO resource, there's actually a wide range of resources available.

Regardless, that's neither here nor there. The rest of the post looks good!

gabrielefronze commented 5 years ago

Hi Brian,

Thanks for the heads up. Indeed, I just intended to highlight the differences between our two computing infrastructures, not to be "detailed"! :) Cheers!

Gabriele

brucellino commented 5 years ago

Hi guys, fascinating discussion; this is something we have mulled over in @EGI-Foundation for a while as well. It would be out of place to make sweeping statements about DIRAC without the developers involved, so maybe this issue could be pointed out to them, if that hasn't already been done.

My 2c is that there are two patterns right now in developing these platforms:

  1. Build a core product, discover need for some other functionality, tack it on, ???, Profit!!!
  2. Build a core product, discover need for some other functionality, set up a contract with another set of services to do that, ??? Profit

We have often looked at DIRAC as an HTC solution, but it's way more than that, and just using it as an HTC solution is actually quite hard. I would hazard that it works best when it's the primary interface for users and applications.

Rucio, on the other hand, is (forgive me for projecting my own perception here) a fantastic data management system. It could (does?) tack on compute management as well. As a product, we (say, EGI) would like it to interoperate with other services (cloud compute, HTC, HPC, etc.) via stable APIs and do its data management thing.

It would be nice to know whether Rucio could be used as a drop-in replacement data catalogue for DIRAC, and even more interesting to know whether DIRAC could be used as a drop-in compute orchestration service for Rucio. My personal feeling is that something that does compute orchestration only would be a better fit (maybe HTCondor; I don't have a great answer here, sorry).

Thanks! (usual disclaimer of "these opinions are mine and mine alone", "this does not represent the position of EGI, EGI Foundation etc" apply here :wink: )

gabrielefronze commented 5 years ago

Hi Bruce,

I am pointing some DIRAC people to this issue! Cheers,

Gabriele

fstagni commented 5 years ago

Hi, I am the DIRAC technical coordinator and currently its main developer. I've been pointed here; I will try to give some advice.

As mentioned above, DIRAC gives you the possibility to work with different, and even multiple, catalogs. Just to mention some real-life use cases, which are the ones working best:

The DFC, the LFC, AMGA, and the LHCb Bookkeeping are all "Catalogs". In DIRAC terminology they are in fact all Catalog plug-ins: a DIRAC Catalog is such if it implements the same interface (e.g. add file, remove file, etc.). All catalogs implement the same interface and inherit from https://github.com/DIRACGrid/DIRAC/blob/integration/Resources/Catalog/FileCatalogClientBase.py. You can have more than one catalog at the same time, as is obvious from the examples above; in this case, the operations will be executed on all of them. So, for example, you can register files on BOTH the LFC and the DFC at the same time. Basically, each catalog plug-in implements a given operation (e.g. addFile) following its own "interpretation" of what, e.g., adding a file means for that catalog.

So, what may be interesting for you is implementing a RucioCatalogClient.py. The rest is purely configuration.
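For orientation, a rough skeleton of what such a plug-in might look like under the interface described above; the class attributes, the fixed "virgo" scope and the LFN-to-DID mapping are illustrative guesses, not the eventual implementation (which landed later, see the end of this thread):

```python
"""Rough skeleton of a Rucio catalog plug-in, assuming the interface
described above. Import paths follow the DIRAC tree linked earlier; the
LFN-to-DID translation and error handling are placeholders, not a
working implementation."""
from DIRAC import S_OK
from DIRAC.Resources.Catalog.FileCatalogClientBase import FileCatalogClientBase

from rucio.client import Client  # ideally the only place DIRAC imports rucio


class RucioFileCatalogClient(FileCatalogClientBase):

    # Each plug-in "announces" what it implements on top of the
    # mandatory base methods (hasAccess, exists, getPathPermissions).
    READ_METHODS = FileCatalogClientBase.READ_METHODS + ["getReplicas"]
    WRITE_METHODS = ["addFile", "removeFile"]

    def __init__(self, **options):
        super(RucioFileCatalogClient, self).__init__(**options)
        self.rucio = Client()

    def exists(self, lfns):
        """Answer DIRAC's 'exists' contract with Rucio DID lookups.
        Assumes `lfns` is a list; DIRAC expects per-LFN results split
        into 'Successful' and 'Failed' dictionaries."""
        successful, failed = {}, {}
        for lfn in lfns:
            try:
                # Placeholder mapping: scope fixed to 'virgo', name = LFN.
                self.rucio.get_did(scope="virgo", name=lfn)
                successful[lfn] = True
            except Exception as exc:
                failed[lfn] = str(exc)
        return S_OK({"Successful": successful, "Failed": failed})
```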

gabrielefronze commented 5 years ago

Hi @fstagni,

thank you for joining the conversation. Indeed that was the solution we first thought about, but I have some questions:

  1. How is the synchronization between two separate catalogs (the DIRAC one and the external one) handled?
  2. What are the required keys for the external catalog to implement to make it fully compatible with DIRAC? In case the external catalog doesn't provide topological information, how does DIRAC retrieve such details?
  3. Is there a list of the write and read methods to be implemented in the custom interface derived from FileCatalogClientBase.py?
  4. What methods does DIRAC need in order to be at least as efficient as with the native catalog?

Thank you,

Gabriele

fstagni commented 5 years ago

  1. There's always a Master catalog; this is defined in the DIRAC CS (Configuration System), and a sketch of reading such a definition follows this list. Just to be sure: you don't necessarily need the "DIRAC one". I repeat: a catalog is a catalog as long as it behaves like one.
  2. What is "topological information" for you? The location of the replicas?
  3. Each catalog plugin "announces" what it can do. Beyond that, there's a really basic list of required methods; just check the code for that.
  4. DIRAC needs nothing at all in that sense. The DFC is just another service.
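To illustrate point 1: catalog definitions live in a CS section (conventionally `/Resources/FileCatalogs`), and any DIRAC component can read them through `gConfig`. A minimal sketch, where the section path and the `Master` option name are assumptions based on the usual DIRAC CS layout:

```python
"""Sketch: inspecting a catalog definition in the DIRAC Configuration
System. The section path and the 'Master' option are assumptions, not
a verified Rucio setup."""
from DIRAC.Core.Base import Script
Script.parseCommandLine()  # standard DIRAC client initialisation (CS access, etc.)

from DIRAC import gConfig

# Hypothetical section where a Rucio catalog would be declared,
# alongside (or instead of) the DFC, with one catalog flagged as Master.
result = gConfig.getOptionsDict("/Resources/FileCatalogs/RucioFileCatalog")
if result["OK"]:
    print("Catalog options:", result["Value"])
```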
gabrielefronze commented 5 years ago

  1. Awesome. I think the Rucio catalog can generate the missing info on the fly using functions, hence it should be possible to plug it into DIRAC;
  2. yes, for example. As far as I have understood, DIRAC tries to perform the computation as close as possible to the data. How does it figure out how to do that? Does the catalog provide some information, or is it done using some external metrics?
  3. I saw the list of mandatory methods, READ_METHODS = [ 'hasAccess', 'exists', 'getPathPermissions' ]. It seems to me (and makes total sense) that DIRAC requires being able to read data, but persistent output of data is not strictly required for the computing functionalities.
  4. Several times in your publications there is an indication of a link between DIRAC's computing efficiency and the catalog information. Can you clarify that a bit?

In addition, I would like to ask whether there is any example of a custom implementation of FileCatalogClientBase.py (e.g. for the LFC) to read, to better understand the integration process.

Thank you

Gabriele

fstagni commented 5 years ago

  1. I am not sure what the "missing info" is...? Can you elaborate?
  2. This is not fully correct. DIRAC CAN make the computation "close" to the data, but this is not a requirement. In fact, you can run productions even in "full mesh" mode, meaning jobs can in theory go anywhere independently of the location of the input files. 2a. A "replica catalog" at least provides you with the location of the replicas. This location is a DIRAC Storage Element. This info is used in several places, but a DIRAC SE can simply be "RucioSE" if this is something you want.
  3. DIRAC jobs decide if and where to store 0/1/N of their outputs. The DataManager object is what links the functionalities of FC and SE, and it's often the starting point for simple DM operations (see the sketch after this list).
  4. Well... the DIRAC DFC is fast, efficient, practical, customizable, widely used, and it's already there. I can't compare it to any other solution apart from the LFC, and I am not in a position to compare it with the Rucio Catalog because I don't know much about it.
  5. For examples, just look in https://github.com/DIRACGrid/DIRAC/tree/integration/Resources/Catalog : LcgFileCatalogClient.py is the LFC, FileCatalogClient is the DFC (I would not suggest looking at the others because they are a bit less obvious). For LHCb, https://gitlab.cern.ch/lhcb-dirac/LHCbDIRAC/blob/master/LHCbDIRAC/Resources/Catalog/BookkeepingDBClient.py is the LHCb Bookkeeping.
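Regarding the DataManager mentioned in point 3, a minimal sketch of a simple DM operation from a script; the LFN, local file and SE name (reusing the hypothetical "RucioSE" from point 2a) are placeholders:

```python
"""Sketch of a simple data-management operation through DIRAC's
DataManager, which ties the File Catalog(s) to the Storage Elements.
LFN, local file and SE name are placeholders."""
from DIRAC.Core.Base import Script
Script.parseCommandLine()  # standard DIRAC client initialisation

from DIRAC.DataManagementSystem.Client.DataManager import DataManager

dm = DataManager()
# putAndRegister uploads the local file to the given DIRAC SE and then
# registers the LFN in every catalog configured in the CS.
result = dm.putAndRegister(
    lfn="/virgo/user/g/gfronze/myfile.root",
    fileName="myfile.root",
    diracSE="RucioSE",
)
print(result)
```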
gabrielefronze commented 5 years ago

  1. and 2.: I see that DIRAC can operate in "full mesh" mode, but in any other case the locality of data and computing resources must be described. Question 1 was about that: what information is needed in order not to operate in "full mesh" mode?
  2. Correct me if I am wrong: if I get the Rucio catalog working from DIRAC, I should be able to publish output files to Rucio BY HAND using direct calls to the Rucio APIs. To perform a basic test of the eventual FileCatalogClientRucio.py implementation, it should be enough to use direct calls instead of a custom DIRAC plugin.
  3. The question was more about "how much of DIRAC's efficiency is due to custom information stored in/computed from the DFC?"
  4. Thanks for the references.
fstagni commented 5 years ago

  1. DIRAC needs to know where the input data is in order to do proper job scheduling. Data is always in at least one "Storage Element", and the SE(s) need to be described in the DIRAC CS.
  2. Of course.
  3. TL;DR: nothing. DIRAC is a (not small) set of components. Each component has its own life, and there's no need to install all of them; in fact, many installations only install a small subset. The DFC is just one DIRAC component, and from an informed user's perspective it's just a URL. The queries it accepts and the answers it gives need to follow a certain contract, nothing else. If another catalog respects that contract, then you're done. The RucioFileCatalogClient.py file should be the only one in the whole of DIRAC where you do import rucio.
cserf commented 3 years ago

A small update on this ticket. A RucioFileCatalogClient is now available in DIRAC; it was merged into v7r0: https://github.com/DIRACGrid/DIRAC/pull/5067. The patch also contains a RucioSynchronizerAgent and a RucioRSSAgent, which are used to synchronize the DIRAC Configuration Service and Rucio. What is currently missing is writing and setting up the tests to validate it. The test setup will require creating a Rucio instance that can be used in GitHub Actions. Any help is welcome; if anyone is interested in working on this, please get in touch with me. Caveat: the implementation of the RFC is based on the Belle II one, and although we tried to be collaboration-agnostic, there might be a few things to change for other communities.

bari12 commented 2 years ago

I am closing this ticket now; it's largely an overview ticket anyway. Specific changes which still need to be addressed are in the Rucio issue tracker under the DIRAC label, or on the DIRAC tracker.