icatproject / ids.server

The server component of the ICAT Data Service

Process datasets as a whole even with storage unit datafile #107

Open RKrahl opened 4 years ago

RKrahl commented 4 years ago

I suggest changing the behavior of ids.server when the storage unit is set to datafile: at the moment, each datafile is archived and restored individually. The suggestion is to always archive and restore entire datasets, regardless of the storage unit. That is, the distinction between the storage units would almost completely be dropped from the core of ids.server. It would still be retained in the way the archive storage is accessed: for instance, the DfRestorer would then restore one single dataset rather than an arbitrary list of datafiles, but unlike the DsRestorer, it would still read each file individually from archive storage rather than a single zip file containing all the files of the dataset.
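To illustrate the idea, here is a minimal sketch of what such a per-dataset restore could look like. The ArchiveReader, MainWriter and DatafileRef types below are hypothetical stand-ins, not the actual ids.server plugin interfaces; the point is only that the dataset becomes the unit of work while archive storage is still read file by file:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.List;

// Hypothetical stand-in for per-file read access to archive storage.
interface ArchiveReader {
    InputStream get(String datafileLocation) throws IOException;
}

// Hypothetical stand-in for writing a datafile into a dataset folder on main storage.
interface MainWriter {
    void put(Path datasetFolder, String name, InputStream data) throws IOException;
}

class DatafileRef {
    final String name;
    final String archiveLocation;

    DatafileRef(String name, String archiveLocation) {
        this.name = name;
        this.archiveLocation = archiveLocation;
    }
}

class DatasetRestoreSketch {
    // Restore a whole dataset as one unit of work, but still with one
    // archive read per file, unlike the zip-based restore used for
    // storage unit dataset.
    static void restoreDataset(Path datasetFolder, List<DatafileRef> files,
            ArchiveReader archive, MainWriter main) throws IOException {
        for (DatafileRef df : files) {
            try (InputStream in = archive.get(df.archiveLocation)) {
                main.put(datasetFolder, df.name, in);
            }
        }
    }
}
```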

The main benefit would (hopefully) be a significant improvement of the performance in two aspects: checking whether a dataset is online would need a single fstat on the dataset rather than one per datafile, and restore requests to archive storage would be issued per dataset rather than per datafile.

The only drawback I can see for the moment is a coarser granularity of archive and restore operations. If a user requests a single datafile from a dataset having a large number of datafiles, the full dataset will be restored, not just the requested file.

The suggestion would keep compatibility with existing archive storage. However, the upgrade procedure for the main storage will not be trivial: before upgrading to a version that implements this suggestion, it must be ensured that only complete datasets are online.

dfq16044 commented 4 years ago

Currently at DLS, we have individual datasets with more than 1 million datafiles.

RKrahl commented 4 years ago

Yes, and that is exactly what causes the performance issues you reported. It means that merely checking whether such a dataset is online costs you more than 1 million fstats, as opposed to one single fstat if this suggestion is implemented.
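A rough illustration of the difference (hypothetical helper methods, not the actual ids.server code):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class OnlineCheckSketch {

    // Current behaviour with storage unit datafile: one stat per datafile,
    // i.e. over a million Files.exists() calls for such a dataset.
    static boolean isOnlinePerDatafile(List<Path> datafilePaths) {
        for (Path p : datafilePaths) {
            if (!Files.exists(p)) {
                return false;
            }
        }
        return true;
    }

    // Proposed behaviour: one stat for the dataset folder, regardless of
    // how many datafiles the dataset contains.
    static boolean isOnlinePerDataset(Path datasetFolder) {
        return Files.exists(datasetFolder);
    }
}
```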

dfq16044 commented 4 years ago

The problem is restoring 1 million datafiles from the tape system in this case. This may add some stress on the tape library.

Regards,

Sylvie



antolinos commented 4 years ago

I wonder, taking into account the available file formats (like HDF5), does it make sense to store 1M files per dataset? Would it be more efficient to address the root of the problem instead of the IDS? We are reducing the number of files by a factor of 1000 just by using HDF5.

Just my opinion, A.

dfq16044 commented 4 years ago

Dear Alex,

I agree with you, but I cannot do anything with data that is already in the database ... This is mainly processed data.

Regards,

Sylvie.



RKrahl commented 4 years ago

@antolinos: sure, that makes sense. But it also makes sense to improve ids.server where its behavior is inefficient. Both are orthogonal paths of improvement that should be followed independently of each other. And finally, it makes sense to use challenging cases such as the situation at DLS to detect and understand inefficient behavior.

RKrahl commented 4 years ago

This depends on #109.

kevinphippsstfc commented 3 years ago

I just came across this during my work fixing IDS issues for Diamond and I'm afraid that I totally agree with @dfq16044. Because some Diamond datasets contain so many files, the proposed behaviour would be disastrous for Diamond. I also agree with the comments that datasets should not have so many files in them, but this is data that has been ingested over the last 10+ years and cannot just be deleted, tidied up or re-processed into nexus files, so for now we are stuck with it.

RKrahl commented 3 years ago

@kevinphippsstfc, I rather believe that the current behavior of ids.server, processing each of the millions of datafiles in a dataset individually, is what is disastrous for Diamond.

I regularly get complaints from Diamond about the poor performance of ids.server. This proposal is the direct result of an in-depth analysis of an event at Diamond in January 2019 that caused problems for the tape system due to the particular pattern of restore requests sent by the current implementation of ids.server. The main cause of the performance issues at Diamond is exactly the combination of datasets having a very large number of datafiles and the setting storageUnit = datafile. The thorough solution would be switching to storageUnit = dataset. But I understand that this is impossible, because you cannot convert the legacy of ten years of storage in the backend. This proposal is tailored exactly to your situation at Diamond and would bring you some of the performance benefits of storageUnit = dataset without the need to modify your tape archive. I still believe it would significantly improve the performance at Diamond.

kevinphippsstfc commented 3 years ago

Apologies @RKrahl, I was not aware that this suggestion originated from Chris's email to the ICAT group. It's good that the conversation is now linked to this issue. Also, many thanks for looking into this - I appreciate that it is not trivial in itself, having spent quite some time trying to understand the IDS myself!

I did some further thinking about this and realised that it would not be easy for Diamond to decide whether a Dataset is online. Diamond Datasets do not have the location field populated (full path locations are in the Datafiles), and I presume this would be required, or else some programmatic way to create a path to a top-level Dataset folder unique to that Dataset. If that folder exists on main storage, then you assume that all the Datafiles within the Dataset are also online.

RKrahl commented 3 years ago

No need for apologies.

You hit a valid point: the decision whether a dataset is online is taken in the storage plugin, which needs to implement a method mainStorage.exists(DsInfo) (if the plugin is to support storageUnit = dataset or if this proposal were implemented). How this decision is taken is up to the plugin. When formulating this proposal, I unconsciously took for granted that each dataset has a dedicated folder in main storage and that it would be easy to determine the path of that folder from the DsInfo attributes, because that is the case in the reference plugin ids.storage_file and also in our own plugin at HZB. (In both cases, the location attribute is not used; the folder layout in main storage is prescribed by the plugin itself.)
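As an illustration, a plugin that follows such a layout could implement the check along these lines. DatasetInfo is only a stand-in for the plugin's DsInfo, and the <invId>/<dsId> folder layout is an assumption of this sketch, not necessarily what ids.storage_file does:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Stand-in for the attributes a plugin could read from DsInfo.
interface DatasetInfo {
    long getInvId();
    long getDsId();
}

class MainStorageExistsSketch {

    private final Path baseDir;

    MainStorageExistsSketch(Path baseDir) {
        this.baseDir = baseDir;
    }

    // Layout prescribed by the plugin itself (no Dataset.location needed):
    // <baseDir>/<investigationId>/<datasetId>
    Path datasetFolder(DatasetInfo ds) {
        return baseDir.resolve(Paths.get(Long.toString(ds.getInvId()),
                Long.toString(ds.getDsId())));
    }

    // A dataset is considered online iff its folder exists on main storage.
    boolean exists(DatasetInfo ds) {
        return Files.exists(datasetFolder(ds));
    }
}
```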

If this assumption does not hold and the plugin cannot implement mainStorage.exists(DsInfo), then there is indeed a problem.