m2ms / fragalysis-frontend

The React, Redux frontend built by webpack

New fragmentation run #1486

Open Waztom opened 1 month ago

Waztom commented 1 month ago

Matteo is providing a new set of molecules that need to be added to the graph database. The expectation is that there will be about 200M molecules. To process this we need to reinstate the fragmentation machinery and, as the database will no longer be co-located with the compute cluster, we need to make minor changes to the process to copy files between clusters.
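For the copy-between-clusters step, a thin wrapper around rsync may be all that's needed. A minimal sketch, assuming rsync over SSH is available between the two sites (the function, hosts, and paths below are hypothetical, not from this thread):

```python
import subprocess

def copy_between_clusters(src: str, dest: str) -> None:
    """Copy a file or directory tree between clusters with rsync over SSH.

    Hypothetical endpoints for illustration only; the real process would
    use whatever hosts/paths the playbooks are configured with.
    """
    subprocess.run(
        [
            "rsync",
            "-av",        # archive mode, verbose
            "--partial",  # keep partially-transferred files so restarts resume
            "--checksum", # verify content, not just timestamps/sizes
            src,
            dest,
        ],
        check=True,  # raise if rsync exits non-zero
    )

# e.g. push standardized molecules from the compute cluster to the DB host
copy_between_clusters(
    "/work/standardized/",                    # hypothetical path
    "db-host:/data/incoming/standardized/",   # hypothetical host:path
)
```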

We aim to process this data as a new dataset (i.e. starting with an empty database) and then use the combine play to combine it with the old data. Hence the process will look like this (a rough sketch of steps 1 and 7 follows the list):

  1. standardize the new molecules (cluster)
  2. load standardized molecules into the database (database)
  3. extract out the molecules to be fragmented (database)
  4. fragment (cluster)
  5. load fragmented data into database (database)
  6. extract out molecules needing generation of additional info (e.g. InChI) (database)
  7. generate additional info (cluster)
  8. load additional info into database (database)
  9. generate nodes and edges CSV files (database)
  10. combine with existing data (cluster)
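To make steps 1 and 7 concrete, here is a rough RDKit-based sketch of the kind of per-molecule work those stages perform. This is not the actual fragmentation machinery; the real standardization rules and descriptor set may differ:

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles: str):
    """Step 1 sketch: canonicalize a molecule, or None if it won't parse."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.Cleanup(mol)         # normalize, reionize, etc.
    mol = rdMolStandardize.FragmentParent(mol)  # keep the largest fragment
    return Chem.MolToSmiles(mol)                # canonical SMILES

def additional_info(smiles: str) -> dict:
    """Step 7 sketch: generate extra descriptors such as the InChI."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "inchi": Chem.MolToInchi(mol),
        "inchi_key": Chem.MolToInchiKey(mol),
        "hac": mol.GetNumHeavyAtoms(),  # heavy atom count
    }

print(standardize("CCO.O"))  # keeps the largest fragment (ethanol)
```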

We plan to run a small test dataset through this process to make sure that it's running properly.

mwinokan commented 3 weeks ago

Chris Reynolds is able to assist after shutdown (after week 1 of Sept).

@Waztom asks if we can move this job to the DLS cluster (not STFC). Either way @tdudgeon says we need significant resources, which are available at DLS (150TB of GPFS).

@tdudgeon assumes that the functionality of the DLS cluster will be the same as STFC's, as they both use SLURM. @Waztom says we can leverage @ConorFWild's experience rather than relying on Chris.

Object storage for IRIS/STFC has not been investigated by @Waztom or @mwinokan.

The CPU requirement is around 2000 cores. @mwinokan has briefly checked and there are around 56 idle nodes with 64 cores and 500GB of RAM each; both the gpfs03 and gpfs04 filesystems are mounted.

The Postgres server will need to be available as well. Importing the Postgres volume into the DLS cluster will need Diamond IT/SC assistance.
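Because the loader stages depend on the server being reachable from wherever they run, a pre-flight connectivity check could gate them. A sketch assuming psycopg2 (all connection details hypothetical):

```python
import psycopg2

def database_available(host: str, dbname: str, user: str, password: str) -> bool:
    """Return True if the Postgres server accepts connections.

    Hypothetical connection details; a check like this could gate the
    loader stages so cluster jobs don't run against a missing database.
    """
    try:
        with psycopg2.connect(
            host=host, dbname=dbname, user=user,
            password=password, connect_timeout=5,
        ) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
        return True
    except psycopg2.OperationalError:
        return False
```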

alanbchristie commented 2 weeks ago

A simple diagram illustrating the non-cluster elements (like the DB and NFS server) along with the "expected" shared filesystem:

[diagram attached]

mwinokan commented 2 weeks ago

@ConorFWild says that obtaining 2000 concurrently available cores will need SC to increase the job limits. Graham is the SC contact for this. It's estimated that the run will take a week or two on 2000 cores.

Conor suggests that many jobs with fewer cores each will be friendlier to beamline processes, i.e. 2000 single-core jobs instead of one job with 2000 cores.
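In SLURM terms this maps onto a job array of single-core tasks. A hypothetical submission sketch (the chunk script, partition, and throttle value are illustrative, not something agreed in this thread):

```python
import subprocess

def submit_fragment_array(n_chunks: int = 2000) -> None:
    """Submit the fragmentation work as a SLURM job array of single-core
    tasks rather than one monolithic 2000-core job.

    'fragment_chunk.sh' and the partition name are hypothetical; each
    array task would process one shard of the molecule set, selected via
    the SLURM_ARRAY_TASK_ID environment variable.
    """
    subprocess.run(
        [
            "sbatch",
            f"--array=0-{n_chunks - 1}%500",  # at most 500 tasks at once
            "--cpus-per-task=1",
            "--mem=4G",
            "--partition=cs04r",              # hypothetical partition
            "fragment_chunk.sh",
        ],
        check=True,
    )
```

The `%500` throttle caps how many array tasks run concurrently, which is what makes this pattern gentler on other users of the cluster.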

@alanbchristie to create an SC request ticket: SChelpdesk@diamond.ac.uk

alanbchristie commented 2 weeks ago

SC Request ID: SCHD-5779

alanbchristie commented 2 weeks ago

The SC Request has been shut down and closed as "Won't Do".

I have forwarded the email, but clearly there is no desire to support execution outside of IRIS. I will step aside on this topic as there is nothing more I can do.

mwinokan commented 4 days ago

@alanbchristie has made some progress; the work will be done on the development cluster. 6 new machines with 380GB have been created, and playbook work has been initialised. Alan is optimistic and will do a dry run tomorrow. Even with these added to the existing resources, the number of CPUs will be about 3x less than was used for the previous run, which took around 3 weeks (on the galaxy cluster).

Matteo is back online tomorrow and apparently the data is ready and somewhere on /dls: ~250M compounds. The compound selection process will also need to be documented.