simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

CVMFS publishing server not working #27

Closed: ickc closed this issue 4 months ago

ickc commented 8 months ago

/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/hello_world.sh was created on cvmfs-uploader02.gridpp.rl.ac.uk at Nov 17 15:58. At the time of writing, it still cannot be seen from vm77.

We need to document the expected time scale before users can see deployed software.

Cf. #20
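
One way to put a number on that time scale is to measure it from a client. The sketch below is a minimal example, assuming Python 3 is available on the client (e.g. vm77 or a worker node). After publishing, it polls a path until it appears and reports how long that took. The function name `wait_until_visible` and its timeout and polling defaults are illustrative choices, not part of any CVMFS tool.

```python
import os
import time


def wait_until_visible(path, timeout_s=3600.0, interval_s=60.0):
    """Poll until `path` exists; return the seconds waited, or None on timeout.

    Meant to be run on a CVMFS client right after publishing, to measure
    how long new content takes to propagate to that client.
    """
    start = time.monotonic()
    while True:
        if os.path.exists(path):
            return time.monotonic() - start
        elapsed = time.monotonic() - start
        if elapsed >= timeout_s:
            return None
        # Never sleep past the timeout.
        time.sleep(min(interval_s, timeout_s - elapsed))
```

For example, `wait_until_visible("/cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/hello_world.sh", timeout_s=86400, interval_s=300)` would return the delay in seconds, or `None` if the file had not appeared within a day. Repeating the measurement over several publications would give a rough, empirical figure to document.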

ickc commented 8 months ago

@rwf14f, @afortiorama, I think something is wrong here that I don't understand. I created a file /cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/hello_world.sh and uploaded it using the publishing node cvmfs-uploader02.gridpp.rl.ac.uk at Nov 17 15:58. At the time of writing (Tue Nov 21 16:18:18 GMT 2023), it still cannot be seen from vm77.tier2.hep.manchester.ac.uk.

CVMFS on vm77 was set up by me, so the problem may lie in my setup. However, vm77 can read files inside /cvmfs/northgrid.gridpp.ac.uk/lsst, for example.

Thanks.

ickc commented 8 months ago

I just tried it on a worker node, wn3806190.tier2.hep.manchester.ac.uk, and the directory is still not there.
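
When content is visible on the publishing node but not on clients, a useful first check is which catalog revision each client has mounted. The CVMFS client exposes the revision as an extended attribute on the repository mount point (`attr -g revision /cvmfs/<repo>` prints it). The sketch below reads the same attribute through `os.getxattr`, assuming a Linux client and Python 3. The helper name `cvmfs_revision` is hypothetical. If vm77 and the worker node report the same revision before and after a publication, the new catalog has not reached them, which suggests the problem lies upstream of the clients rather than in their local setup.

```python
import os


def cvmfs_revision(repo_mount):
    """Return the catalog revision mounted at `repo_mount`, or None.

    Reads the `user.revision` extended attribute that the CVMFS client
    exposes on a repository mount point. Returns None when the attribute
    cannot be read, e.g. the path is not a CVMFS mount or the platform
    has no xattr support in Python's standard library.
    """
    getxattr = getattr(os, "getxattr", None)  # only available on Linux
    if getxattr is None:
        return None
    try:
        raw = getxattr(repo_mount, "user.revision")
        # Tolerate trailing NULs or whitespace in the attribute value.
        return int(raw.decode(errors="replace").strip("\x00 \t\n"))
    except (OSError, ValueError):
        return None
```

Running `cvmfs_revision("/cvmfs/northgrid.gridpp.ac.uk")` on vm77 and on wn3806190.tier2.hep.manchester.ac.uk shows whether the two clients see the same snapshot of the repository.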

rwf14f commented 8 months ago

There seems to be a problem with the publishing. I've contacted the admin and he will look into it.

ickc commented 8 months ago

Hi, @rwf14f, it is working on vm77 and wn3806201.tier2.hep.manchester.ac.uk now. I suppose the admin has fixed it, so I am closing this issue.

ickc commented 7 months ago

I'm not sure whether the publishing server has failed to sync again or just has a very long synchronization delay. I unpacked the software around 3pm today and it is still not available (~5:30pm).

On vm77:

$ ls /cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm -lagh
Permissions Size User  Group Date Modified Name
drwxr-xr-x     - cvmfs cvmfs 12 Dec 13:59  so-pmpm-py310-mkl-x86-64-v3-mpich-20231212
drwxr-xr-x     - cvmfs cvmfs 12 Dec 13:59  so-pmpm-py310-mkl-x86-64-v3-openmpi-20231212
drwxr-xr-x     - cvmfs cvmfs 12 Dec 13:59  so-pmpm-py310-mkl-x86-64-v4-mpich-20231212
drwxr-xr-x     - cvmfs cvmfs 12 Dec 13:59  so-pmpm-py310-mkl-x86-64-v4-openmpi-20231212
drwxr-xr-x     - cvmfs cvmfs 12 Dec 13:59  so-pmpm-py310-nomkl-x86-64-v3-mpich-20231212
drwxr-xr-x     - cvmfs cvmfs 12 Dec 13:59  so-pmpm-py310-nomkl-x86-64-v3-openmpi-20231212
drwxr-xr-x     - cvmfs cvmfs 12 Dec 13:59  so-pmpm-py310-nomkl-x86-64-v4-mpich-20231212
drwxr-xr-x     - cvmfs cvmfs 12 Dec 13:59  so-pmpm-py310-nomkl-x86-64-v4-openmpi-20231212

On publishing node:

$ ls /cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm -alh
total 459K
drwxr-xr-x 26 northgridsgm cvmfsvos 34 Dec 14 16:17 .
drwxr-xr-x  7 northgridsgm cvmfsvos  8 Dec 14 16:16 ..
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 12 13:59 so-pmpm-py310-mkl-x86-64-v3-mpich-20231212
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 13 01:11 so-pmpm-py310-mkl-x86-64-v3-mpich-20231213
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 14 01:10 so-pmpm-py310-mkl-x86-64-v3-mpich-20231214
lrwxrwxrwx  1 northgridsgm cvmfsvos 42 Dec 14 16:17 so-pmpm-py310-mkl-x86-64-v3-mpich-latest -> so-pmpm-py310-mkl-x86-64-v3-mpich-20231214
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 12 13:59 so-pmpm-py310-mkl-x86-64-v3-openmpi-20231212
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 13 01:11 so-pmpm-py310-mkl-x86-64-v3-openmpi-20231213
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 14 01:10 so-pmpm-py310-mkl-x86-64-v3-openmpi-20231214
lrwxrwxrwx  1 northgridsgm cvmfsvos 44 Dec 14 16:17 so-pmpm-py310-mkl-x86-64-v3-openmpi-latest -> so-pmpm-py310-mkl-x86-64-v3-openmpi-20231214
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 12 13:59 so-pmpm-py310-mkl-x86-64-v4-mpich-20231212
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 13 01:12 so-pmpm-py310-mkl-x86-64-v4-mpich-20231213
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 14 01:10 so-pmpm-py310-mkl-x86-64-v4-mpich-20231214
lrwxrwxrwx  1 northgridsgm cvmfsvos 42 Dec 14 16:17 so-pmpm-py310-mkl-x86-64-v4-mpich-latest -> so-pmpm-py310-mkl-x86-64-v4-mpich-20231214
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 12 13:59 so-pmpm-py310-mkl-x86-64-v4-openmpi-20231212
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 13 01:11 so-pmpm-py310-mkl-x86-64-v4-openmpi-20231213
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 14 01:11 so-pmpm-py310-mkl-x86-64-v4-openmpi-20231214
lrwxrwxrwx  1 northgridsgm cvmfsvos 44 Dec 14 16:17 so-pmpm-py310-mkl-x86-64-v4-openmpi-latest -> so-pmpm-py310-mkl-x86-64-v4-openmpi-20231214
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 12 13:59 so-pmpm-py310-nomkl-x86-64-v3-mpich-20231212
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 13 01:11 so-pmpm-py310-nomkl-x86-64-v3-mpich-20231213
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 14 01:10 so-pmpm-py310-nomkl-x86-64-v3-mpich-20231214
lrwxrwxrwx  1 northgridsgm cvmfsvos 44 Dec 14 16:17 so-pmpm-py310-nomkl-x86-64-v3-mpich-latest -> so-pmpm-py310-nomkl-x86-64-v3-mpich-20231214
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 12 13:59 so-pmpm-py310-nomkl-x86-64-v3-openmpi-20231212
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 13 01:11 so-pmpm-py310-nomkl-x86-64-v3-openmpi-20231213
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 14 01:10 so-pmpm-py310-nomkl-x86-64-v3-openmpi-20231214
lrwxrwxrwx  1 northgridsgm cvmfsvos 46 Dec 14 16:17 so-pmpm-py310-nomkl-x86-64-v3-openmpi-latest -> so-pmpm-py310-nomkl-x86-64-v3-openmpi-20231214
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 12 13:59 so-pmpm-py310-nomkl-x86-64-v4-mpich-20231212
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 13 01:11 so-pmpm-py310-nomkl-x86-64-v4-mpich-20231213
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 14 01:10 so-pmpm-py310-nomkl-x86-64-v4-mpich-20231214
lrwxrwxrwx  1 northgridsgm cvmfsvos 44 Dec 14 16:17 so-pmpm-py310-nomkl-x86-64-v4-mpich-latest -> so-pmpm-py310-nomkl-x86-64-v4-mpich-20231214
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 12 13:59 so-pmpm-py310-nomkl-x86-64-v4-openmpi-20231212
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 13 01:11 so-pmpm-py310-nomkl-x86-64-v4-openmpi-20231213
drwxr-xr-x 27 northgridsgm cvmfsvos 30 Dec 14 01:10 so-pmpm-py310-nomkl-x86-64-v4-openmpi-20231214
lrwxrwxrwx  1 northgridsgm cvmfsvos 46 Dec 14 16:17 so-pmpm-py310-nomkl-x86-64-v4-openmpi-latest -> so-pmpm-py310-nomkl-x86-64-v4-openmpi-20231214
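
Comparing the two listings by eye is error-prone. As a small aid, the sketch below extracts entry names from `ls -l`-style output (it accepts both listing styles quoted above) and reports which entries exist on the publishing node but are not yet visible on the client. It assumes names contain no spaces, which holds for the pmpm environment directories. The function names are hypothetical.

```python
def entry_names(listing):
    """Return the set of entry names in `ls -l`-style output.

    The name is taken as the last whitespace-separated field; for symlinks,
    the " -> target" part is dropped first. Prompt, "total" and header lines
    and the "." and ".." entries are skipped. Assumes names have no spaces.
    """
    names = set()
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.startswith(("$", "total", "Permissions")):
            continue
        line = line.split(" -> ", 1)[0]  # drop the symlink target, if any
        name = line.split()[-1]
        if name not in (".", ".."):
            names.add(name)
    return names


def missing_on_client(publisher_listing, client_listing):
    """Entries present on the publishing node but not yet visible on the client."""
    return sorted(entry_names(publisher_listing) - entry_names(client_listing))
```

Applied to the two listings above, this would report the 16 environments dated 20231213 and 20231214 plus the eight `-latest` symlinks as not yet synced.
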

ickc commented 7 months ago

It has just become available to me. I'm not sure whether you fixed something, but if so, thanks!

ickc commented 7 months ago

The publishing server is not syncing again. Could you take a look?

A bit more detail: I'm on cvmfs-uploader02.gridpp.rl.ac.uk (I ssh into cvmfs-upload01.gridpp.rl.ac.uk, but a load balancer probably sends me there), and I unpacked a batch of software yesterday. None of it seems to have synced to either vm77.tier2.hep.manchester.ac.uk or wn3806190.tier2.hep.manchester.ac.uk. One example of a directory where a lot of new content should appear is /cvmfs/northgrid.gridpp.ac.uk/simonsobservatory/pmpm.

ickc commented 7 months ago

@rwf14f, @afortiorama,

There's one detail that I'm not sure is relevant here. I'm writing directly to /cvmfs/northgrid.gridpp.ac.uk/simonsobservatory instead of /cvmfs-mirror/northgrid.gridpp.ac.uk/simonsobservatory. Would that make a difference?

Edit: I found that on the publishing node, writing to either path has the same effect there. But I'm not sure whether they trigger syncing differently.

ickc commented 7 months ago

Following up: somehow this is partly resolved. Some of the environments seem to have synced over the holiday. The ones I unpacked 2 days ago are there now, but the ones I unpacked yesterday have not synced yet.

I have no idea why. Did someone fix something?

Edit: Note that even if it is syncing, the delay still seems too long.

ickc commented 6 months ago

More insights from an email chain:


From Jose:

When I came back from the Xmas holidays, I noticed that the last attempt to publish new content had failed. On investigating, I found the repo was corrupted at the Stratum-0 server, and I was not able to recover it. So I decided to recreate it from scratch. However, every time I try to transfer all the content from the Uploader to the Stratum-0, the rsync process fails [*]. I am still investigating. But, for the time being, no new content is being published. I will let you know as soon as I have restored it.


From Alessandra:

Thanks for the alert. We should discuss how much space is needed, also in view of moving this software to a dedicated Simons Observatory space I mentioned in another thread. GridPP has given its blessing to set it up. I don't know whether it is better to start a new thread for the latter.


From Jose:

At this point, I am not sure there is really an issue with disk space on the server. The numbers don’t match. I need to understand what is happening.

For the new repository, which domain name do you want? “gridpp.ac.uk”? “egi.eu”?


From Alessandra:

I'd prefer gridpp.ac.uk.

Will we need to modify the configuration at the site, or will it be configured centrally and automatically?


From Jose:

Configuration is based on domain names. So, if a site is already getting gridpp.ac.uk repos correctly, nothing needs to be done. I would need to create the new repo, a new UNIX account, and fix the configuration files at RAL. But that is my job :)


From Jose:

It seems to be a limit on the total size of a single publishing operation. Because I am trying to republish the entire repo from scratch, I think I am hitting that limit. I am now attempting to publish in chunks. But that means that, until I am done, the Stratum-0 will be publishing unusable content.
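
Publishing in chunks amounts to splitting the content into batches that each stay under the per-operation limit. The sketch below illustrates that idea under stated assumptions. The thread does not give the actual limit, so `max_bytes` is a hypothetical parameter, and the greedy grouping is just one simple strategy, not the procedure used on the Stratum-0. Each top-level directory is treated as an indivisible unit, and the hypothetical `dir_size` helper sums file sizes without following symlinks.

```python
import os


def dir_size(path):
    """Total size in bytes of regular files and symlinks under `path` (links not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):  # os.walk does not follow symlinked dirs
        for name in files:
            total += os.lstat(os.path.join(root, name)).st_size
    return total


def plan_chunks(sizes, max_bytes):
    """Greedily group (name, size_in_bytes) pairs into batches of at most max_bytes.

    Each batch would be published as a separate operation. An entry larger
    than max_bytes on its own still gets a batch to itself, because it cannot
    be split at this level.
    """
    chunks, current, current_size = [], [], 0
    for name, size in sizes:
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

For example, `plan_chunks([(d, dir_size(d)) for d in dirs], max_bytes=limit)` would return lists of directory names, each of which could then be published in its own operation.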

rwf14f commented 6 months ago

> There's one detail that I'm not sure is relevant here. I'm writing directly to /cvmfs/northgrid.gridpp.ac.uk/simonsobservatory instead of /cvmfs-mirror/northgrid.gridpp.ac.uk/simonsobservatory. Would that make a difference?

Do not copy anything directly into any of the high-level directories yourself. This can cause problems with the deployment scripts. You should copy your software into $HOME/cvmfs_repo, which is a link to the relevant CVMFS upload directory of the VO user account (e.g. /cvmfs-mirror/northgrid.gridpp.ac.uk for northgrid). If you install the software in /cvmfs/northgrid.gridpp.ac.uk/ on your local machine, then you should copy the contents of that directory into $HOME/cvmfs_repo/ on the uploader machine.
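
As an illustration of that workflow, the sketch below copies a locally prepared tree into `$HOME/cvmfs_repo` and refuses any destination that resolves outside it, so nothing is written directly into `/cvmfs/...`. It assumes the tree has already been transferred to the uploader machine and that Python 3.8+ (for `dirs_exist_ok`) is available there. The function name and the `rel_dest` convention (the path relative to `/cvmfs/northgrid.gridpp.ac.uk/`, e.g. `simonsobservatory/pmpm`) are hypothetical. Symlinks are copied as links, so aliases such as the `-latest` links survive.

```python
import shutil
from pathlib import Path


def stage_into_cvmfs_repo(src, rel_dest, repo_root=None):
    """Copy the tree at `src` into <repo_root>/<rel_dest> and return the destination.

    repo_root defaults to $HOME/cvmfs_repo (the upload link). Destinations that
    would resolve outside repo_root are rejected, so nothing lands directly in
    /cvmfs or another high-level directory.
    """
    root = Path(repo_root or Path.home() / "cvmfs_repo").resolve()
    if Path(rel_dest).is_absolute():
        raise ValueError(f"rel_dest must be relative, got {rel_dest!r}")
    dest = (root / rel_dest).resolve()
    if dest != root and root not in dest.parents:
        raise ValueError(f"{dest} is outside {root}")
    # symlinks=True copies links as links instead of duplicating their targets.
    shutil.copytree(src, dest, symlinks=True, dirs_exist_ok=True)
    return dest
```

For example, `stage_into_cvmfs_repo("pmpm", "simonsobservatory/pmpm")` would place the contents of a local `pmpm` directory under `$HOME/cvmfs_repo/simonsobservatory/pmpm`.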

ickc commented 6 months ago

> Do not copy anything directly into any of the high level directories yourself. This can cause problems with the deployment scripts.

The problem is that, although this rule is mentioned somewhere, synchronization previously worked even when /cvmfs was used instead of /cvmfs-mirror (I have always copied directly to /cvmfs/northgrid.gridpp.ac.uk/simonsobservatory, and things were syncing). Also, the syncing problem does not seem to be caused by this. Judging from the email thread, it seems to be related to the volume of data to sync instead.

ickc commented 4 months ago

The issue has since been resolved by Jose.