Open tischi opened 3 years ago
@tischi : want to add me to this repo so I can assign myself?
Do you have any info on the sizes of all these volumes? I'm running a recursive s3 ls now but it's taking ages.
> I'm running a recursive s3 ls now but it's taking ages.
That's the issue with these millions of small files...doing anything with them but lazy loading chunks is not much fun. I will try to check....
...running now du -sh sbem-6dpf-1-whole-raw.n5
but that also takes time...
2.0T sbem-6dpf-1-whole-raw.n5
18G sbem-6dpf-1-whole-segmented-cells.n5
and the other one is much smaller.
Doh. My command was done after my 🏃 :
$ aws-embl-public s3 ls --summarize --human-readable --recursive s3://platybrowser/rawdata/sbem-6dpf-1-whole-raw.n5/
...
Total Objects: 4049381
Total Size: 2.0 TiB
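Those two numbers make the small-file problem concrete; a quick sketch of the average object size (figures taken from the listing above):

```python
# Average object size in the bucket: 2.0 TiB spread over ~4 million objects.
total_bytes = 2.0 * 1024**4       # 2.0 TiB
total_objects = 4049381
avg_kib = total_bytes / total_objects / 1024
print(round(avg_kib))             # ~530 KiB per object on average
```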
I am curious how long it will take to copy and zarrify this. Maybe would be interesting to time it for future reference.
The transfer is progressing extremely slowly. Do you have a small example dataset I could use to write a script and then you could transform locally?
Edit: actually, I'm now getting permission denied when I try to access s3.embl.de!
The myosin data set in the list above is small.
> Edit: actually, I'm now getting permission denied when I try to access s3.embl.de!
Interesting, this may be related to this: https://github.com/mobie/mobie/issues/18
No luck:
$ aws --endpoint-url=https://s3.embl.de --no-sign-request s3 ls s3://platybrowser/
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:618)
$ aws --no-verify-ssl --endpoint-url=https://s3.embl.de --no-sign-request s3 ls s3://platybrowser/
/usr/lib/fence-agents/bundled/botocore/vendored/requests/packages/urllib3/connectionpool.py:768: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
InsecureRequestWarning)
An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied.
I will write IT...
We also have the exact same data on a file system. Should I zip it and then provide you a download link? The alternative is that we zarrify it at EMBL, but then we would still need to zip and send it to you, I think...
Yeah, if you can provide me a small- to mid-size download, I'll get started on a script and/or docker you can run.
Just to keep track of what I do, in case I have to repeat it:
/home/tischer/.local/lib/aws/bin/aws
srun --pty -c 2 -t 60:00 --mem 16000 bash -l
cd /g/cba/tischer/software/
ln -s /home/tischer/.local/lib/aws/bin/aws aws
bash-4.2$ ./aws configure --profile tischi
AWS Access Key ID [None]: tischi
AWS Secret Access Key [None]: xyz
Default region name [None]:
Default output format [None]:
bash-4.2$ ./aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 ls s3://idr-upload/tischi/
2020-11-10 16:39:33 84 README.txt
./aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 cp --recursive /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/0.6.3/images/local/prospr-6dpf-1-whole-mhcl4.n5 s3://idr-upload/tischi/prospr-6dpf-1-whole-mhcl4.n5
note: it is important to add the root folder to the upload destination
@joshmoore
I uploaded one file: prospr-6dpf-1-whole-mhcl4.n5
Can you read it?
sbatch -c 2 -t 10:00:00 --mem 16000 -e /g/cba/tischer/tmp/err.txt -o /g/cba/tischer/tmp/out.txt /g/cba/tischer/software/aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 sync --quiet /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/1.0.1/images/local/sbem-6dpf-1-whole-segmented-cells.n5 s3://idr-upload/tischi/sbem-6dpf-1-whole-segmented-cells.n5
started...
sacct --format="JobID,State,CPUTime,MaxRSS"
I think it finished:
-bash-4.2$ sacct --format="JobID,State,CPUTime,MaxRSS"
JobID State CPUTime MaxRSS
------------ ---------- ---------- ----------
6315838 COMPLETED 06:23:16
6315838.bat+ COMPLETED 06:23:16 119908K
6315838.ext+ COMPLETED 06:23:24 352K
Took 6.5 hours, seems to have arrived.
-bash-4.2$ /g/cba/tischer/software/aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 ls s3://idr-upload/tischi/
PRE prospr-6dpf-1-whole-mhcl4.n5/
PRE sbem-6dpf-1-whole-segmented-cells.n5/
PRE test/
PRE test2/
PRE test3/
PRE test5/
2020-11-10 16:39:33 84 README.txt
2020-11-11 12:47:47 22 attributes.json
TODO:
sbatch -c 2 -t ???:00:00 --mem 16000 -e /g/cba/tischer/tmp/err.txt -o /g/cba/tischer/tmp/out.txt /g/cba/tischer/software/aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 sync --quiet /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5 s3://idr-upload/tischi/sbem-6dpf-1-whole-raw.n5
@joshmoore
Based on the above experiment, if I extrapolate how long it would take to upload the 3D EM raw volume using aws sync, I get:
2000 GB / 18 GB * 7 hours / 24 hours = 32 days
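The arithmetic behind that estimate as a short sketch (the 18 GB / ~7 h figures are the ones measured for the cells-segmentation sync above):

```python
# Extrapolate upload time from the measured sync throughput.
measured_gb, measured_hours = 18, 7   # sbem-6dpf-1-whole-segmented-cells.n5
target_gb = 2000                      # sbem-6dpf-1-whole-raw.n5

estimated_days = target_gb / measured_gb * measured_hours / 24
print(round(estimated_days))  # ~32 days at the same throughput
```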
Any thoughts?
@constantinpape
Do you know how long it took to copy /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5
onto our local S3 storage?
@martinschorb Do you know any tricks to speed up copying to an S3 object store? I think you looked into this a bit, didn't you?
One idea could be to start several copy processes, e.g., parallelising over the resolution layers:
-bash-4.2$ ls /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/
attributes.json s0 s1 s2 s3 s4 s5 s6 s7 s8 s9
I would think both our local file system and the receiving S3 storage on Josh's side should handle 10 parallel processes.
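As a sketch, the per-layer jobs could be generated like this (hypothetical: the snippet only builds the sbatch command strings, reusing the paths from this thread, and submits nothing):

```python
# One `aws s3 sync` sbatch job per resolution layer, so s0..s9 upload in parallel.
SRC = "/g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0"
DST = "s3://idr-upload/tischi/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0"

def sync_command(level):
    return ("sbatch -c 2 -t 100:00:00 --mem 16000 "
            "/g/cba/tischer/software/aws --profile tischi "
            "--endpoint-url=https://idr-ftp.openmicroscopy.org "
            f"s3 sync --quiet {SRC}/{level} {DST}/{level}")

commands = [sync_command(f"s{i}") for i in range(10)]
for cmd in commands:
    print(cmd)
```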
I found that it was much faster from a 3dcloud VM than from the cluster. But that could be specific to the network connectivity to the s3 machines.
> Do you know how long it took to copy /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5 onto our local S3 storage?
I think about a day. I used a cluster node (gpu6 or 7 probably).
@joshmoore @tischi
I am not sure if this is helpful, but I could also convert the data to zarr on the EMBL side.
> I am not sure if this is helpful, but I could also convert the data to zarr on the EMBL side.
I think this is very interesting indeed, but @joshmoore should comment, because I don't know whether he needs some specific zarr flavour.
> Any thoughts?
Not immediately, unless you want to also try tar'ing it up.
> I am not sure if this is helpful, but I could also convert the data to zarr on the EMBL side.
@constantinpape : if you want to kick off a n5-copy from n5 to zarr then I'll send you a script for the rest of the conversion. That being said, it would still be good to have the files on our servers for testing.
I think I'll just start it, resolution layer by resolution layer...
Edit: Sorry, I wrote this before Tischi's last comment. If you want to do it, Tischi, go ahead.
> if you want to kick off a n5-copy from n5 to zarr then I'll send you a script for the rest of the conversion. That being said, it would still be good to have the files on our servers for testing.
~~I would probably use a python script I have set up for this. If we want to do it on the embl side it would be best to test on one of the smaller volumes first, so I do the conversion then run the script from @joshmoore and we see if the result matches.~~
Let's start with myosin, I will convert it later.
sbatch -c 8 -t 100:00:00 --mem 16000 -e /g/cba/tischer/tmp/err_s9.txt -o /g/cba/tischer/tmp/out_s9.txt /g/cba/tischer/software/aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 sync --quiet /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s9 s3://idr-upload/tischi/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s9
This finished instantly... @constantinpape Could it be that this level is empty?
-bash-4.2$ ls /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s9/0/0/0
/g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s9/0/0/0
> I would probably use a python script I have set up for this.
:+1: for however it happens but the equivalent, yeah. :+1:
@joshmoore ok, let's see if we can get Tischi's conversion to run first and then have this as a fallback.
> This finished instantly... @constantinpape Could it be that this level is empty?
I can't log into VPN right now, will check later. (But I can tell you already that the data is probably very small at s9 ;))
@tischi s9 has exactly one chunk, which is 41kb, so I would expect it to copy almost immediately:
pape@gpu7:/g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s9$ ls -lh 0/0/0
-rw-r--r-- 1 pape kreshuk 41K 12. Feb 2020 0/0/0
sbatch -c 8 -t 100:00:00 --mem 16000 -e /g/cba/tischer/tmp/err_s3.txt -o /g/cba/tischer/tmp/out_s3.txt /g/cba/tischer/software/aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 sync --quiet /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s3 s3://idr-upload/tischi/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s3
...level s3 with sync took 1h 48min
// All levels above are done with sync, now proceeding with cp --recursive, maybe it's faster since it does not have to check? We can then add missing chunks with sync later, I guess.
sbatch -c 8 -t 100:00:00 --mem 16000 -e /g/cba/tischer/tmp/err_s2.txt -o /g/cba/tischer/tmp/out_s2.txt /g/cba/tischer/software/aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 cp --quiet --recursive /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s2 s3://idr-upload/tischi/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s2
sbatch -c 8 -t 100:00:00 --mem 16000 -e /g/cba/tischer/tmp/err_s1.txt -o /g/cba/tischer/tmp/out_s1.txt /g/cba/tischer/software/aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 cp --quiet --recursive /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s1 s3://idr-upload/tischi/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s1
sbatch -c 8 -t 100:00:00 --mem 16000 -e /g/cba/tischer/tmp/err_s0.txt -o /g/cba/tischer/tmp/out_s0.txt /g/cba/tischer/software/aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 cp --quiet --recursive /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s0 s3://idr-upload/tischi/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s0
Here's a quick script which looks to be working locally. I'm unsure if setups are always channels for this data and if there's ever more than one channel and/or setup.
$ ./convert.py prospr-6dpf-1-whole-nachr.zarr prospr-6dpf-1-whole-nachr.ome.zarr
$ ome_zarr info prospr-6dpf-1-whole-nachr.ome.zarr/
/opt/data/tischi/prospr-6dpf-1-whole-nachr.ome.zarr [zgroup]
- metadata
- Multiscales
- data
- (1, 1, 519, 471, 500)
- (1, 1, 260, 236, 250)
- (1, 1, 130, 118, 125)
- (1, 1, 65, 59, 63)
```python
#!/usr/bin/env python
# This assumes that n5-copy has already been used
import argparse

import zarr

parser = argparse.ArgumentParser()
parser.add_argument("input")
parser.add_argument("output")
ns = parser.parse_args()

zin = zarr.open(ns.input)

sizes = []


def groups(z):
    rv = sorted(list(z.groups()))
    assert rv
    assert not list(z.arrays())
    return rv


def arrays(z):
    rv = sorted(list(z.arrays()))
    assert rv
    assert not list(z.groups())
    return rv


setups = groups(zin)
assert len(setups) == 1  # TODO: multiple channels?
for sname, setup in setups:
    timepoints = groups(setup)
    for tname, timepoint in timepoints:
        resolutions = arrays(timepoint)
        for idx, rtuple in enumerate(resolutions):
            rname, resolution = rtuple
            try:
                # Check this timepoint against the sizes recorded so far.
                expected = sizes[idx]
                assert expected[0] == rname
                assert expected[1] == resolution.shape
                assert expected[2] == resolution.chunks
                assert expected[3] == resolution.dtype
            except IndexError:  # first timepoint: record the sizes
                sizes.append((rname,
                              resolution.shape,
                              resolution.chunks,
                              resolution.dtype))

datasets = []
out = zarr.open(ns.output, mode="w")
for idx, size in enumerate(sizes):
    name, shape, chunks, dtype = size
    shape = tuple([len(timepoints), len(setups)] + list(shape))
    chunks = tuple([1, 1] + list(chunks))
    a = out.create_dataset(name, shape=shape, chunks=chunks, dtype=dtype)
    datasets.append({"path": name})
    for sidx, stuple in enumerate(groups(zin)):
        for tidx, ttuple in enumerate(groups(stuple[1])):
            resolutions = arrays(ttuple[1])
            a[tidx, sidx, :, :, :] = resolutions[idx][1]

out.attrs["multiscales"] = [
    {
        "version": "0.1",
        "datasets": datasets,
    }
]
```
> I think I'll just start it, resolution layer by resolution layer...
> I'm unsure if setups are always channels for this data and if there's ever more than one channel and/or setup.
For now we always have a single setup, corresponding to a single channel.
I don't know if this is a problem in zarr.n5.N5Store or in the prospr-6dpf-1-whole-nachr.n5 data I've been looking at, but not having "n5": "2.0.0" in the intermediate groups leads to the exception:
In [43]: list(zarr.hierarchy.Group(store=zarr.n5.N5Store("/opt/data/tischi/prospr-6dpf-1-whole-nachr.n5")).groups())
...
ValueError: group not found at path 'setup0'
whereas if I edit the file I get:
In [44]: list(zarr.hierarchy.Group(store=zarr.n5.N5Store("/opt/data/tischi/prospr-6dpf-1-whole-nachr.n5")).groups())
Out[44]: [('setup0', <zarr.hierarchy.Group '/setup0'>)]
I have written these files with z5py, which for n5 only writes the version attribute to the root, as specified in https://github.com/saalfeldlab/n5#file-system-specification point 3.
Did this maybe change recently to be more in line with the zarr group metadata? (It shouldn't without changing major version because I think this would be a breaking change.)
Or is it just a bug in the zarr.n5store?
Anyway, for now we can fix it by adding the attributes until we find the underlying issue.
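A workaround sketch along those lines (assumptions: writing "n5": "2.0.0" into each group-level attributes.json satisfies the store; add_n5_version is a hypothetical helper; a dataset is recognised here by the presence of a "dimensions" entry):

```python
import json
import os

def add_n5_version(root, version="2.0.0"):
    """Add the n5 version attribute to every group-level attributes.json."""
    for dirpath, dirnames, filenames in os.walk(root):
        attrs_path = os.path.join(dirpath, "attributes.json")
        attrs = {}
        if os.path.exists(attrs_path):
            with open(attrs_path) as f:
                attrs = json.load(f)
        if "dimensions" in attrs:
            dirnames[:] = []  # a dataset: don't descend into its chunk dirs
            continue
        attrs["n5"] = version
        with open(attrs_path, "w") as f:
            json.dump(attrs, f)
```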
@joshmoore I had a closer look at the script you posted now, and I think we can do the same thing directly from our n5s and with much less copying around of the data. I implemented a script that should do this here: https://github.com/constantinpape/i2k-2020-s3-zarr-workshop/blob/main/data-conversion/to_ome_zarr.py
Note that I am using z5py to read the n5 datasets because of the issue with the group level attributes, otherwise one could also use zarr.
Also, I am storing the zarr array with a NestedDirectoryStore; I would really prefer if we can do that, otherwise the large datasets really overwhelm the FS. But if it's not supported yet we could also switch to the standard flat hierarchy for the chunks.
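For illustration, the difference between the two layouts comes down to the key a store derives from a chunk index (a minimal sketch, not the actual zarr implementation; n5 additionally reverses the axis order, which is ignored here):

```python
def flat_chunk_key(chunk_index):
    """Default zarr DirectoryStore layout: one file per chunk, e.g. '0.1.2'."""
    return ".".join(map(str, chunk_index))

def nested_chunk_key(chunk_index):
    """NestedDirectoryStore layout: nested directories, e.g. '0/1/2'."""
    return "/".join(map(str, chunk_index))

print(flat_chunk_key((0, 1, 2)))    # 0.1.2
print(nested_chunk_key((0, 1, 2)))  # 0/1/2
```

With the flat layout a large 3D volume puts every chunk file into a single directory; the nested layout spreads them over a directory tree, which file systems handle much better.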
> Or is it just a bug in the zarr.n5store?
Yes. https://github.com/zarr-developers/zarr-python/pull/651
> I implemented a script that should do this here:
:+1: I'll look more tomorrow.
> But if it's not supported yet we could also switch to the standard flat hierarchy for the chunks.
It previously wasn't on the zarr side, so in the ome-zarr spec it's prevented. I agree! I'd very much like to move to nested storage in the next version bump.
Ok, I updated it to support the flat chunk hierarchy.
@joshmoore Feels like the storage is very slow for some reason. Maybe because you copy from it? This only returns with a timeout for me:
aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 ls s3://idr-upload/tischi/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s0/
Does it work for you?
Ah, possibly. I've canceled my mirror command. Let me know if it looks to be faster.
...
...point0/s0/120/159/1: 159.27 GiB / 159.27 GiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 1.83 MiB/s 24h44m10s
real 1484m16.708s
user 23m28.571s
sys 23m16.119s
@joshmoore Still slow (see my mail for a theory) for me. Is it faster for you?
I think you are right that the lower paths are struggling under the number of subelements. Certainly listing the top .n5 works (→ setup0). Listing it on the server returns fine (50 elements).
@joshmoore
If you can, could you please let me know the result of an ls for these three folders?
sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s0/
sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s1/
sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s2/
From the results I hope to deduce what has been copied already such that I do not start the sync in more subfolders than necessary.
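One way to act on those listings (a hypothetical sketch: missing_chunks just compares key sets, and the example keys are made up):

```python
def missing_chunks(local_keys, remote_keys):
    """Chunk keys that exist locally but have not been uploaded yet."""
    return sorted(set(local_keys) - set(remote_keys))

# e.g. local s1/94 has three chunks, the bucket only has the first one:
local = ["94/0/0", "94/1/0", "94/2/0"]
remote = ["94/0/0"]
print(missing_chunks(local, remote))  # ['94/1/0', '94/2/0']
```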
@tischi Sure!
@joshmoore Using sync I am now getting this error:
bash-4.2$ /g/cba/tischer/software/aws --profile tischi --endpoint-url=https://idr-ftp.openmicroscopy.org s3 sync /g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s1/94 s3://idr-upload/tischi/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s1/94
fatal error: Read timeout on endpoint URL: "https://idr-ftp.openmicroscopy.org/idr-upload?list-type=2&prefix=tischi%2Fsbem-6dpf-1-whole-raw.n5%2Fsetup0%2Ftimepoint0%2Fs1%2F94%2F&encoding-type=url"
any ideas?
@joshmoore And using cp I am also getting an error:
upload failed: ../../g/arendt/EM_6dpf_segmentation/platy-browser-data/data/rawdata/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s1/94/100/0 to s3://idr-upload/tischi/sbem-6dpf-1-whole-raw.n5/setup0/timepoint0/s1/94/100/0 An error occurred (ServiceUnavailable) when calling the PutObject operation (reached max retries: 4): Please reduce your request rate.
Maybe the server is kind of down?
@joshmoore
All those XMLs point to n5s3 datasets: https://github.com/mobie/platybrowser-datasets/tree/master/data/1.0.1/images/remote
Within the XML you can see all the information needed to access the object in the bucket.
It would be cool to have those converted to zarr: