In principle nothing stops us from doing this. It would probably take someone who was enthusiastic about it, perhaps because they had credits or data on Azure.
Interestingly enough, Microsoft announced Kubernetes for Azure last fall (about a week before the AWS announcement). So it looks like everyone is on the Kubernetes bandwagon.
Maybe one of our European colleagues would be interested in applying for this Feb 15 opportunity: https://www.microsoft.com/en-us/research/academic-program/azure-research-ai-earth-european-union-oceans-award/
We applied to the AI for earth program in December and expect to hear back any day.
I just received word that we won the AI for Earth award from Microsoft. So we will have Azure credits very soon.
We will also have a few computer science interns at Columbia working on this.
Since Azure apparently supports Kubernetes, the main challenge will be to plug zarr into Azure blob storage (https://docs.microsoft.com/en-us/azure/storage/blobs/storage-python-how-to-use-blob-storage).
I think this means we need the equivalent of s3fs or gcsfs for Azure. Is any such package under development already?
https://github.com/Azure/azure-data-lake-store-python. I don't think it has a MutableMapping though; that would presumably need to be built.
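As a rough sketch of what such a MutableMapping could look like, mirroring the shape of the s3fs/gcsfs mappings (the AzureDLFileSystem method names used below - open, rm, exists, walk - are assumed from the azure-datalake-store package and not verified here):

# Rough sketch only, mirroring s3fs/gcsfs mapping.py; method names are assumptions.
from collections.abc import MutableMapping

class ADLMap(MutableMapping):
    """Expose an azure-datalake-store filesystem as a key/value store for zarr."""

    def __init__(self, adl, root):
        self.adl = adl                    # an AzureDLFileSystem instance
        self.root = root.rstrip('/')

    def _path(self, key):
        return self.root + '/' + key

    def __getitem__(self, key):
        try:
            with self.adl.open(self._path(key), 'rb') as f:
                return f.read()
        except OSError:
            raise KeyError(key)

    def __setitem__(self, key, value):
        with self.adl.open(self._path(key), 'wb') as f:
            f.write(value)

    def __delitem__(self, key):
        self.adl.rm(self._path(key))

    def __contains__(self, key):
        return self.adl.exists(self._path(key))

    def __iter__(self):
        prefix = self.root + '/'
        for path in self.adl.walk(self.root):
            yield path[len(prefix):]

    def __len__(self):
        return sum(1 for _ in self)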
Would building the MutableMapping be a suitable project for a computer science undergrad? Can you estimate the amount of work involved?
cc @tjcrone, who is very interested in this
Probably. Here are the other two implementations:
https://github.com/dask/s3fs/blob/master/s3fs/mapping.py
https://github.com/dask/gcsfs/blob/master/gcsfs/mapping.py
I would want to know the undergrad before making the estimate. This would also require getting the involvement of the Azure devs on that project to verify that they will accept this work. CC'ing @martindurant, who knows more here
For Azure, such a thing would fit naturally in this mini-repo: https://github.com/dask/dask-adlfs. Of course, the problem with Azure is setting up a testing service. The core implementation looks pretty similar to the other file-system back-ends.
I am a PI on the recently funded Azure grant with @rabernat, and would like to get Pangeo deployed on this platform for our intern program which we are just about to launch. I am very interested in making this happen and I'm willing to devote lots of time to it, but I would be grateful for community help as I get started. Advice on recommended first steps appreciated for sure.
@tjcrone, after sorting out an Azure machine to run on, I would start by making a mapping class for adlfs and submit it to https://github.com/dask/dask-adlfs (or its own repo, or try to submit upstream to adlfs). Testing with Azure will be tricky: you'll have to get to grips with vcrpy, or rely on testing only against live instances (i.e., not CI).
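For what it's worth, a vcrpy-based test might be structured roughly like this (ADLMap and make_adl_filesystem are placeholder names; the recorded cassette would need credentials scrubbed before committing):

# Hypothetical test layout: record Azure HTTP traffic once against a live
# account, then replay the cassette in CI. ADLMap and make_adl_filesystem are
# placeholder names, not real project code.
import vcr

my_vcr = vcr.VCR(
    cassette_library_dir='tests/cassettes',
    filter_headers=['authorization'],     # keep credentials out of the cassette
    record_mode='once',
)

@my_vcr.use_cassette('adl_mapping.yaml')
def test_roundtrip():
    adl = make_adl_filesystem()           # placeholder: build an AzureDLFileSystem
    m = ADLMap(adl, '/tmp/zarr-test')
    m['foo'] = b'bar'
    assert m['foo'] == b'bar'
    assert 'foo' in m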
@tjcrone , the reason is simple, the data-lake support already exists. I am certain a blob storage interface would be easy enough to create, especially if the SDK is complete enough, but we have no manpower or test resources to do this at the moment.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
I'd be interested in seeing Pangeo on Azure and would also add that I think the blob store is a better option than the datalake. I have some ability to help test things on my Azure instance.
Why blob rather than datalake? Given that the latter already has a library, and I'm not sure how well byte-range requests are supported in blob, I would go with the one we know.
For starters, datalake costs more than twice as much as blob store.
Datalake: "Optimized storage for big data analytics workloads", "Hierarchical file system", "No limits on account sizes, file sizes or number of files", seems like the obvious win.
I have a Pangeo deployment on Azure. Works great. In terms of interfacing with storage, I have a zarr fork with an abs_store function that works well with Azure blob. Other storage options work as well, including storage to GCP. Happy to answer any questions.
I guess blob is similar to S3 and GCS in that it is an object store. Azure datalake is trying to honor the traditional paradigm of hierarchical file systems to make you feel more like you're using an HPC parallel filesystem. It really wouldn't surprise me if datalake is backed by blob and is charging a premium for a file storage paradigm people are already comfortable with.
@tjcrone it would be great if the fork could be generalized into a python file system like S3FS and GCSFS. Zarr already has an abstract storage backend for dropping in new handlers. It would then be really interesting to compare performance between the two and see if it is worth the price.
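For context, zarr's storage abstraction is just the MutableMapping interface, so anything dict-like can be dropped in; purely as an illustration, a plain in-memory dict works as a store:

import numpy as np
import zarr

store = {}                                # any MutableMapping works as a zarr store
z = zarr.open_array(store=store, mode='w', shape=(100, 100), chunks=(10, 10), dtype='f4')
z[:] = np.random.random((100, 100))
print(sorted(store)[:3])                  # keys such as '.zarray', '0.0', '0.1', ...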
@jacobtomlinson, the critical thing we need to know is: can we access an arbitrary range of bytes within a blob? If not, then files can only be loaded in their entirety, which is not what any of the other FileSystems do, except for HTTPFileSystem. S3 and GCS also only simulate file hierarchies, as you say; I would expect that they are probably all identical in their backend implementations.
Blobs usually have HTTP endpoints and, in the case of Google Drive, can definitely handle byte ranges. I'm not sure whether S3 or Azure blob can handle byte ranges, but it looks like Azure might. Performance in this mode is obviously a question that would need to be explored. I have run into "retry" limits (at something like 6000 retries) when using byte ranges with Google Drive, because every range request is treated as an unsuccessful retry for the entire file.
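For what it's worth, a ranged read with the Azure SDK would look roughly like this (BlockBlobService and the start_range/end_range parameters are from the legacy azure-storage-blob package and should be double-checked against the installed version):

# Sketch of a ranged read from Azure Blob storage; names are from the legacy SDK.
from azure.storage.blob import BlockBlobService

bbs = BlockBlobService(account_name='<accountname>', account_key='<accountkey>')
blob = bbs.get_blob_to_bytes('<containername>', 'some/key/0.0',
                             start_range=0, end_range=1023)   # first 1 KiB only
print(len(blob.content))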
@jacobtomlinson, in my fork of zarr I implemented a MutableMapping that creates the ABSStore object for use with xarray.to_zarr, using the official Azure storage library. Minimal, but functional. Expanding this to an entire filesystem is an interesting idea, but is it worth doing since there is already a fairly capable storage library for Azure blob?
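For reference, writing through that mapping would look something like the following (container, prefix and credentials are placeholders; the ABSStore argument order follows the example further down the thread):

# Illustrative only: write an xarray Dataset straight to Azure Blob through the
# fork's ABSStore mapping. All names and credentials below are placeholders.
import xarray as xr
import zarr

ds = xr.Dataset({'t': (('x', 'y'), [[1.0, 2.0], [3.0, 4.0]])})
absstore = zarr.storage.ABSStore('mycontainer', 'demo.zarr',
                                 '<accountname>', '<accountkey>')
ds.to_zarr(store=absstore)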
It would be cool if this were referenced in dask.bytes.core.get_mapper() (which does not include adlfs, as it has no mapper); it would then become more generally available to people wishing to use zarr-on-Azure-Blob with dask arrays.
What is the right way to expose the existing Azure Blob Store MutableMapping? Should this be made generic, separate from Zarr and from Dask, as an external package?
I imagine it should be an external package. If it is usable with zarr already, it should be seamless with dask or xarray, since you can pass a mapper object directly. get_mapper would only need to import it for da.from_zarr if using a URL like abs:// (guessing what the right protocol name for Azure blob store might be).
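In other words, today you could already hand the mapping straight through, no URL machinery needed; something like the following (names and arguments are placeholders, and da.from_zarr accepting a raw mapping should be checked against the installed dask version):

# Hypothetical usage: pass the mapping directly to dask/xarray rather than going
# through a URL and get_mapper(). All names below are placeholders.
import dask.array as da
import xarray as xr
import zarr

store = zarr.storage.ABSStore('mycontainer', 'demo.zarr', '<accountname>', '<accountkey>')
arr = da.from_zarr(store, component='t')   # a single array from the group
ds = xr.open_zarr(store)                   # or the whole dataset via xarray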
FWIW I'd be happy with the azure blob store mapping class living in its own package or within the zarr.storage module. It's a relatively small amount of code, might feel like high overhead to create a package just for that.
I think it would be neat for it to be a separate package, partly just because it allows people to mentally compare it with S3FS, GCSFS, ADLFS, etc. I appreciate it is just the MutableMapping part for now, but it could be extended to cover the whole Python filesystem interface.
@martindurant I know that S3 definitely supports range gets (we use them in pysssix and I expect S3FS does too); I would be very surprised if blob doesn't.
Also @tjcrone (and everyone else): is your deployment listed here? If not, I definitely recommend making a PR here.
Ah I do indeed vaguely remember there being multiple types of blobs.
Finally found time to catch up on this thread. Pretty excited to see that @tjcrone has a deployment on Azure and has the Azure Blob MutableMapping code in his fork. I'll be seeing if I can reproduce an Azure deployment this week using the script.
@tjcrone You say this is minimal at the moment, so what is it lacking? If it can read and write zarr files from Azure blob then I am very happy.
There are a few functions that will raise a NotImplementedError, so you will see those. But yes, it does read and write zarr, just fine. An example of using this can be found near the bottom of this notebook. The credentials import is just a file that has my account name and key in it. I would make this a yaml file and yaml.load these if I did this today. It's important to note that this code has not been extensively tested, so please be aware of that. If you find issues or see ways of improving this code, please let me know. Pull requests welcome! I think we should think about getting this into upstream if we are confident that it is working.
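The yaml suggestion mentioned above would amount to something like this (the file name and key names are arbitrary choices, not anything from the fork or the notebook):

# Small illustration of the yaml-credentials idea; names below are arbitrary.
import yaml
import zarr

with open('azure_credentials.yml') as f:
    creds = yaml.safe_load(f)              # e.g. {'account_name': ..., 'account_key': ...}

absstore = zarr.storage.ABSStore('mycontainer', 'demo.zarr',
                                 creds['account_name'], creds['account_key'])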
@tjcrone I took your branch of zarr and managed to successfully load some zarr data from the blob store using a Standard D8s v3 (8 vcpus, 32 GB memory) instance running Ubuntu 16.04 with Python 3.5.2:
import xarray as xr
import zarr
absstore = zarr.storage.ABSStore('test', '2011_foo.zarr', 'zzzzzz', 'xxxxx')
ds = xr.open_zarr(absstore)
Takes approx 137 seconds.
The zarr file is an 80 GB collection of data combining one year's worth of data for 10 parameters. It was originally netCDF data.
ds
<xarray.Dataset>
Dimensions: (latitude: 241, level: 60, longitude: 480, time: 1462)
Coordinates:
* latitude (latitude) float32 90.0 89.25 88.5 87.75 87.0 86.25 85.5 ...
* level (level) int32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
* longitude (longitude) float32 0.0 0.75 1.5 2.25 3.0 3.75 4.5 5.25 6.0 ...
* time (time) datetime64[ns] 2011-10-10 2011-10-10T12:00:00 ...
Data variables:
ddrt01 (time, level, latitude, longitude) float32 dask.array<shape=(1462, 60, 241, 480), chunksize=(1, 60, 241, 480)>
ddrt02 (time, level, latitude, longitude) float32 dask.array<shape=(1462, 60, 241, 480), chunksize=(1, 60, 241, 480)>
ddrt03 (time, level, latitude, longitude) float32 dask.array<shape=(1462, 60, 241, 480), chunksize=(1, 60, 241, 480)>
ddrt04 (time, level, latitude, longitude) float32 dask.array<shape=(1462, 60, 241, 480), chunksize=(1, 60, 241, 480)>
ddrt05 (time, level, latitude, longitude) float32 dask.array<shape=(1462, 60, 241, 480), chunksize=(1, 60, 241, 480)>
ddrt06 (time, level, latitude, longitude) float32 dask.array<shape=(1462, 60, 241, 480), chunksize=(1, 60, 241, 480)>
ddrt11 (time, level, latitude, longitude) float32 dask.array<shape=(1462, 60, 241, 480), chunksize=(1, 60, 241, 480)>
kjhf (time, latitude, longitude) float32 dask.array<shape=(1462, 241, 480), chunksize=(1, 241, 480)>
h (time, level, latitude, longitude) float32 dask.array<shape=(1462, 60, 241, 480), chunksize=(1, 60, 241, 480)>
df3 (time, level, latitude, longitude) float32 dask.array<shape=(1462, 60, 241, 480), chunksize=(1, 60, 241, 480)>
t (time, level, latitude, longitude) float32 dask.array<shape=(1462, 60, 241, 480), chunksize=(1, 60, 241, 480)>
If my understanding of how zarr works is correct, when we call open_zarr() we only read the .zarray and .zattrs files, which are tiny. As such I am wondering why it takes nearly 2 minutes to open.
Info about the zarr directory:
ls -alFh * | grep .zattrs | wc -l
15
ls -alFh * | grep .zarray | wc -l
15
ls -alFh * | grep 0.0 | wc -l
16082
In the profiling output (using cProfile) I see the largest contributor of total time is: ~:0(<method 'read' of '_ssl._SSLSocket' objects>)
I think this results from the large number of __contains__ calls that the ABSStore MutableMapping is making. See the stack trace from snakeviz:
28. ~:0(<method 'read' of '_ssl._SSLSocket' objects>)
27. ssl.py:568(read)
26. ssl.py:783(read)
25. ssl.py:918(recv_into)
24. socket.py:561(readinto)
23. ~:0(<method 'readline' of '_io.BufferedReader' objects>)
22. client.py:257(_read_status)
21. client.py:290(begin)
20. client.py:1153(getresponse)
19. connectionpool.py:319(_make_request)
18. connectionpool.py:446(urlopen)
17. adapters.py:393(send)
16. sessions.py:593(send)
15. sessions.py:445(request)
14. httpclient.py:68(perform_request)
13. storageclient.py:213(_perform_request)
12. baseblobservice.py:1596(exists)
11. storage.py:2135(__contains__)
10. storage.py:67(contains_array)
9. hierarchy.py:435(arrays)
8. zarr.py:269(<genexpr>)
7. utils.py:325(FrozenOrderedDict)
6. zarr.py:268(get_variables)
5. common.py:202(load)
4. conventions.py:380(decode_cf)
3. zarr.py:417(maybe_decode_store)
2. zarr.py:343(open_zarr)
1. <string>:1(<module>)
0. ~:0(<built-in method builtins.exec>)
ncalls | tottime | percall | cumtime | percall | filename:lineno(function)
16348 | 0.05889 | 3.602e-06 | 105.4 | 0.006448 | storage.py:2135(__contains__)
It seems from your comment here that you have also noticed this behaviour? Is there anything we can do to improve it?
Any comments appreciated.
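(For reference, a profile like the one above can be collected with cProfile and inspected in snakeviz, roughly as follows; the store arguments are placeholders.)

# Rough reconstruction of how such a profile could be produced; placeholders only.
import cProfile
import xarray as xr
import zarr

absstore = zarr.storage.ABSStore('test', '2011_foo.zarr', '<accountname>', '<accountkey>')
cProfile.run('xr.open_zarr(absstore)', 'open_zarr.prof')
# then, from a shell:  snakeviz open_zarr.prof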
This is the same issue as we have seen for gcsfs-based zarr, and occurs because xarray traverses all levels of the file hierarchy to read the .z* files at every level. See the discussion in https://github.com/zarr-developers/zarr/pull/268 - it should be possible to put all of the metadata into a single location and not have to do __contains__ calls, at least in the case of write-once-read-many datasets.
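As a very rough sketch of that idea (the '.zmetadata' key name and the single-document layout are made up here for illustration; zarr-developers/zarr#268 is where the real design is being worked out):

# Illustrative sketch only: gather every .zgroup/.zarray/.zattrs entry into one
# JSON document so a reader needs a single GET instead of many per-key calls.
import json

def consolidate_metadata(store):
    meta_suffixes = ('.zgroup', '.zarray', '.zattrs')
    combined = {key: store[key].decode('utf-8')
                for key in list(store)
                if key.split('/')[-1] in meta_suffixes}
    store['.zmetadata'] = json.dumps(combined).encode('utf-8')
    return combined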
Could we use the built-in Azure Blob list_blobs() API to build up a list in memory via a single API call and use this in the __contains__ call?
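Sketched out, that caching idea might look roughly like this (the attribute names used here - blob_service, container - are guesses at the fork's internals, and it assumes store keys map one-to-one onto blob names):

# Rough illustration of the caching idea: pre-list the container once and answer
# __contains__ from an in-memory set rather than one exists() call per key.
class CachedABSStore(ABSStore):            # ABSStore from the zarr fork

    def _load_keys(self):
        # a single (paginated) list_blobs() pass instead of per-key exists() calls
        self._keys = {b.name for b in self.blob_service.list_blobs(self.container)}

    def __contains__(self, key):
        if not hasattr(self, '_keys'):
            self._load_keys()
        return key in self._keys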
I haven't looked at the blob code; I'm sure that's possible, but you would still need to list all the files at some point, and would still need to open and read every one of the small files. With HTTPS, there is a minimum cost to each call of ~100 ms for Google machines talking to Google storage - I imagine Azure is similar.
In my case, while I have many small files in the zarr "file" on blob store, only 15 or so of them are .z* files. At file-open time, my understanding is that it only needs to read the .z* files, not all the files - or am I wrong? I thought it only reads the compressed chunks when you actually start to use the data in the arrays.
Correct, I meant the small .z* metadata files. However, your data files should not be too small either - since you are listing files here, the total number of all files might be important.
Listing all the blobs using list_blobs() would be a handful of API calls, depending on the batch size. Something like:
from azure import *
from azure.storage import *

blob_service = BlobService(account_name='<accountname>', account_key='<accountkey>')

# page through the container; each call returns up to maxresults blobs plus a
# continuation marker for the next batch
next_marker = None
while True:
    blobs = blob_service.list_blobs('<containername>', maxresults=100, marker=next_marker)
    next_marker = blobs.next_marker
    print(next_marker)
    print(len(blobs))
    if not next_marker:
        break
print("done")
Compared to the 16K calls it is making now.
@dazzag24, your memory idea is an interesting work-around, and might work. However I think it would be much better to have a solution that lives higher up the chain because there really should be a better approach than looking for a .zarray inside every blob. However, solving this inside the MutableMapping might get around many of the issues being discussed in zarr-developers/zarr#268. This would presumably only work if you did not change the store, which is also not ideal.
BTW @tjcrone We've found some bugs in getsize() as well. We are working on a fix. Pull request to follow.
Collecting all metadata into a single file is going to be necessary for large datasets anyway. This seems to me like the right path forward long-term. Having metadata scattered throughout many small files in a large dataset won't scale well.
@martindurant has done this process manually for some particular zarr files and it seems to have solved the problems that you're running into. It seems like in https://github.com/zarr-developers/zarr/pull/268 he and other Zarr devs are working to solidify this into a workable process for users more generally. I recommend that people suffering from metadata overheads check out that issue and weigh in if they think it's not going in the right direction.
@mrocklin @martindurant The combined metadata approach is a good solution. When is this likely to be in a released version?
@tjcrone Do you know the likelihood of your ABSStore MutableMapping getting pushed upstream, and the likely timescales? I intend to do more testing with further datasets, but wanted to see what performance improvements we can make to open_zarr() first.
This discussion is probably best continued in zarr-developers/zarr#268, but one idea would be to have the MutableMapping return a metadata structure that it constructs using whatever tools it has available to it, which in the case of ABSStore and @rabernat's GCSStore could be done quite efficiently with perhaps just one API call, so that zarr can use this metadata for the entirety of a read operation.
There was some discussion of the ABSStore in https://github.com/zarr-developers/zarr/issues/255, and I think @alimanfoo is quite amenable to folding it in. I just think it requires more testing and some documentation. But my guess is that the timeline could be quite short if we push on it hard.
Very happy to move forward on the consolidated/combined metadata feature, I think we're converging on a way forward.
Also happy to have ABSStore come into zarr if that would be the most convenient way to go, if someone is happy to come on board as a zarr core dev to maintain it.
In addition to the above, I do wonder if all of those contains() calls are necessary. Would be worth looking at that separately I think.
Also if there are ways of using the Azure API to optimise, worth pursuing IMO.
One other thing to mention, zarr does support a feature where metadata and chunks are stored separately and accessed via separate MutableMappings. E.g., chunks and metadata could be within the same bucket but under different paths. This can provide a way to avoid listing all the chunks when all you want is the metadata. May not be the most convenient option but worth having in the back of the mind when exploring technical ideas.
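A minimal sketch of that separate-stores feature, using plain dicts in place of what could be two ABSStore instances under different prefixes in the same container:

# Metadata in one mapping, chunks in another, via zarr's chunk_store argument.
import numpy as np
import zarr

meta_store, chunk_store = {}, {}
grp = zarr.group(store=meta_store, chunk_store=chunk_store, overwrite=True)
arr = grp.create_dataset('t', shape=(100,), chunks=(10,), dtype='f4')
arr[:] = np.arange(100, dtype='f4')

print(sorted(meta_store))                  # only .zgroup/.zarray-style keys
print(len(chunk_store))                    # chunk data lives in the other mapping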
@alimanfoo and @mrocklin, do you think it would be a terrible idea, at this stage, to have zarr ask the MutableMapping for metadata and provide it in some way so that zarr wouldn't have to hit every blob with __contains__(), and then perhaps at a later date, if it is seen as necessary, alter the zarr structure to contain metadata which the MutableMappings could draw from? For MutableMappings that cannot provide this metadata, zarr would fall back to the old methods. This solution seems like an easier way forward, requiring no file-format changes initially, but it does not preclude them in the future.
The reason I am suggesting this is because I think that the MutableMappings that use the official APIs could generate metadata quickly and efficiently, even for large stores. For local stores this is likely the case as well.
Is anyone thinking about deploying the pangeo framework on Azure?
Seems it would be cool because we could actually do some simulations there also: https://blogs.msdn.microsoft.com/azure_4_research/2017/08/01/cloud-computing-guide-for-researchers-real-hpc-with-linear-scaling-on-thousands-of-cores/ and then we would have the whole simulation modeling workflow on the Cloud.
(BTW, @rabernat told me about this at the ESIP Winter Meeting. I thought there were no cloud providers who had InfiniBand.)