tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

Migration of modular file system plugins in tensorflow (s3/hdfs/gcs) to tensorflow-io #1183

Open yongtang opened 3 years ago

yongtang commented 3 years ago

As discussed in our September monthly meeting, we are looking into migrating the modular file system plugins in tensorflow (s3/hdfs/gcs) to the tensorflow-io package.

Once the migration is done, the size and build time of the tensorflow repo could be substantially reduced. This would greatly improve the experience of external tensorflow contributors, as they could selectively build only the components they need (e.g., tensorflow core, tensorflow-io, tensorflow-addons, etc.).

Currently it takes at least 8 hours (often more) for many external tensorflow contributors to build the complete tensorflow package, due to the lack of a bazel cache, which is only available internally at Google.

Now that we already have a couple of modular file systems in tensorflow-io (the azure blob file system and the http file system, see #1111), it might be time to start looking into migrating the plugins in tensorflow to tensorflow-io.

/cc @mihaimaruseac @vnvo2409

Item list:

Follow up:

vnghia commented 3 years ago

Let me check if there are any incompatibilities between the core implementation and the plugins.

There are some points I would like to clarify.


Documentation: There are some TF issues related to the behavior of the environment variables:

Up till now, I haven't seen much documentation about those filesystems. I would like to have a working example and documentation for the other environment variables. Where should I open a PR?
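
As an illustration of the kind of documentation being asked for (this is a hedged sketch, not authoritative plugin documentation; the exact variable names the plugin reads, and the `localhost:4566` localstack-style endpoint, are assumptions that should be checked against the plugin source):

```python
import os

# Hypothetical configuration sketch: environment variables an s3
# filesystem plugin test setup commonly exports. These must be set
# before the filesystem is first used.
s3_env = {
    "AWS_ACCESS_KEY_ID": "test",       # dummy credentials for an emulator
    "AWS_SECRET_ACCESS_KEY": "test",
    "S3_ENDPOINT": "localhost:4566",   # point the plugin at a local emulator
    "S3_USE_HTTPS": "0",               # plain http for local testing
    "S3_VERIFY_SSL": "0",
}
os.environ.update(s3_env)

print(os.environ["S3_ENDPOINT"])
```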


Issues: Could we transfer the issues related to those filesystems from the TF repo to this repo?


Testing: What should we do about testing? I think we could set up s3 and gcs emulators in GitHub Actions. Another possibility is tagging those tests as manual and running them locally.
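
The "tag tests as manual" option could look roughly like this, sketched with stdlib `unittest` (the repo actually uses pytest, and `S3_ENDPOINT` is an assumed gating variable here):

```python
import os
import unittest

# Sketch: the test runs only when an emulator endpoint is exported,
# and is skipped otherwise (e.g., in CI without an emulator).
class S3FileSystemTest(unittest.TestCase):
    @unittest.skipUnless(os.environ.get("S3_ENDPOINT"),
                         "no s3 emulator configured")
    def test_read_file(self):
        # The real test would read an object through the filesystem plugin.
        self.assertTrue(True)

if __name__ == "__main__":
    suite = unittest.TestLoader().loadTestsFromTestCase(S3FileSystemTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
```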

yongtang commented 3 years ago

Thanks @vnvo2409 !

For testing: I have created PR #1184, which sets up the AWS Localstack emulator in GitHub Actions. See https://github.com/tensorflow/io/blob/master/tests/test_aws/aws_test.sh for the localstack emulator setup.

The test is working for the Kinesis dataset in tensorflow-io. I configured the localstack emulator to have s3, and a bare-bones stub test has been created as a starting point. Please take a look (https://github.com/tensorflow/io/blob/master/tests/test_s3_eager.py).

I will take a look at gcs emulator and see if a similar setup could be done as well.

For issue transfer: I will take a look to see if issues can be transferred here.

For documentation: At the moment the documentation source is inside the docs directory of this repo (https://github.com/tensorflow/io/tree/master/docs). The PRs are reviewed by the docs team. PR #1172 is an example.

Once a PR is merged, the content is pushed to the tensorflow.org website (see https://www.tensorflow.org/io/overview for an example).

vnghia commented 3 years ago

For testing: I have created a PR #1184 which setup the AWS Localstack emulator in GitHub Actions.

Thank you!

I will take a look at gcs emulator and see if a similar setup could be done as well.

I think we could use the emulator from google-cloud-cpp, which is the library underlying the gcs plugin. The emulator isn't done yet (it should be ready around mid-November). It is written entirely in python, so setup should be quite easy.

I am a little busy working on the emulator for google-cloud-cpp; I will take over this migration when I finish that work. Is there any timeline for it?

yongtang commented 3 years ago

@vnvo2409 We don't have a hard timeline for it, though we are aiming for before the TF 2.5 release. Since 2.4 is not released yet (likely in several weeks), the TF 2.5 release will be around early next year. I will try to take a look and see if I can help migrate a plugin or so, and may ask you for review 😄 .

vnghia commented 3 years ago

the TF 2.5 release will be about early next year

Early next year should be enough for me.

I will try to take a look and see if I can help migrating a plugin or so, and may ask you for review 😄 .

Thank you very much 😄

mihaimaruseac commented 3 years ago

I could help with moving issues.

Thank you for making the tracking issue to move the plugins.

yongtang commented 3 years ago

@mihaimaruseac One issue I noticed is that at the moment there is no way to unload the existing file systems embedded in TF. For example, if I try to register a shared object plugin for S3, the following error is thrown because an s3 file system is already registered:

/Library/Python/3.8/site-packages/tensorflow/python/framework/load_library.py:178: in register_filesystem_plugin
    py_tf.TF_RegisterFilesystemPlugin(plugin_location)
E   tensorflow.python.framework.errors_impl.AlreadyExistsError: File system for s3 already registered

It looks like the error arises from the code in https://github.com/tensorflow/tensorflow/blob/dc3099c444d294b39cd79fe1d1a4bff59a0c6180/tensorflow/core/platform/env.cc#L82-L90

I am wondering if it makes sense to allow plugin registration to override existing statically built file systems (s3/gcs/hdfs)? I think that would give users more flexibility. (We can lock the local and unix file systems so they are never overridden.)
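
The override policy proposed above could be shaped roughly like this. This is a standalone sketch of the registration logic, not TF's actual C++ registry; the class and error names mimic TF's for illustration only:

```python
class AlreadyExistsError(Exception):
    pass

# Schemes that must never be replaced by a plugin, per the proposal above.
LOCKED_SCHEMES = {"file", "ram"}

class FileSystemRegistry:
    """Sketch of a registry where plugins may override statically built
    filesystems, except for a locked set of local schemes."""

    def __init__(self):
        self._filesystems = {}

    def register(self, scheme, filesystem, allow_override=False):
        if scheme in self._filesystems:
            if scheme in LOCKED_SCHEMES or not allow_override:
                raise AlreadyExistsError(
                    f"File system for {scheme} already registered")
        self._filesystems[scheme] = filesystem

registry = FileSystemRegistry()
registry.register("s3", "static-built s3")                 # built into the binary
registry.register("s3", "plugin s3", allow_override=True)  # plugin replaces it
print(registry._filesystems["s3"])
```

With this policy, a plugin can displace the built-in s3 filesystem, while attempting to override a locked scheme such as `file` still raises the same error.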

mihaimaruseac commented 3 years ago

We should comment out the TF_REGISTER_FILESYSTEM lines in the old code for every filesystem that is integrally moved to plugins. In fact, we can also delete the non-modular implementation.

Allowing filesystem registrations to be overridden would be risky.

kvignesh1420 commented 3 years ago

@vnvo2409 @mihaimaruseac what is the current approach for testing hdfs plugins in tensorflow?

vnghia commented 3 years ago

AFAIK, we could run hdfs locally in pseudo-distributed mode, and that should be enough for testing.

kvignesh1420 commented 3 years ago

How about hdfs on docker for automated testing in GitHub Actions?

vnghia commented 3 years ago

I found this tutorial. I think it should work. I will try it and report back to you later.

kvignesh1420 commented 3 years ago

Okay. Thanks!

vnghia commented 3 years ago

I would like to bring up the directory behavior of the cloud filesystems. To me, there's no point forcing posix behavior on a cloud filesystem, and it causes some misunderstandings for users (https://github.com/tensorflow/tensorflow/issues/44884). If we cannot change this behavior, I think we should be more transparent with TF's users (a warning or info note that directories in a cloud filesystem might not work as intended, because the way those filesystems store data is different, plus a link).

yongtang commented 3 years ago

@vnvo2409 I think overall the file systems in tensorflow are not truly file systems that could completely replace standard file systems. Rather, they serve two purposes:

  1. The file system (s3/gs) could be used in TF's C++ kernel for reading file content (e.g., tf.io.read_file)
  2. The file system (s3/gs) could be used for saving and loading models. So in addition to reading file content, it also needs the ability to get files from a model directory. From that standpoint, standard directory creation/deletion is not truly needed (only partially needed, as long as the files under the model directory can be created). In other words, I think better documentation is needed, but there is really no need to implement a feature-parity file system that could replace a standard file system (or replace the vendor-provided python APIs for gcs or s3 in boto3 or google.cloud).

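
A standalone illustration of why directory semantics on an object store are only partial: objects live under flat string keys, and a "directory" is just a shared prefix. This mimics s3/gcs-style stores generically, not any TF code:

```python
# Minimal object-store model: a flat mapping from string keys to bytes.
# "Directories" exist only as shared key prefixes, which is why posix-style
# directory operations map awkwardly onto s3/gcs-style stores.
store = {
    "models/v1/saved_model.pb": b"...",
    "models/v1/variables/variables.index": b"...",
}

def list_dir(prefix):
    """Emulated directory listing: every key under the prefix."""
    prefix = prefix.rstrip("/") + "/"
    return sorted(k for k in store if k.startswith(prefix))

def is_dir(path):
    """An object store has no real directory entries; a 'directory'
    exists only if some object key lives under it."""
    return bool(list_dir(path))

print(list_dir("models/v1"))
print(is_dir("models/v2"))  # no keys under this prefix
```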
vnghia commented 3 years ago

@yongtang Thank you for your reply ! I will also add it to my PR for the documentation.

vnghia commented 3 years ago

@yongtang Could you sum up what you are doing, so our changes don't overlap with each other? Thank you.

yongtang commented 3 years ago

@vnvo2409 You can see the list of items that have not been done yet (https://github.com/tensorflow/io/issues/1183#issue-738561022):

Completed:

Not started:

vnghia commented 3 years ago

@yongtang I could not see any BUILD file for the tests directory. Do we have to invoke the tests directly with python, or am I missing something?

kvignesh1420 commented 3 years ago

@vnvo2409 we use pytest for testing the python-based APIs. For example:

TFIO_DATAPATH=bazel-bin python3 -m pytest -s -v tests/test_serialization_eager.py

vnghia commented 3 years ago

[S3 file system] Enable logging through TensorFlow C API

I think this is already done via a0a4dc7ac7ecdfdc387a9709d5b923a4958c5096

kvignesh1420 commented 3 years ago

@vnvo2409 I think so. Which work item are you planning to start with?

vnghia commented 3 years ago

I am working on the s3 test. I think it will be done by tomorrow.

yongtang commented 3 years ago

Thanks @vnvo2409 for the update 👍

vnghia commented 3 years ago

I found a filesystem named gsmemcachedfs. I think it has somewhat the same idea as the ram_file_block_cache.h in the gcs plugin. In addition, it will stop working when we move the gcs filesystem out of tensorflow, because it includes some headers from TensorFlow. https://github.com/tensorflow/io/blob/0c9556ab981426a4a6668fee1ceedb2357241d49/tensorflow_io/core/kernels/gsmemcachedfs/memcached_file_system.h#L24

Should we do something about this filesystem?

yongtang commented 3 years ago

@vnvo2409 Yes I think gsmemcachedfs will need to be ported to C modular file system API. /cc @marioecd

vnghia commented 3 years ago

@yongtang @marioecd I think we should merge the library inside the gcs plugin and gsmemcachedfs into one ram_file_block_cache.h and so on. The gcs ram_file_block_cache.h is missing things like inheritance, because I was not aware that there was another filesystem depending on it.

I could work on the migration, but I might need some instructions for testing and the general characteristics of that filesystem. WDYT?
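
The refactor suggested here (one shared block cache that both the gcs plugin and gsmemcachedfs specialize) could be shaped like this. The real code is C++ in ram_file_block_cache.h; this is a python sketch of just the inheritance point, with hypothetical class names and a dict standing in for a memcached client:

```python
class FileBlockCache:
    """Sketch of a shared block cache base: callers read through the
    cache, subclasses decide where fetched blocks actually live."""

    def __init__(self, block_size, fetcher):
        self.block_size = block_size
        self.fetcher = fetcher  # callable: (filename, offset, n) -> bytes
        self._blocks = {}

    def read(self, filename, offset, n):
        # Assemble the requested range from fixed-size cached blocks.
        start = (offset // self.block_size) * self.block_size
        data = b""
        while start < offset + n:
            data += self._block(filename, start)
            start += self.block_size
        skip = offset % self.block_size
        return data[skip:skip + n]

    def _block(self, filename, start):
        key = (filename, start)
        if key not in self._blocks:
            self._blocks[key] = self.lookup_or_fetch(filename, start)
        return self._blocks[key]

    # The extension point gsmemcachedfs would override, e.g. to consult
    # a memcached tier before falling back to the real fetch.
    def lookup_or_fetch(self, filename, start):
        return self.fetcher(filename, start, self.block_size)


class MemcachedBlockCache(FileBlockCache):
    def __init__(self, block_size, fetcher, memcached):
        super().__init__(block_size, fetcher)
        self.memcached = memcached  # stand-in for a memcached client

    def lookup_or_fetch(self, filename, start):
        key = f"{filename}:{start}"
        if key in self.memcached:
            return self.memcached[key]
        block = super().lookup_or_fetch(filename, start)
        self.memcached[key] = block
        return block


# Usage: a fake fetcher standing in for a gcs read.
blob = bytes(range(256))
fetch = lambda name, off, n: blob[off:off + n]
cache = MemcachedBlockCache(block_size=64, fetcher=fetch, memcached={})
print(cache.read("gs://bucket/obj", 60, 8))
```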

vaibhavthapli commented 2 years ago

@mihaimaruseac One issue I noticed is that at the moment there is no way to unload the existing file systems embedded in TF. [...]

I got the same error

Sanjaygowda66 commented 2 years ago

@mihaimaruseac One issue I noticed is that at the moment there is no way to unload the existing file systems embedded in TF. [...]

Getting the same error tensorflow.python.framework.errors_impl.AlreadyExistsError: File system for s3 already registered, and also AlreadyExistsError: File system for az already registered.

wansj commented 2 years ago

@mihaimaruseac One issue I noticed is that at the moment there is no way to unload the existing file systems embedded in TF. [...]

I got this error when I tried to import the tensorflow_io module. The versions of tensorflow and tensorflow_io are 2.6 and 0.20.0 respectively. Any solutions?
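
For readers hitting this error: it usually means both TF and tensorflow-io try to register a filesystem for the same scheme, and the primary fix is using tensorflow and tensorflow-io versions that are compatible with each other. As a defensive sketch (not an official API), registration can also be guarded like this; `register_fn` is a stand-in for whatever registration call your code makes:

```python
def register_plugin_ignoring_duplicates(register_fn, plugin_path):
    """Call a filesystem-plugin registration function, treating an
    'already registered' failure as non-fatal. register_fn is a
    stand-in for the real call; any other error still propagates."""
    try:
        register_fn(plugin_path)
        return True
    except Exception as e:
        if "already registered" in str(e):
            return False  # the static filesystem won the race; keep going
        raise

# Usage with a fake registration function standing in for the real one.
registered = set()

def fake_register(path):
    if path in registered:
        raise RuntimeError(f"File system for {path} already registered")
    registered.add(path)

print(register_plugin_ignoring_duplicates(fake_register, "s3"))
print(register_plugin_ignoring_duplicates(fake_register, "s3"))
```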