pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

[Proposal] Data reading framework for PyTorch (Hive, MySQL, S3 etc.) #20822

Open pritamdamania87 opened 5 years ago

pritamdamania87 commented 5 years ago

At Facebook we are building a data reading framework for PyTorch which can efficiently read from data stores like Hive, MySQL, our internal blob store, and any other tabular data sources. The framework allows for specifying complex input pipelines to read from different sources. For example, if you have a table which stores handles for images, you can write SQL-like code to read from the table, apply filters to select certain handles, and then retrieve those handles from another data source with a few lines of code.

In addition to this, the framework supports running user-defined transforms, which can be either pure Python (e.g. torchvision.transforms) or TorchScript code. The framework can also be used with the torch.distributed package to distribute the data across multiple nodes for training. The input pipeline that the user specifies can be defined once, serialized as a plan, and run on multiple remote machines if required.

The framework builds upon the OSS DataLoader and Dataset framework. In particular, it uses IterableDataset to provide a stream-based interface for data retrieved from input pipelines.

Sample code to illustrate what reading and pre-processing images would look like:

# Hive table has columns handle and partition_no. The partition column 
# for the table is partition_no
df = data.data_warehouse("mynamespace", "mytable")

# Filter to partition the data across multiple workers.
# worker_info is assumed to come from torch.utils.data.get_worker_info()
partition_filter = "hash(partition_no) % {0} = {1}".format(worker_info.num_workers, worker_info.id)
df = df.filter(partition_filter)

# Fetch the handle from a blobstore
df = df.map(["fetch_handle(handle) as img"])

# Rebatch the data
df = df.rebatch(batch_size=16)

# transform_image is a user supplied function to run image transforms.
ds = MyDataset(df=df, transforms=transform_image)
dl = torch.utils.data.DataLoader(ds)

for batch in dl:
    pass

We are evaluating whether it makes sense to open source this framework. For OSS users, it might be useful for training jobs which store large amounts of data in Hive or S3 (e.g. images). That said, we would love to hear from the community whether this would be useful, and about use cases that might benefit from a framework like this.

@dzhulgakov @jspisak @aartibasant @apaszke @SsnL

Kaixhin commented 5 years ago

All sounds useful. Even for single machines, providing an API to help read tabular data from CSV files or small databases would (a) be very useful for data already in those forms and (b) make it more attractive to store data in those forms rather than writing more ad-hoc solutions.
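For context, a minimal sketch of what such single-machine CSV support might look like today with pandas and a map-style Dataset (the file layout and column names here are hypothetical):

import pandas as pd
import torch
from torch.utils.data import Dataset

class CsvDataset(Dataset):
    """Map-style dataset over a CSV file loaded fully into memory."""
    def __init__(self, path, label_col="label"):
        frame = pd.read_csv(path)
        self.labels = torch.as_tensor(frame[label_col].to_numpy())
        self.features = torch.as_tensor(
            frame.drop(columns=[label_col]).to_numpy(), dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]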

pritamdamania87 commented 5 years ago

@Kaixhin Thanks a lot for the feedback! Do you know what folks do currently to read data from CSV files/databases? Also, what are some examples of ad-hoc solutions that have been employed in the past?

Kaixhin commented 5 years ago

I assume many people write their own data loaders using Python libraries, but in terms of PyTorch-specific support, probably the most popular library out there for this is fastai. Ad-hoc storage might be saving items as torch Tensors or lists/dicts of these (pretty convenient, but not a portable solution).

jaliyae commented 5 years ago

We added chunk-based data loading support to libtorch: https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/include/torch/data/datasets/chunk.h

We used a high-level Python binding to expose this to Python, keeping the entire data loading pipeline in C++. Perhaps something to consider.

bencherian commented 5 years ago

@jaliyae Do you have any documentation on how to use this chunk-based data loading feature? Is the Python binding you are referring to already in PyTorch, or an internal implementation you have created for use with PyTorch?

Also, with regard to the original topic, a generic method of loading blobs from S3 would definitely be nice to have.

jaliyae commented 5 years ago

@bencherian If you look at the CanAccessChunkSamplerWithChunkDataSet test in this file, it provides a good example. We don't yet have complete documentation, and we are hoping to get the Python binding into PyTorch as well. What we did was pybind a method like get_data_loader() which produces an iterator.

ezyang commented 5 years ago

@cpuhrsch says: we really don't have anything about data reading in PyTorch at all.

Microsheep commented 5 years ago

I think this will be very useful for many use-cases!

We have several research projects in the lab that read data from SQL databases. We currently use custom-written dataset classes to read the data using sqlalchemy, and it is not easy to get everything right and efficient. (Wanting multiple processes to do the data loading makes things even more complicated.)

Let me paint the scenario we are currently in. We have 100K-500K+ data instances with 60K+ floating-point values each (not images). For each data instance, we have 10+ columns of metadata that we might want to use. When constructing a dataset, we might want to limit the data read by some metadata constraint (e.g., a training/testing split with constraints such as stratification for the training process, or filtering out clean data labeled using taint_codes). The data instance count grows over time as we get more and more data.

One way to store this data is as 100K separate files plus a separate CSV for the metadata. Another is a single SQLite database with the floating-point numbers stored in a binary column and the metadata in other columns. The latter is much cleaner and better for long-term maintenance. Storing the data in a binary column might not be a bad idea compared to lots of separate files, because the dataset evolves over time: new data and data fixes are constantly happening. When we want to update the training dataset, we simply update the database. When we want to move the dataset, it is an easy task of moving a single database file. (Moving 100K+ small files is really slow...)

We also pack and compress the data using pack+zlib in Python before storing it in the database, which reduces the overall data size 5-10x. This greatly reduces the pressure on disk I/O, and with the Linux file cache (vmtouch), the whole SQLite database may end up cached in system memory, which is even better. The tradeoff is some decompression work on load, but CPU decompression is usually fast compared to disk I/O.
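As an illustration of that pattern, here is a rough sketch of the compress-on-write / decompress-on-read round trip (assuming a table instances(id INTEGER, data BLOB)):

import sqlite3
import zlib
import numpy as np

def store_instance(conn, instance_id, array):
    # Compress the raw float32 bytes before writing them to the BLOB column.
    blob = zlib.compress(array.astype(np.float32).tobytes())
    conn.execute("INSERT OR REPLACE INTO instances (id, data) VALUES (?, ?)",
                 (instance_id, blob))
    conn.commit()

def load_instance(conn, instance_id):
    row = conn.execute("SELECT data FROM instances WHERE id = ?",
                       (instance_id,)).fetchone()
    # Decompression costs some CPU, but usually less than the disk I/O saved.
    return np.frombuffer(zlib.decompress(row[0]), dtype=np.float32)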

IterableDataset will solve some problems we previously had when using multiple data loading workers: database connections must not be shared across processes. Some hacky solutions use os.getpid() and check the PID to determine the correct connection to use, but that is not really a good way to do this.
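A sketch of the connection-per-worker pattern that IterableDataset enables, with no PID checks (the table name and sharding scheme are assumptions):

import sqlite3
from torch.utils.data import IterableDataset, get_worker_info

class SqliteIterableDataset(IterableDataset):
    def __init__(self, db_path):
        self.db_path = db_path

    def __iter__(self):
        # Each worker opens its own connection inside __iter__, so no
        # connection object ever crosses a process boundary.
        info = get_worker_info()
        conn = sqlite3.connect(self.db_path)
        query = "SELECT id, data FROM instances"
        if info is not None:
            # Give each worker a disjoint slice of the rows.
            query += " WHERE id % {} = {}".format(info.num_workers, info.id)
        try:
            yield from conn.execute(query)
        finally:
            conn.close()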

Overall, I think a more robust way of loading data from databases or datastores like S3 would be really helpful and would guide the way to making PyTorch better!

pritamdamania87 commented 5 years ago

cc @manojkris @aartibasant

jaliyae commented 5 years ago

cc @thiagocrepaldi

DXist commented 5 years ago

Another use case: dataset versioning. We are interested in incremental image dataset preparation, the ability to reproduce experiments with a given dataset version, and recycling old versions. A database like SQLite would be handy for tracking versions.

thiagocrepaldi commented 5 years ago

We have been working on something similar, and we just pushed a first version for discussion of the API, use cases, etc.: https://github.com/pytorch/pytorch/pull/26547

It builds on the ChunkDataset API originally implemented for C++ (ported to Python in this PR) and the recent IterableDataset. Users can leverage the chunk concept to distribute data across workers with proper randomization and smooth integration with the current DataLoader. Transforms can be applied using a collate function or even at the reader level.
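For readers unfamiliar with the chunk concept, here is an illustrative sketch (not the PR's actual API) of chunked loading on top of IterableDataset:

import random
from torch.utils.data import IterableDataset, get_worker_info

class ChunkedDataset(IterableDataset):
    """Reads data chunk by chunk and shuffles within each chunk.
    read_chunk is a stand-in for a user-defined reader that returns
    a list of samples for a given chunk index."""
    def __init__(self, num_chunks, read_chunk, shuffle=True):
        self.num_chunks = num_chunks
        self.read_chunk = read_chunk
        self.shuffle = shuffle

    def __iter__(self):
        info = get_worker_info()
        worker_id = info.id if info else 0
        num_workers = info.num_workers if info else 1
        # Stripe chunks across workers so each chunk is read exactly once.
        for chunk_idx in range(worker_id, self.num_chunks, num_workers):
            samples = self.read_chunk(chunk_idx)
            if self.shuffle:
                random.shuffle(samples)
            yield from samples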

@jaliyae @xzhu1900 @kit1980

frkl commented 5 years ago

We have projects dealing with annotations from multiple datasets, e.g. COCO, VQA, Visual Genome, COCO-Stuff. Annotations from multiple datasets need to be merged before going to the model. Some images have annotations from all datasets, while others may have only one type of annotation.

In current public implementations, merging multiple datasets into one and building dataloaders takes a lot of effort and many, many for loops, mostly just matching image ids. I recently found that organizing datasets into SQL tables and doing merges/queries greatly reduces the amount of code I have to write, which probably saved me a lot of hair.

In fact, something like the COCO annotation {"license": 5, "file_name": "COCO_train2014_000000057870.jpg", "coco_url": "https://images.cocodataset.org/train2014/COCO_train2014_000000057870.jpg", "height": 480, "width": 640, "date_captured": "2013-11-14 16:28:13", "flickr_url": "https://farm4.staticflickr.com/3153/2970773875_164f0c0b83_z.jpg", "id": 57870}, ...

can be represented as a SQL table with fields "license", "file_name", "coco_url", "height", "width", etc. Merging annotations from two different datasets can then be done with SQL joins on image names/indexes, and dataloaders can be constructed on the SQL tables to process and return rows of a data table.
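A toy sketch of the join-based merge with sqlite3 (the schemas here are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE coco_boxes (file_name TEXT, bbox TEXT);
    CREATE TABLE vqa_questions (file_name TEXT, question TEXT);
""")
# One join replaces the nested for loops that match on image ids;
# LEFT JOIN keeps images that have only one type of annotation.
rows = conn.execute("""
    SELECT b.file_name, b.bbox, q.question
    FROM coco_boxes AS b
    LEFT JOIN vqa_questions AS q USING (file_name)
""").fetchall()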

We spent an afternoon hacking and wrote a table class that is basically a dictionary of data columns, and a database class that is basically a dictionary of tables. You can say table.cuda() and the tensor columns in the table will be sent to the GPU. We also mocked a few SQL operations like joining and filtering. The result is a dataloader class with no for loops.

I think PyTorch could have a standard implementation of a mock SQL DB, something like "torch.db", to do this. Traditional databases can't efficiently handle GPU tensors, so an opportunity for PyTorch is to enable fast joins of tables with tensors and potentially manage GPU RAM/host RAM/disk access optimally.
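The idea in the two paragraphs above can be sketched in a few lines (an illustration of the concept, not the authors' actual code):

import torch

class TensorTable:
    """A toy column store: a dict of equal-length tensor columns."""
    def __init__(self, **columns):
        self.columns = columns

    def cuda(self):
        # Move every tensor column to the GPU in one call.
        return TensorTable(**{k: v.cuda() for k, v in self.columns.items()})

    def filter(self, mask):
        # Row-filter all columns with one boolean mask, no for loops.
        return TensorTable(**{k: v[mask] for k, v in self.columns.items()})

table = TensorTable(image_id=torch.arange(5),
                    height=torch.tensor([480, 640, 480, 512, 300]))
small = table.filter(table.columns["height"] < 500)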

soumith commented 5 years ago

@frkl am I wrong to think that it might be overlapping a lot with RAPIDS from NVIDIA?

frkl commented 5 years ago

> @frkl am I wrong to think that it might be overlapping a lot with RAPIDS from NVIDIA?

Thanks for pointing it out. Just checked, and indeed that looks like what we need -- pandas with GPU tensors. Pandas with PyTorch tensors could save a few more lines, but NumPy is used for interoperability.

Hans0124SG commented 5 years ago

If I am reading time-series numerical and text data from a local SQL server through a psycopg2 connection in my customized Dataset, and I would like to use DataLoader with num_workers > 1, I get errors like SSLError because I have only one connection to the database.

May I know whether, in the current framework, there is any good solution to this? If not, I think a data reading framework targeting data stored in databases would be extremely useful.

thiagocrepaldi commented 5 years ago

> If I am reading time-series numerical and text data from a local SQL server through a psycopg2 connection [...] is there any good solution to this?

Maybe you can use the DataLoader's worker_init_fn to create a new connection for each worker.

Hans0124SG commented 5 years ago

> Maybe you can use the DataLoader's worker_init_fn to create a new connection for each worker.

Thank you for your kind reply! However, I think worker_init_fn won't let me return the connection instance that my Dataset would then use (with pd.read_sql, for example) :(

ssnl commented 5 years ago

@Hans0124SG You can use torch.utils.data.get_worker_info() to manipulate your dataset object in worker_init_fn.
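Concretely, a sketch of that approach (make_connection stands in for e.g. psycopg2.connect, and my_dataset is your Dataset object):

from torch.utils.data import DataLoader, get_worker_info

def worker_init_fn(worker_id):
    # get_worker_info().dataset is this worker's own copy of the dataset,
    # so the connection attached here never crosses a process boundary.
    info = get_worker_info()
    info.dataset.conn = make_connection()  # hypothetical helper

loader = DataLoader(my_dataset, num_workers=4, worker_init_fn=worker_init_fn)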

gourav-sg commented 4 years ago

> At Facebook we are building a data reading framework for PyTorch which can efficiently read from data stores like Hive, MySQL, our internal blob store and any other tabular data sources. [...]

This feature would be absolutely wonderful to have: we have large volumes of data stored in Parquet, Kinesis, Kafka, or Elasticsearch. Without it, I do not quite understand how exactly the PyTorch community can say that this is production-ready software.

jaliyae commented 4 years ago

Instead of recreating this, why not improve/integrate https://github.com/uber/petastorm?

gourav-sg commented 4 years ago

> Instead of recreating this, why not improve/integrate https://github.com/uber/petastorm?

That does not make any sense at all. We should not be creating massive technical overheads just for reading data from object stores (S3, HDFS, etc.) and SQL databases. For example, if I am running Kafka SQL to get the latest images recorded off an IoT device, and train and respond in real time, I want this functionality to be in the PyTorch APIs directly.

zhangruiskyline commented 4 years ago

Hi, I would like to know: is there any good library for PyTorch to load data from a cloud HDFS file system?

gourav-sg commented 4 years ago

Hi Rui,

from what I understand there is not, and yet I keep hearing that PyTorch is used for large-scale AI platforms. Does that not sound a bit confusing?

Yet you can see that TensorFlow does have those libraries for S3, which is another object store: https://apimirror.com/tensorflow~guide/deploy/s3

The answer from the PyTorch community has been that we should write our own readers for loading data, and as a beginner I am thinking "why should I not use TensorFlow instead?". I am more interested in delivering AI projects, not in writing libraries like PyTorch, Lua, or TensorFlow or their corresponding plugins.

Hopefully someone will listen and start working on this.

Regards, Gourav Sengupta


soumith commented 4 years ago

> We should not be creating massive technical overheads just for reading data from object stores (S3, HDFS, etc.) and SQL databases. [...]

> Yet you can see that TensorFlow does have those libraries for S3, which is another object store: https://apimirror.com/tensorflow~guide/deploy/s3

@gourav-sg the whole notion that PyTorch is some special snowflake in the Python ecosystem, and that it requires new APIs or tooling to read a pickle (of Tensors or ndarrays) from S3 or Parquet, seems pretty wrong and arguably misguided.

There are mature Python APIs around reading from HDFS or Parquet or S3. Ultimately, one wants to write a generator that yields a Tensor.

Yes, TensorFlow made a whole new API from scratch for it, with TFRecords and so on, but that is literally to read a TensorFlow format into a TensorFlow Tensor.

But where there is a mature Python API that reads data from datastores and cloud stores and returns it as a NumPy ndarray or a Pandas dataframe, we should absolutely be leveraging that, not sitting and redesigning something from scratch because it's somehow not "PyTorch API". Converting a NumPy array to a PyTorch Tensor is zero mem-copy and zero overhead.
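As a concrete illustration of that point (the bucket and column names are hypothetical):

import numpy as np
import pandas as pd
import torch

# pandas reads Parquet locally or from S3 (with s3fs installed);
# no PyTorch-specific reader is involved.
frame = pd.read_parquet("s3://my-bucket/features.parquet")
array = frame["feature"].to_numpy(dtype=np.float32)
tensor = torch.from_numpy(array)  # shares memory with the ndarray: no copy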

soumith commented 4 years ago

In some fast-forwarded time, if and when @pritamdamania87 and team decide to open-source their data pipeline framework (and there are similar efforts, as @jaliyae pointed out, such as petastorm), one has to think about whether to yield NumPy ndarrays or PyTorch Tensors, depending on where the open-source user base would find this generally useful.

soumith commented 4 years ago

I took a look at the petastorm API, and it looks pretty reasonable:

from petastorm import make_reader
from petastorm.pytorch import DataLoader

with DataLoader(make_reader('hdfs://myhadoop/some_dataset', num_epochs=10,
                            transform_spec=transform), batch_size=64) as train_loader:
    for data, label in train_loader:
        pass  # training loop

It has the following expected features of a high-performance, production-ready data / object store reader:

  • Selective column readout
  • Multiple parallelism strategies: thread, process, single-threaded (for debug)
  • N-grams readout support
  • Row filtering (row predicates)
  • Shuffling
  • Partitioning for multi-GPU training
  • Local caching

zhangruiskyline commented 4 years ago

Thanks @soumith, let me check petastorm. If it can load from a cloud-based HDFS file system like S3 or Azure Data Lake, it should be good enough. I just want something similar to TF's https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/hadoop/hadoop_file_system.cc, so we do not need to copy a large dataset from the HDFS file system to local disk before training in minibatches; rather, the data loader can treat any cloud-based or remote file system like local disk and perform whatever action.
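One hedged sketch of that pattern today uses fsspec, which already puts s3://, gcs://, hdfs:// and others behind a single file API (the URLs and decode function here are assumptions):

import fsspec
from torch.utils.data import IterableDataset

class RemoteBlobDataset(IterableDataset):
    """Streams blobs straight from a remote file system without
    staging them on local disk first."""
    def __init__(self, urls, decode):
        self.urls = urls
        self.decode = decode  # user-supplied bytes -> sample function

    def __iter__(self):
        for url in self.urls:
            with fsspec.open(url, "rb") as f:
                yield self.decode(f.read())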

gourav-sg commented 4 years ago

Hi Soumith,

thanks a ton for responding !!!!

I had expressed my frustration to the community earlier, but the response was "why don't you write your own data loaders". As you can understand, building production-level input pipelines is non-trivial and involves a lot of engineering around resiliency, scalability, distribution, recoverability, and so on.

Sadly, most of my clients are neither funded by taxpayers nor charitable organizations, and therefore their market realities are pretty damn harsh. We have to demonstrate that AI investments are quantifiable, profitable, and long-lived, often on very small profit margins. So:

  1. solutions have to scale
  2. solutions have to be responsive
  3. with less technical debt
  4. and be cost effective at the least

Do you think that Petastorm will be supported by the PyTorch community in general, and later on integrated? The ability to transform data from object stores or SQL (JDBC) into tensors is a mandatory requirement today.

Thanks and Regards, Gourav Sengupta (PS: I am sorry if some of this seems direct; I just wanted the requirements to come out matter-of-factly.)


soumith commented 4 years ago

@gourav-sg pretty much all of the bullet points you gave sound like standard high-level client requirements; I think you can meet them whether you pick Pandas or Spark -- i.e., regardless of the nature of the project. And if you are building custom production pipelines, you likely have to do non-trivial engineering work regardless of the product you pick.

> Do you think that Petastorm will be supported by the PyTorch community in general, and later on integrated? The ability to transform data from object stores or SQL (JDBC) into tensors is a mandatory requirement today.

Looks like Petastorm already supports clean PyTorch integration. I don't see what else one has to do to integrate it further -- we won't be swallowing Petastorm into the PyTorch codebase; it doesn't make any sense for modularity. As for support itself, unlike Databricks or Cloudera we don't have a commercial support arm. So yes, we will endorse good projects in the PyTorch community by carefully vetting them, but we won't have a group of engineers doing client support in exchange for money. Our work on carefully vetting PyTorch-based projects for quality is consolidated on this page: https://pytorch.org/ecosystem/

I think the disconnect between your thinking and ours is that we see PyTorch as a library that integrates with other libraries as equal partners, which people use freely to build their solutions, whereas you see PyTorch as a complete AI solution with various official, well-supported integrations and a good commercial support arm. It is not the latter, unless someone starts a company around commercializing PyTorch. From the limited knowledge I have, TensorFlow isn't that project either, though they put a lot more emphasis on having a TensorFlow API for everything (with less emphasis on reusing other projects' APIs).

So, with that disconnect in place, do choose what suits your clients best. If they really want a commercial-grade endorsement and guarantees (sort of how one used to say "you never get fired for buying IBM"), I am not sure PyTorch at this stage of its lifetime is the right product for your clients. If they perceive TF as being that, then that's fine. Maybe PyTorch will get there some day, organically, but it's not in our strategic priorities to be exactly that.

When we say PyTorch is production-ready, it really is -- used by a different set of consumers. It is used by large and small software companies {FB, Uber, Tesla, Matroid, etc.} who build their own end-to-end solutions for serving / inference.

zhangruiskyline commented 4 years ago

Thanks for the discussion. I think it would be very useful if the PyTorch community could provide a unified data loader over the different file systems underneath -- S3, Azure Data Lake, HDFS, etc. -- requiring just some different config, not another library like petastorm, with an API for reading data that is transparent to the application level regardless of whether the data is on local disk or a remote/distributed file system. @soumith

gourav-sg commented 4 years ago

Hi Soumith,

once again, thanks a million times; your response is much appreciated.

Let me start using Petastorm in that case.

Like I mentioned, small and medium-sized companies do not have the kind of budget that charitable or government-funded/backed organizations have. There are commercial companies like Tesla, of course, but their solutions are anyway too bespoke to be used by a larger user base.

If I am working for a company with funding of 5 million pounds, or an SME with a turnover of around 10 million pounds, they really do not have the budget for a team of 10 people sitting and writing data ingestion software, and then supporting and maintaining it. I think most companies and users fall into that category. Writing production-level, distributed, scalable, resilient, and recoverable data ingestion to tensors is intensive and not cost-effective for such companies. Is there a way that FB or Tesla could give more insight into how they use data stored in HDFS, distributed storage, or databases with PyTorch?

Let me start using Petastorm and come back to you. Once again, thank you so very much. It's super nice to get an authoritative and closed-ended answer.

Thanks and Regards, Gourav Sengupta


zhangruiskyline commented 4 years ago

Hi @soumith, I want to follow up on this thread. TF has an I/O wrapper module, https://www.tensorflow.org/api_docs/python/tf/io/gfile; although it is bundled with TF, I believe it is a general I/O wrapper. Could PyTorch integrate it or provide something similar?

soumith commented 4 years ago

@zhangruiskyline Python's built-in io package maybe?

tmbdev commented 4 years ago

We've built a number of tools for petascale training at NVIDIA. They are built with easy transition from file-based datasets in mind and are based on widely used, open standards. In particular, datasets are stored as POSIX tar archives, meaning that files are stored bit-identical to how they were stored on disk, and existing PyTorch augmentation pipelines work without change.

We achieve I/O performance limited only by available disk and network bandwidth; that is, if you add 1000 drives to your network and have the necessary network connectivity, you get the full bandwidth of all 1000 drives delivered wherever you want to. In addition, these tools give good performance even on cheap rotational drives.

The components are (reconstructed from the descriptions later in this thread):

  • WebDataset, a Python library that reads training samples directly from sharded tar archives and plugs into the standard DataLoader
  • AIStore, a distributed web server and web cache for serving shards at scale
  • tarp, a small command-line tool for map-reduce style processing of tar shards

The initial project driving the development of these tools has been self-supervised training on petabytes of videos, as well as large scale OCR, but the nice thing about these tools is that they "scale down" as well. You can read sharded tar files just from disk, or put datasets on a web server and train against that. All of this works well with Docker, Kubernetes, and Cloud providers.

I thought I'd alert people to the availability of these tools. In fact, we would hope that WebDataset might even become a standard part of PyTorch (it's not a big library) @soumith . There is a lot more documentation to be written, and we'd like to give people some useful "starter datasets" and "starter projects". But have a look and see whether it sounds like it is useful for your needs, and I'd be happy to help you out.

gourav-sg commented 4 years ago

Hi,

I do not quite get this. Production loads happen in the cloud, and data is mostly stored in a columnar format like Parquet or, in the case of images, in binary format. These can be tarred, but they are still in the cloud.

That is what makes distributed GPU-based training possible.

So far there is no support from the PyTorch community on these real-life production issues, unless you take some framework from Uber and have superhuman abilities to make it work.

PyTorch is more research-focused; it is as if you go to Bosch and they tell you "we only manufacture the engines", and others, like BMW, use them. But then you need a multi-billion-pound commitment just to be able to load data, or to build a car around the engine.

So PyTorch is production-ready just like a V8 engine is; the only clause is that you build your own car around the engine.

Regards, Gourav Sengupta


tmbdev commented 4 years ago

@gourav-sg I agree that the divide between production and research exists when it comes to large problems, and that software capable of dealing with petascale datasets has been difficult to deploy. Part of the reason is that existing big data tools and formats (Hadoop, HDFS, Parquet, TFRecord, etc.) were developed for compute environments and use cases different from deep learning.

We wanted something simpler than these solutions, something that would allow a researcher to deal with very large datasets without needing an IT or engineering staff, without learning complex new libraries and concepts, and without having to make a lot of changes to their existing code.

For example, to run a training job over YT8m (a few hundred TB), all you do is put the data on a web server as tar files. The data itself is stored as regular tar files (in this case, of .mp4 files).

import webdataset as wds
from torch.utils.data import DataLoader

# mp4decode is a user-supplied function that decodes raw .mp4 bytes.
url = "http://videobox/yt8m/clips30s-{000000..002999}.tar"
dataset = wds.Dataset(url).decode().map_dict(mp4=mp4decode).to_tuple("mp4", "cls")
loader = DataLoader(dataset, batch_size=64, num_workers=16)

The code doesn't change when you put data on local file systems, in cloud buckets, or on BitTorrent. For development and testing, you can run against a small number of shards on the file system. If you need higher performance than a regular web server, you can run AIStore on whatever hardware you have (AIStore is a distributed web server and web cache with replication, erasure coding, and P2P capabilities, implemented as a single, self-contained executable).

For map-reduce style data preprocessing jobs, you have the option of using a small tool called tarp to run simple command line scripts and tools (e.g., ffmpeg, ImageMagick, etc.) over each shard, and your favorite job queuing system (including K8s) to map all shards in parallel. (Of course, you can also simply untar each shard, process the files as you usually would, and tar it back up again.)

tmbdev commented 4 years ago

By the way, here is an evaluation of AIStore performance and scalability, including a comparison with HDFS: https://arxiv.org/abs/2001.01858

You don't have to use AIStore with WebDataset (anything that speaks HTTP will do), but if you want something scalable and easy to deploy, it's a good choice.

PCerles commented 4 years ago

@tmbdev awesome work 👍

gourav-sg commented 4 years ago

Hi Tom,

It goes without saying that yours is a great piece of work; thanks a ton for the wonderful help.

Most of the small to medium-sized companies, like the clients I work with, find it too expensive to build production-scale data loaders (the properties of a production data loader were explained earlier). The technical debt and time to value are just too high.

We belong to the kind of market segment that would prefer to buy a car rather than build a factory to put a car around an engine.

Elsewhere they are selling cars: https://www.tensorflow.org/datasets/gcs https://www.swiftstack.com/blog/2019/07/09/tutorial-loading-data-into-tensorflow-via-s3-api/ https://github.com/cdatainc/tensorflow-odbc https://www.tensorflow.org/io/api_docs/python/tfio

The sad part is I am in love with PyTorch.

Also, Tom, it is a complete misunderstanding that cloud storage cannot be used for deep-learning work. That is not how the industry works, unless you are using PyTorch of course.

Regards, Gourav Sengupta


tmbdev commented 4 years ago

@gourav-sg We regularly run PyTorch training in the cloud using WebDataset (and have been for a couple of years). That's one of its primary use cases. It literally is as simple as changing "/dataset" to "gs://bucket/dataset".

In addition, you can use AIStore as a caching proxy. That is, if your master dataset is stored in Google Cloud, Azure, or S3, you can configure AIStore on premises to cache the data locally as you train. You get all the benefits of cloud storage with the performance of local storage.

AIStore will also act as an S3 server for Tensorflow's S3-based I/O primitives, and it will (in the next release) perform on-the-fly conversions from WebDataset to/from TFRecord, for local and for cloud data.

tmbdev commented 4 years ago

@gourav-sg Tensorflow's I/O architecture is much more complex than PyTorch's because of Tensorflow's original design around graph execution and multithreading. That's why there are special support libraries for all sorts of I/O systems. But what useful functionality does tensorflow-odbc or tfio.BigTable give you that Python ODBC and Python BigTable doesn't give you out of the box? What would you expect equivalent PyTorch libraries to actually do?

gourav-sg commented 4 years ago

Hi,

I think this thread is getting too long now, so I will keep my response brief.

What I am trying to say is simple: we cannot assume that deep learning algorithms are only useful for datasets stored in tar files. That is misleading.

Production-scale data loaders need properties that require a lot of serious engineering and investment.

Therefore the ask is simple: can we have data loaders prebuilt in PyTorch so that we can run multi-GPU, distributed-cluster training as production workloads demand?

Sorry, but I am expecting individuals who have actually built, run, and scaled industry solutions (without spending millions creating data loaders) to understand the pain points and respond to me. When you have a client SLA to meet, and your pain point is that your engineer left you with data loader technical debt you cannot resolve or get support for, a research paper does not help against lawsuits over critical failed production targets.

I am in the market and have to face clients who have to make their investments profitable continuously, year on year; that is the deciding factor in choosing between PyTorch, MXNet, or TensorFlow. Personally, you cannot stop loving PyTorch, but professionally I have to meet my clients' profitability goals.

I think Soumith at least got my question and asked me to use Uber's library, which I found quite painful to use in production.

Thanks and Regards, Gourav Sengupta


tmbdev commented 4 years ago

> Therefore the ask is simple: can we have data loaders prebuilt in PyTorch so that we can run multi-GPU, distributed-cluster training as production workloads demand?

Well, we run jobs with hundreds of nodes and petabytes of data using the tools I described. We achieve completely linear scalability and full I/O bandwidth utilization across all drives, with demonstrably better scaling and I/O utilization than HDFS.

> I think Soumith at least got my question and asked me to use Uber's library, which I found quite painful to use in production.

And because we found those existing systems to be "quite painful in production", we developed something simpler and less painful. Part of what makes Petastorm painful is its use of Parquet and its reliance on PySpark and HDFS, plus all the consequences that entails. That's why we chose a different container format (POSIX tar), a different object storage protocol (HTTP and S3), and a fundamentally different approach to running map-reduce jobs.

Maybe you expect systems that operate on petabytes of data to be necessarily huge and complex, but our system is fast and scalable precisely because it is minimalist and simple.

The fact that we support on-the-fly conversion to TFRecord also means that you don't have to make a choice between PyTorch and Tensorflow (something that benefits PyTorch in production settings). And, of course, it's free and open source.

In any case, it's just an option. If it doesn't suit your needs, by all means, use something else. But don't assume that it's a toy because it looks simple.

elgalu commented 3 years ago

@tmbdev WebDataset and AIStore look really promising; thanks for sharing this work. I noticed that the paper doesn't mention Lustre, but is AIStore comparably performant?

hungj commented 9 months ago

@pritamdamania87 any update on this? We're looking for ways to read Hive tables into PyTorch, and this feature sounds really useful!