junhuihuang / webpages


What is Data Block API in fastai? #1

Closed junhuihuang closed 1 year ago

junhuihuang commented 2 years ago

Data Block API

The data block API is a high-level, expressive API in fastai for data loading. It lets you systematically define every step needed to prepare data for a deep learning model, and it gives users a “mix and match” recipe book for combining those steps.

Think of a DataBlock as a list of instructions to follow when building batches and DataLoaders. It does not need any actual items when you define it; it is a blueprint for how to process them once they are supplied.

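Now, we just saw the word DataLoaders, but what is that? Well, PyTorch and fastai use two main classes to represent and access a training or validation set:

Dataset: A collection that returns a tuple of input and target for a single item.

DataLoader: An iterator that provides a stream of mini-batches, where each mini-batch is a tuple of a batch of inputs and a batch of targets.

On top of these, fastai provides two classes for bringing your training and validation sets together: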

Datasets: An object that contains a training Dataset and a validation Dataset.

DataLoaders: An object that contains a training DataLoader and a validation DataLoader.
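To make the distinction concrete, here is a minimal sketch using the underlying PyTorch classes, `TensorDataset` and `DataLoader`, on made-up toy data. The ten items and their even/odd labels are purely illustrative. fastai's Datasets and DataLoaders (plural) pair a training and a validation version of these objects.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Ten toy items: one feature each, with an even/odd label (made-up data).
xs = torch.arange(10, dtype=torch.float32).unsqueeze(1)
ys = torch.arange(10) % 2

ds = TensorDataset(xs, ys)          # Dataset: index -> (input, target) tuple
x3, y3 = ds[3]                      # a single item

dl = DataLoader(ds, batch_size=4)   # DataLoader: iterates over mini-batches
xb, yb = next(iter(dl))             # first mini-batch of 4 items
print(len(ds), xb.shape, yb.tolist())
```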

The fastai library has an easy way of building DataLoaders so that it is simple enough for someone with minimal coding knowledge to understand, yet advanced enough to allow for exploration.

Steps

Creating a data block involves several steps, which the data block API defines. They can be phrased as questions to ask while looking at the data:

  1. What are the types of your inputs/targets? (blocks)
  2. Where is your data? (get_items)
  3. Does something need to be applied to inputs? (get_x)
  4. Does something need to be applied to the target? (get_y)
  5. How to split the data? (splitter)
  6. Do we need to apply something on formed items? (item_tfms)
  7. Do we need to apply something on formed batches? (batch_tfms)

That's it!

You can treat each question or step as a brick that builds the fastai data block.

Looking at the dataset is very important when building DataLoaders, and the data block API gives you a strategy for approaching it. The first thing to look at is how the data is stored, that is, in which format and layout. Then compare it with well-known datasets: check whether it is stored in a similar way and decide how to approach it.

Here, blocks define the problem domain from a set of pre-defined types. For example, for an image problem, ImageBlock tells the library to open files with Pillow without you saying so explicitly, and CategoryBlock or MultiCategoryBlock says whether it is single-label or multi-label classification. There are many blocks, such as ImageBlock, CategoryBlock, MultiCategoryBlock, MaskBlock, PointBlock, BBoxBlock, BBoxLblBlock, TextBlock, and so on.

get_items answers the question, “where is the data?”

For example, in an image problem, we can use the get_image_files function to collect the file paths of all our images and then look at the data.

get_x answers the question, “does something need to be applied to the inputs?”

get_y is how you extract labels.

splitter is how you want to split your data. This is usually a random split between the training and validation sets.

The remaining two bricks of data block API are item_tfms and batch_tfms:

item_tfms are item transforms, applied to each item individually. This is done on the CPU.

batch_tfms are batch transforms, applied to whole batches of data. This is done on the GPU (when one is available).

Using these bricks in the data block, we can build DataLoaders for different types of problems such as classification, object detection, and segmentation.

The data block API provides a good balance of conciseness and expressiveness. In the data science domain, the scikit-learn pipeline approach is widely used. That approach is highly expressive, but it is not opinionated enough to ensure that a user completes all of the steps necessary to get their data ready for modeling. The fastai data block API, by contrast, covers all of these steps.

Source: https://www.educative.io/edpresso/what-is-data-block-api-in-fastai

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.