tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.org/tfx
Apache License 2.0

Question: Interactive use of TFX and TFRecords? #636

Closed robertlugg closed 4 years ago

robertlugg commented 5 years ago

For moderately sized data, some manual work may be desirable. For instance, I might wish to inspect all my training images and delete some of them. Or, I might want to change annotations (that I would have saved in TFRecords). I'm wondering if TFRecords are the right format when I want both scalable workflows and interactive workflows. If TFRecord files can only stream, any interactive operation would likely be clumsy. A simple use case would be an image browser where I could look at the next and previous image.

Do I misunderstand the use model for TFRecords? Are they pretty much for streaming only, or could you envision an interactive TFX component?

1025KB commented 5 years ago

TFRecord is just a file format like CSV or TXT. What do you mean by streaming only? It should work just like any other file format.

robertlugg commented 5 years ago

What I really should have described it as is random access versus serial access. Consider the case of an image viewer. I want to take a look at my ML examples. I may wish to add or modify labels. Or, I might want to exclude a particular image for some reason. I might have an image viewer with a "next" and "previous" button. If access is only serial (like TFRecord, CSV, and text), I really can't do that directly. I would need to load the data into memory or convert it to a format that supports random access (a database, for instance).

So let's say I'm working in a Jupyter notebook. While developing, I want to do the operations I described above. I think I would need to first convert the TFRecords to a random-access format such as a database. Then my logic would pull from that database. Then I would have to update the TFRecords based on any interactive changes I had made.

While that is possible, it seems really clunky and slow. So, I'm wondering whether I'm thinking about the problem incorrectly, or whether this is just a use model that TFX wasn't designed to handle. Thanks for your thoughts.
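The round trip described above (load the records into a random-access store, edit interactively, export what remains) could be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table layout and the helper names (`load_into_sqlite`, `get_example`, `set_label`, `exclude`, `export_kept`) are hypothetical and not part of TFX, and each record is treated as an opaque byte string. In a real notebook the input bytes might come from `tf.data.TFRecordDataset(path).as_numpy_iterator()` and the exported bytes might be written back with `tf.io.TFRecordWriter`, neither of which is shown here.

```python
import sqlite3


def load_into_sqlite(records, db_path=":memory:"):
    """Copy serialized records into a table keyed by their position."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS examples ("
        "id INTEGER PRIMARY KEY, "
        "payload BLOB NOT NULL, "
        "label TEXT, "
        "excluded INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO examples (id, payload) VALUES (?, ?)",
        ((i, bytes(r)) for i, r in enumerate(records)),
    )
    conn.commit()
    return conn


def get_example(conn, idx):
    """Random access by position: returns (payload, label, excluded) or None."""
    return conn.execute(
        "SELECT payload, label, excluded FROM examples WHERE id = ?", (idx,)
    ).fetchone()


def set_label(conn, idx, label):
    """Attach or change the annotation of one example."""
    conn.execute("UPDATE examples SET label = ? WHERE id = ?", (label, idx))
    conn.commit()


def exclude(conn, idx):
    """Mark one example so that it is skipped on export."""
    conn.execute("UPDATE examples SET excluded = 1 WHERE id = ?", (idx,))
    conn.commit()


def export_kept(conn):
    """Yield (payload, label) for every example that was not excluded."""
    query = "SELECT payload, label FROM examples WHERE excluded = 0 ORDER BY id"
    for row in conn.execute(query):
        yield row
```

A "next" or "previous" button then reduces to calling `get_example(conn, idx + 1)` or `get_example(conn, idx - 1)`. Note that the labels live in a separate column in this sketch; merging them back into each serialized example before export is left out.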

1025KB commented 5 years ago

So, if I understand correctly, you would edit a particular tf.Example instead of the entire dataset, like a TFRecord editor (similar to a text editor)?

Currently we don't have plans to support this use case for file-based input data. What we do is basically import all the data, convert it to tf.Examples, and use those for training.

Transform can be used to modify tf.Examples, but it applies the same function to all input tf.Examples (https://www.tensorflow.org/tfx/transform/get_started).

For random access, you can still access the file with a file I/O cursor in a "random" way, just as a text editor accesses a text file, or try another format such as Parquet or a BigQuery table, which provide query access.
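As a rough illustration of the file-cursor approach, the sketch below scans an uncompressed TFRecord file once to record where each record starts, then seeks straight to any record on demand. It relies on the TFRecord framing (a little-endian uint64 length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload). The function names `build_offset_index` and `read_record` are hypothetical, compressed files are not handled, and the CRCs are skipped rather than verified.

```python
import struct

# TFRecord framing: uint64 length, uint32 CRC of the length,
# payload bytes, uint32 CRC of the payload.
_LEN_BYTES = 8
_CRC_BYTES = 4


def build_offset_index(path):
    """Scan the file once and return the byte offset of every record."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            start = f.tell()
            header = f.read(_LEN_BYTES)
            if len(header) < _LEN_BYTES:
                break  # end of file (or a truncated trailing header)
            (length,) = struct.unpack("<Q", header)
            offsets.append(start)
            # Jump over the length CRC, the payload, and the payload CRC.
            f.seek(_CRC_BYTES + length + _CRC_BYTES, 1)
    return offsets


def read_record(path, offset):
    """Return the raw payload of the record that starts at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = struct.unpack("<Q", f.read(_LEN_BYTES))
        f.seek(_CRC_BYTES, 1)  # skip the length CRC (not verified here)
        return f.read(length)
```

With the index held in memory, the i-th record is `read_record(path, offsets[i])`, and the returned bytes can be parsed with `tf.train.Example.FromString(...)` if the file stores tf.Examples. Editing in place is a different matter: because records are variable-length, changing one generally means rewriting the file.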

gowthamkpr commented 4 years ago

Closing this issue as it has been answered. Please add any additional comments and we can reopen it.