tensorflow / benchmarks

A benchmark framework for Tensorflow
Apache License 2.0

The meaning of the flag "use_datasets" is confusing. #461

Open gangmuk opened 4 years ago

gangmuk commented 4 years ago

The meaning of the flag "use_datasets" is confusing.

Below are the two flags in question: "use_datasets" and the related flag "datasets_use_prefetch".

flags.DEFINE_boolean('use_datasets', True,
                     'Enable use of datasets for input pipeline')
flags.DEFINE_boolean('datasets_use_prefetch', True,
                     'Enable use of prefetched datasets for input pipeline. '
                     'This option is meaningless if use_datasets=False.')

At first I thought 'use_datasets' chooses between synthetic data and a real dataset, but from my own observation that does not seem to be the case.

So here is what I did and the situation I am encountering.

  1. I set 'datasets_use_prefetch' to False.

  2. I set 'use_datasets' to True.

  3. I run the Resnet50 model with batch size 128 for 100 training iterations (in other words, 100 steps).

  4. On the first run, disk I/O goes up. More specifically, read throughput rises to around 20 MB/s.

  5. After the first run finishes, I run the same model again with the same flag settings, but this time for 200 training iterations.

  6. For the first 100 iterations, read throughput does not go up. However, from iteration 101 onward, it goes up again.

From this observation, my guess now is that if I set the 'use_datasets' flag to True, the already-read dataset is stored somewhere on disk and brought from disk into main memory before the training iterations start.

Am I understanding this correctly? If so, what is the difference from the 'datasets_use_prefetch' flag?

These may look like complicated questions, but the essence is that I don't fully understand the flag definitions ;) Thank you in advance.

reedwm commented 4 years ago

Note that tf_cnn_benchmarks is no longer maintained, and I recommend you look at the official models instead.

To answer your question, --use_datasets simply causes the tf.data API to be used instead of a deprecated RecordInput API. As you realized, this has nothing to do with synthetic versus real data, and it does not affect how the benchmark runs. I do not recommend turning this option off.
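
Concretely, the tf.data path builds the input as a chain of dataset transformations. A minimal, simplified sketch (written in current TF2-style tf.data; the file pattern, feature keys, and parse function are placeholders, not the exact code in tf_cnn_benchmarks):

    import tensorflow as tf

    def parse_fn(serialized_example):
        # Placeholder parse function: decode one TFRecord into (image, label).
        # The feature keys here are assumptions for illustration only.
        features = tf.io.parse_single_example(
            serialized_example,
            {'image/encoded': tf.io.FixedLenFeature([], tf.string),
             'image/class/label': tf.io.FixedLenFeature([], tf.int64)})
        image = tf.io.decode_jpeg(features['image/encoded'], channels=3)
        image = tf.image.resize(image, [224, 224])
        return image, features['image/class/label']

    # Build the input as a chain of tf.data transformations.
    files = tf.data.Dataset.list_files('/path/to/train-*')   # placeholder pattern
    dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
    dataset = dataset.map(parse_fn, num_parallel_calls=4)
    dataset = dataset.batch(128)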

Real data is used if the --data_dir flag is passed, and synthetic data is used otherwise.

--datasets_use_prefetch also only causes an API change: with the option, tf.data is also used for prefetching; without it, the deprecated StagingArea API is used instead.
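
In tf.data terms, the prefetching side is just one more stage at the end of the pipeline sketched above (again simplified; the StagingArea-based path is more involved and not shown here):

    # Continuing the sketch above: with --datasets_use_prefetch, prefetching is
    # expressed as a tf.data stage that keeps the next batch(es) ready while
    # the compute device is busy. Without the flag, a StagingArea-based
    # mechanism fills that role instead.
    dataset = dataset.prefetch(buffer_size=1)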

From this observation, my guess now is that if I set the 'use_datasets' flag to True, the already-read dataset is stored somewhere on disk and brought from disk into main memory before the training iterations start.

It doesn't bring the entire dataset into memory, but it brings large chunks at a time. This is done regardless of --use_datasets and --datasets_use_prefetch, but the API used to bring the chunks into memory differs depending on whether those flags are set. If you still want to use tf_cnn_benchmarks over the official models, I don't recommend turning either option off, as doing so causes deprecated APIs to be used that are no longer well tested.
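
As a rough illustration of what "chunks" means here (continuing the Python sketch above; the file name and the 8 MB buffer size are arbitrary examples, not the benchmark's actual reader settings): the record reader fills a read buffer from disk and records are consumed out of it, so disk reads happen in large chunks rather than one record at a time.

    # Illustration only: tf.data's TFRecord reader exposes a read buffer size
    # in bytes; disk reads refill this buffer in large chunks.
    dataset = tf.data.TFRecordDataset('/path/to/train-00000-of-01024',
                                      buffer_size=8 * 1024 * 1024)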

gangmuk commented 4 years ago

Thank you very much for the quick answer.

Can I ask a few more follow-up questions?

  1. Can we know how many chunks of the dataset TF brings from disk to memory, and how large they are?
  2. Can we control the number and size of those chunks?
  3. There are multiple other flags related to data loading, such as:

    --datasets_parallel_interleave_cycle_length: Number of parallel file readers interleaving input data. (an integer)
    
    --datasets_parallel_interleave_prefetch: The number of input elements to fetch before they are needed for interleaving. (an integer)
    
    --datasets_prefetch_buffer_size: Prefetching op buffer size per compute device. (default: '1') (an integer)
    
    --[no]datasets_repeat_cached_sample: Enable use of a special datasets pipeline that reads a single TFRecord into memory and repeats it infinitely many times. The purpose of this flag is to make it possible to write regression tests that are not bottlenecked by CNS throughput. Use datasets_use_caching to cache input data. (default: 'false')
    
    --[no]datasets_sloppy_parallel_interleave: Allow parallel interleave to depart from deterministic ordering, by temporarily skipping over files whose elements are not readily available. This can increase throughput, in particular in the presence of stragglers. (default: 'false')
    
    --[no]datasets_use_caching: Cache the compressed input data in memory. This improves the data input performance, at the cost of additional memory. (default: 'false')
    
    --[no]datasets_use_prefetch: Enable use of prefetched datasets for input pipeline. This option is meaningless if use_datasets=False. (default: 'true')

    Should we find the optimal parameters by trying different combinations of these, or are the default settings enough in most cases? For instance, I couldn't see any meaningful difference in throughput when I increased "datasets_parallel_interleave_cycle_length". (I sketch my current reading of how these flags map onto tf.data calls after this list.)

  4. Is the "datasets_use_caching" flag for the case where the 2nd epoch touches the same data as the 1st epoch, i.e. the dataset is cached in memory during the 1st epoch so the 2nd epoch benefits from it? Am I understanding it correctly?
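
To make my questions concrete, here is roughly how I currently read these flags in present-day tf.data terms. This is just my interpretation of the flag descriptions, and the specific numbers and file pattern are arbitrary examples, so please correct me if it is wrong:

    # My (possibly wrong) mental model of these flags in plain tf.data terms.
    import tensorflow as tf

    files = tf.data.Dataset.list_files('/path/to/train-*')   # placeholder pattern

    # --datasets_parallel_interleave_cycle_length: how many files are read and
    # interleaved in parallel; --datasets_sloppy_parallel_interleave relaxes
    # the deterministic ordering of that interleaving.
    dataset = files.interleave(tf.data.TFRecordDataset,
                               cycle_length=10,
                               num_parallel_calls=10,
                               deterministic=False)

    # --datasets_use_caching: keep the compressed input in memory after the
    # first pass so later epochs do not have to read from disk again.
    dataset = dataset.cache()

    # --datasets_prefetch_buffer_size: how many elements to keep ready ahead
    # of each compute device.
    dataset = dataset.prefetch(buffer_size=1)

    # I am not sure exactly where --datasets_parallel_interleave_prefetch hooks
    # in; my guess is that it controls how many elements are fetched ahead
    # inside the interleave stage.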

Thank you again :)

reedwm commented 4 years ago

@rohan100jain, can you answer these questions?