google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Shuffle script for training runs out of memory #360

Closed anands-repo closed 4 years ago

anands-repo commented 4 years ago

Describe the issue: Shuffle script for tfrecords (https://github.com/google/deepvariant/blob/r1.0/docs/deepvariant-training-case-study.md) runs out of memory when using a training set from multiple BAM files.

This is what I followed:

This requires over 230 GB of CPU RAM, and the process is eventually killed. I do not know whether the memory requirement will keep growing beyond this point. Is there another way to deal with this situation? For example, it would be possible to run shuffling for the data from each BAM file independently. However, I am not sure what the flow would look like after that point.

Setup

pichuan commented 4 years ago

Hi @anands-repo , when you say it runs out of memory, are you using DataflowRunner like in our documentation?

anands-repo commented 4 years ago

Hi @pichuan

I am not running on Google Cloud, but on a local machine. So I went with the default runner.

When I use DataflowRunner, the shuffle script requests arguments relevant to GCS. For example, I get errors such as:

Invalid GCS path (<PATH>), given for the option: temp_location

The <PATH> here is actually a valid path on my machine. Kindly advise. Thanks!

anands-repo commented 4 years ago

The command that works but runs out of memory is this:

python $SCRIPTPATH/shuffle_tfrecords_beam.py \
  --input_pattern_list=$INPUT_PATTERN \
  --output_pattern_prefix=$OUTPUT_PREFIX \
  --output_dataset_config_pbtxt=$OUTPUT_DATASET_CONFIG_PBTXT \
  --output_dataset_name=$OUTPUT_DATASET_NAME

The command that doesn't run and gives an error:

python $SCRIPTPATH/shuffle_tfrecords_beam.py \
  --input_pattern_list=$INPUT_PATTERN \
  --output_pattern_prefix=$OUTPUT_PREFIX \
  --output_dataset_config_pbtxt=$OUTPUT_DATASET_CONFIG_PBTXT \
  --output_dataset_name=$OUTPUT_DATASET_NAME \
  --job_name=$JOBNAME \
  --project=$PROJECT_NAME \
  --temp_location=$TEMPLOCATION \
  --save_main_session \
  --region us-east1

Obtained error: Invalid GCS path (<PATH>), given for the option: temp_location

I also tried the SparkRunner, which works but runs into the same memory issue. It seems DirectRunner and SparkRunner try to shuffle everything in memory (RAM) and do not use local storage. Maybe DataflowRunner uses local storage (it accepts a --temp_location argument)? However, this is not available to me on my local machine, since DataflowRunner seems to require the code to be run on Google Cloud.

pichuan commented 4 years ago

Hi @anands-repo

The point of using Dataflow is to run things in a distributed fashion, which means it shouldn't be running on your local machine. I assume the Spark runner should allow you to run with distributed workers as well if you set it up correctly, but I have never used it myself. Our team doesn't support different distributed setups for Beam. Please refer to https://beam.apache.org/documentation/runners/spark/ to see if you can set up the Spark runner so that it doesn't use your local memory. If you do figure out a good setup to run on Spark, please feel free to share some tips here so people can use it in the future!

If you can't use the shuffle script, you can consider a less fine-grained shuffle "hack" described in this older document: http://bit.ly/train-deepvariant (Note that this doc is a one-off document and is not maintained by our team. Please consider it a possible example that you'll probably need to tweak for your own use case.)
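
To make that idea concrete, here is a minimal sketch of such a coarse-grained shuffle (not taken from that document): shuffle the records within each shard independently and randomize the shard order, so peak RAM stays near the size of the largest shard. The file-name pattern, the GZIP compression, and the TF 2.x API are assumptions you would adapt to your own data.

# Hedged sketch: per-shard ("coarse-grained") shuffle of training TFRecords.
# Each shard is loaded, shuffled, and rewritten on its own, and the shard
# order itself is randomized, so only one shard is in memory at a time.
import glob
import random

import tensorflow as tf  # assumes TF 2.x

def shuffle_one_shard(in_path, out_path):
    # Read all serialized records of a single shard into memory.
    records = list(
        tf.data.TFRecordDataset(in_path, compression_type="GZIP")
        .as_numpy_iterator())
    random.shuffle(records)
    opts = tf.io.TFRecordOptions(compression_type="GZIP")
    with tf.io.TFRecordWriter(out_path, opts) as writer:
        for rec in records:
            writer.write(rec)

# Hypothetical file-name pattern; substitute your own make_examples output.
shards = glob.glob("training_set.with_label.tfrecord-*.gz")
random.shuffle(shards)  # also randomize shard order across BAM files
for i, shard in enumerate(shards):
    out = f"training_set.shuffled-{i:05d}-of-{len(shards):05d}.tfrecord.gz"
    shuffle_one_shard(shard, out)

Note that this only shuffles within shards (plus the shard order), which is weaker than a full global shuffle; it is a trade-off between memory and randomization quality.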

anands-repo commented 4 years ago

Hi @pichuan, thanks for the advice. I will look into these possibilities.

The coarse-grained shuffling would be easiest; however, both the document you mentioned and the training page note that shuffling is an important step. Technically, stochastic/batch gradient descent does depend on random batches.

I will look into Spark, as well as other options like Dask or Torque (which would need a script hack). If I find a setup that works on local clusters, I will share it.

Thanks!

anands-repo commented 4 years ago

@pichuan

As you know, I am running the shuffle script using Spark. I am wondering how many output files are expected from running the script.

When I use DirectRunner, I get a single output file. When I use the SparkRunner I get as many output files as there are input files fitting the pattern (I have noticed this mismatch between spark/direct runner in another situation as well: https://stackoverflow.com/questions/64450391/apache-beam-beam-flatten-doesnt-flatten-files-with-sparkrunner-but-does-so-wi).

Is this the expected result when using the Dataflow runner as well? Basically, I am trying to do a sanity check to make sure that the shuffler isn't simply reading in the data and copying it without shuffling, or only shuffling within each shard.

Thanks!

GuillaumeHolley commented 2 years ago

Just for the record, I wrote a script that shuffles the records locally using as little memory as possible: TFrecordShuffler. It uses about as much RAM as the total size of the input (record) files on disk. The downside is obviously the runtime, which I imagine is much longer than with a distributed Google Cloud or Spark setup. As an example, shuffling ~30 million records totaling 125 GB of files took 46 hours (wall-clock and CPU) and 150 GB of RAM.
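
For comparison (this is not the linked TFrecordShuffler, just a hedged sketch of a different trade-off): if RAM rather than disk is the hard limit, a two-pass shuffle can scatter records at random into many temporary buckets on disk and then shuffle each bucket in memory, so peak RAM stays near the size of one bucket at the cost of writing the data twice. The GZIP compression and the TF 2.x API are assumptions.

# Hedged sketch: two-pass, disk-backed shuffle. Pass 1 scatters records into
# N temporary buckets at random; pass 2 shuffles each bucket in memory and
# concatenates them, so peak RAM is roughly (total size / N).
import glob
import os
import random
import tempfile

import tensorflow as tf  # assumes TF 2.x

def external_shuffle(input_pattern, output_path, num_buckets=64):
    tmpdir = tempfile.mkdtemp()
    bucket_paths = [os.path.join(tmpdir, f"bucket-{i:03d}")
                    for i in range(num_buckets)]
    writers = [tf.io.TFRecordWriter(p) for p in bucket_paths]
    # Pass 1: send every record to a random bucket (uncompressed temp files).
    files = sorted(glob.glob(input_pattern))
    for rec in tf.data.TFRecordDataset(
            files, compression_type="GZIP").as_numpy_iterator():
        writers[random.randrange(num_buckets)].write(rec)
    for w in writers:
        w.close()
    # Pass 2: shuffle each bucket in memory and append it to the final output.
    opts = tf.io.TFRecordOptions(compression_type="GZIP")
    with tf.io.TFRecordWriter(output_path, opts) as out:
        for p in bucket_paths:
            records = list(tf.data.TFRecordDataset(p).as_numpy_iterator())
            random.shuffle(records)
            for rec in records:
                out.write(rec)
            os.remove(p)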

pichuan commented 2 years ago

Awesome, thank you @GuillaumeHolley . I'll make a note to add this as an option in our training tutorial in the future.

pichuan commented 2 years ago

@GuillaumeHolley FYI, I'm working on updating the tutorial. I will add this sentence:

NOTE: If you prefer shuffling locally, please take a look at this user-provided shuffler option: https://github.com/google/deepvariant/issues/360#issuecomment-1019990366

If you want to suggest a different sentence in the tutorial, please let me know!

yinshiyi commented 6 days ago

@pichuan When shuffling the datasets using the local runner with direct_num_workers set to 0, it uses all the local CPUs. I got this warning that made me wonder:

WARNING:apache_beam.runners.portability.fn_api_runner.fn_runner:If direct_num_workers is not equal to 1, direct_running_mode should be `multi_processing` or `multi_threading` instead of `in_memory` in order for it to have the desired worker parallelism effect. direct_num_workers: 8 ; running_mode: in_memory

Is there a reason we use in_memory rather than other modes?

Thank you.

pichuan commented 6 days ago

Hi @yinshiyi , Hello :D

First, this is a pretty old bug. It might be easier to open a new issue. Otherwise our team member on rotation might not notice it.

To your question, are you asking about https://github.com/google/deepvariant/blob/r1.6.1/docs/deepvariant-training-case-study.md ?

In that documentation, we provided two examples of using the shuffle script. One is with:

  --runner=DirectRunner \

the other one is with:

  --runner=DataflowRunner \

If you intend to use Dataflow, please refer to the command that uses DataflowRunner.
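
Regarding the in_memory warning itself: direct_num_workers and direct_running_mode are Apache Beam DirectRunner options, and local worker parallelism normally requires the multi_processing (or multi_threading) mode. A hedged sketch of how those flags could be appended to the DirectRunner command, assuming the shuffle script passes unrecognized flags through to Beam's pipeline options as the documented --direct_num_workers flag suggests:

python $SCRIPTPATH/shuffle_tfrecords_beam.py \
  --input_pattern_list=$INPUT_PATTERN \
  --output_pattern_prefix=$OUTPUT_PREFIX \
  --output_dataset_config_pbtxt=$OUTPUT_DATASET_CONFIG_PBTXT \
  --output_dataset_name=$OUTPUT_DATASET_NAME \
  --runner=DirectRunner \
  --direct_num_workers=8 \
  --direct_running_mode=multi_processing

Note that this only changes worker parallelism; the DirectRunner still performs the shuffle on the local machine, so it does not address the memory concerns discussed earlier in this thread.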

@yinshiyi , if you want to discuss further, please open a new issue with a few more details on which command you were using. That will be helpful for our team member to provide more support. Thank you :)