hotpotqa / hotpot

Apache License 2.0

Preprocessing step - memory consumption #31

Closed by AndresFRJ98 4 years ago

AndresFRJ98 commented 4 years ago

Hello,

I am trying to get the initial repo to work by following the steps provided. In the preprocessing step, upon running:

python main.py --mode prepro --data_file hotpot_train_v1.1.json --para_limit 2250 --data_split train

The preprocessing begins. However, the memory consumption on my machine becomes enormous (up to 10 GB), so I decided to terminate it. How many "tasks" (as displayed while running) does the preprocessing have to go through?

Is this amount of memory consumption normal, or is something wrong with my setup/environment?

Thanks

qipeng commented 4 years ago

That amount of memory consumption is expected. You can reduce it by processing fewer tasks at a time (n_jobs in Parallel()). There's one "task" for each example in the dataset, so depending on which split you're processing, it could be ~90k or ~8k.
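
For anyone landing here later, the n_jobs suggestion can be sketched roughly as follows. This is a minimal illustration using joblib's Parallel and delayed, not the repo's actual preprocessing code: process_example, preprocess, and the toy examples are hypothetical stand-ins, and the exact place where n_jobs is set in the repo is not shown in this thread.

```python
from joblib import Parallel, delayed


def process_example(example):
    # Hypothetical stand-in for the per-example preprocessing work.
    return len(example["question"].split())


def preprocess(examples, n_jobs=1):
    # One task is scheduled per example; n_jobs caps how many run at once.
    # A smaller n_jobs means fewer workers holding data in memory at a time.
    return Parallel(n_jobs=n_jobs, verbose=0)(
        delayed(process_example)(example) for example in examples
    )


# Toy data, made up for illustration only.
examples = [
    {"question": "Who wrote Hamlet?"},
    {"question": "What is the capital of France?"},
]

# n_jobs=1 runs the tasks one after another, the most memory-frugal setting.
token_counts = preprocess(examples, n_jobs=1)
print(len(examples), "tasks ->", token_counts)
```

Lowering n_jobs trades speed for memory, so it may be worth trying a few values to see what fits on your machine. The number of "tasks" reported while running should match the number of examples passed in.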