shuzhao-li-lab / asari

asari, metabolomics data preprocessing

Data with memory issue #43

Closed xulei99 closed 1 year ago

xulei99 commented 1 year ago

Hello there, I'm using asari to process data and found that this dataset consumes far more memory than expected. My computer's 32 GB of memory is not enough for these 10 data files, and I wonder why. (Only the Thermo QE HF datasets: https://drive.google.com/drive/folders/1PRDIvihGFgkmErp2fWe41UR2Qs2VY_5G)

jmmitc06 commented 1 year ago

Shuzhao mentioned this in the other issue but I wanted to address this issue directly.

By default, asari stores intermediate data in memory rather than on disk for small studies. Whether a study counts as small is decided by the number of files, not their size. This leads to an edge case: a study with few files whose intermediates still do not fit into memory, which is the problem you are seeing.

I suggest overriding this behavior by explicitly telling asari to use 'ondisk' mode. To do this, copy parameters.yaml from the test directory, change database_mode to 'ondisk' in the copy, and pass the custom parameter file with --parameters new_filename.yaml.
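If you would rather script the copy-and-edit step than do it by hand, here is a minimal sketch. It is not part of asari. The helper names `set_database_mode` and `write_custom_parameters` are made up for illustration, and it assumes that database_mode appears as a top-level key on its own line in parameters.yaml. It edits the text directly with the standard library, so it needs no YAML parser.

```python
import re
from pathlib import Path

# Matches a top-level `database_mode: <value>` line and keeps any trailing comment.
# Assumption: the key sits at the start of a line, as in a flat YAML file.
_DB_MODE_LINE = re.compile(
    r"^(database_mode[ \t]*:[ \t]*)[^#\n]*?([ \t]*(?:#.*)?)$",
    flags=re.MULTILINE,
)


def set_database_mode(yaml_text, mode="ondisk"):
    """Return yaml_text with database_mode set to `mode` (hypothetical helper).

    If the key is missing, a new `database_mode` line is appended.
    """
    def replacement(match):
        return f"{match.group(1)}'{mode}'{match.group(2)}"

    new_text, n_replaced = _DB_MODE_LINE.subn(replacement, yaml_text, count=1)
    if n_replaced == 0:
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += f"database_mode: '{mode}'\n"
    return new_text


def write_custom_parameters(src, dst, mode="ondisk"):
    """Copy the parameter file at `src` to `dst` with database_mode overridden."""
    src, dst = Path(src), Path(dst)
    dst.write_text(set_database_mode(src.read_text(), mode))
    return dst


if __name__ == "__main__":
    # Made-up sample content, not the real asari parameters.yaml.
    sample = "project_name: demo\ndatabase_mode: 'memory'  # hypothetical value\n"
    print(set_database_mode(sample))
```

Calling `write_custom_parameters("parameters.yaml", "new_filename.yaml")` on your copy of the file would then give you the file to pass with --parameters new_filename.yaml.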

Other parameters have dedicated flags for changing them on the command line. I did not see one for database mode. I will implement one.

jmmitc06 commented 1 year ago

Although this change has not been pushed to the packaged version on PyPI yet, I have added an option to the CLI to change the database mode.

In this version, you can append --database_mode='ondisk' to the asari process command instead of using the parameters yaml.

P.S. On my machine, an M2 MacBook Pro with 32 GB of RAM, I was able to process the dataset locally without issue. I am not sure why it used significantly more memory on your machine. I assume you were using the default parameters in asari?

xulei99 commented 1 year ago

Thanks. I have changed the database_mode in the parameters.yaml file, and it worked. To be clear, my machine is a Dell OptiPlex 7070 with 32 GB of memory running Ubuntu 20.04.

jmmitc06 commented 1 year ago

Glad to hear it.

I'm going to close this issue since I believe we have resolved it. For now, continue to use the parameter file. In the next release, the command-line option should be sufficient.