opencb / opencga

An Open Computational Genomics Analysis platform for big data genomics analysis. OpenCGA is maintained and developed by its parent company, Zetta Genomics. Please contact support@zettagenomics.com for bug reports and feature requests.
Apache License 2.0

Parallelly loading multiple files #1354

Open Mohammedhusen opened 5 years ago

Mohammedhusen commented 5 years ago

Hi Team,

Does OpenCGA support loading multiple files in parallel? We tried submitting multiple files at the same time, but they are loaded sequentially.

j-coll commented 5 years ago

Hi @Mohammedhusen ,

How are you submitting the load? With one command per file to be indexed, or with multiple files per command? With the default storage engine, opencga-storage-mongodb, the best way to load batches of files is to provide multiple files in the same command, e.g.:

opencga.sh variant index --file file1.vcf.gz file2.vcf.gz file3.vcf.gz ....
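As a rough sketch of the batching approach, the snippet below assembles one index command from a list of files and prints it for review instead of running it. The helper name build_batch_index_cmd is hypothetical; only the "variant index --file" form shown above comes from this thread. File names containing spaces are not handled.

```shell
# Hypothetical helper: print a single index command covering every file passed in.
# Only the "opencga.sh variant index --file ..." form is taken from this thread.
build_batch_index_cmd() {
    printf 'opencga.sh variant index --file'
    for f in "$@"; do
        printf ' %s' "$f"
    done
    printf '\n'
}

# Preview the command for a few files (nothing is executed here).
build_batch_index_cmd file1.vcf.gz file2.vcf.gz file3.vcf.gz
```

Once the printed command looks right, it can be copied and run as-is.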

To get a parallel load, you need to connect the daemon to a queue manager. Currently, SGE and AzureBatch (which is not useful outside Azure) are supported. Alternatively, you can run opencga-analysis.sh instead of opencga.sh, which will run the index directly. This command is quite similar to its counterpart in opencga.sh, but instead of adding a new job to the queue, it runs the job directly. Also:

If you have a different queue system that is not supported, like SLURM or LSF, you can queue the opencga-analysis.sh command yourself. Be aware that by doing this, you will have to manage the level of parallelism yourself. It is better to first try providing a batch of files in the same command.

You can also check other load options, like "--merge-mode", that could help you improve performance. See Indexing Genomic Variants - Merge mode
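If you run opencga-analysis.sh directly and manage the parallelism yourself, a minimal sketch could look like the one below. It assumes that opencga-analysis.sh accepts the same "variant index --file" arguments as opencga.sh (described above only as "quite similar", so check before relying on it) and that your xargs supports -P, as the GNU and BSD versions do. MAX_JOBS, INDEX_PREFIX and run_parallel_index are hypothetical names. By default the commands are only printed.

```shell
# Upper bound on concurrent index jobs (hypothetical setting; tune for your system).
MAX_JOBS="${MAX_JOBS:-2}"
# Dry run by default: each command is echoed instead of executed.
# Set INDEX_PREFIX="" to actually run the commands.
INDEX_PREFIX="${INDEX_PREFIX-echo}"

run_parallel_index() {
    # One command per file (-n 1), at most MAX_JOBS running at once (-P).
    # ASSUMPTION: opencga-analysis.sh takes the same arguments as opencga.sh.
    # INDEX_PREFIX is intentionally unquoted so an empty value expands to nothing.
    printf '%s\n' "$@" |
        xargs -n 1 -P "$MAX_JOBS" $INDEX_PREFIX opencga-analysis.sh variant index --file
}

# Preview the per-file commands; the output order may vary between runs.
run_parallel_index file1.vcf.gz file2.vcf.gz file3.vcf.gz
```

Keeping MAX_JOBS small at first makes it easier to see whether concurrent loads actually help with your storage backend.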

Mohammedhusen commented 5 years ago

Hi @j-coll,

Thank you for your quick response on this.

I was submitting one command per file; I will now try with multiple files, as you suggested.

Thanks once again!