Parallelly loading multiple files

Mohammedhusen commented 5 years ago

Hi Team,

Does OpenCGA support loading multiple files in parallel? We tried submitting multiple files at once, but they are loaded sequentially.

j-coll commented 5 years ago

Hi @Mohammedhusen,

How are you submitting the load? One command per file to be indexed, or one command with multiple files? With the default storage engine, opencga-storage-mongodb, the best way to load batches of files is to provide multiple files in the same command, e.g.

opencga.sh variant index --file file1.vcf.gz file2.vcf.gz file3.vcf.gz ....

To get parallel loading you need to connect the daemon to a queue manager. Currently SGE and AzureBatch are supported (the latter is not useful outside Azure). Alternatively, you can run opencga-analysis.sh instead of opencga.sh, which will run the index directly. This command is quite similar to its counterpart in opencga.sh, but instead of adding a new job to the queue, it runs the job directly. Also:

You need to provide an empty --outdir folder, where the intermediate files will be written. This must be a filesystem directory.
You can provide a catalog --path in which to store the intermediate files. This step is optional: you can skip it and remove the intermediate files.

If you use a queue system that is not supported, like SLURM or LSF, you can queue the opencga-analysis.sh command yourself. Be aware that by doing this, you are responsible for managing the level of parallelism. It is better to first try providing a batch of files in the same command.
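
As a rough illustration of the "one opencga-analysis.sh command per file" approach, the sketch below only prints the commands you would then hand to your own scheduler. It assumes that opencga-analysis.sh accepts the same "variant index --file" form as opencga.sh together with the --outdir option described above; the function names and the ".out" directory naming are hypothetical, so check the command help of your installation before using it.

```shell
# Sketch only: prints one index command per input file; nothing is executed.
# Assumption: "variant index --file <file> --outdir <dir>" mirrors opencga.sh.
OPENCGA_ANALYSIS="${OPENCGA_ANALYSIS:-opencga-analysis.sh}"

# Hypothetical helper: print the index command for one file.
# $1 = input file, $2 = empty directory for the intermediate files
build_index_cmd() {
  printf '%s variant index --file %s --outdir %s\n' "$OPENCGA_ANALYSIS" "$1" "$2"
}

# Hypothetical helper: print one command per file, giving each job its own
# output directory under a base directory so that parallel runs do not share
# intermediate files. Paths containing spaces are not handled.
# $1 = base directory, remaining arguments = input files
print_index_cmds() {
  base="$1"
  shift
  for f in "$@"; do
    build_index_cmd "$f" "$base/$(basename "$f").out"
  done
}
```

For example, `print_index_cmds /scratch/index file1.vcf.gz file2.vcf.gz` prints two commands. Each output directory has to be created empty before its job runs, and the number of commands you submit to the scheduler at once is the parallelism level you are managing.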

You can also check other load options, like "--merge-mode", that could help you improve the performance. See Indexing Genomic Variants - Merge mode.

Mohammedhusen commented 5 years ago

Hi @j-coll,

Thank you for your quick response on this.

I was submitting one command per file; I will now try with multiple files, as you suggested.

Thanks once again!

opencb / opencga

Parallelly loading multiple files #1354