Open Mohammedhusen opened 5 years ago
Hi @Mohammedhusen ,
How are you doing the load submission? One command per file to be indexed, or multiple files?
Using the default storage engine, opencga-storage-mongodb, the best way to load batches of files is to provide multiple files in the same command, e.g.:
opencga.sh variant index --file file1.vcf.gz file2.vcf.gz file3.vcf.gz ....
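As a rough sketch of the batching advice above, a small shell helper can assemble that single command from a list of files. The function name build_index_cmd is hypothetical and not part of OpenCGA. It only prints the command so it can be reviewed before running, and it uses nothing beyond the "variant index --file" form quoted above.

```shell
# Hypothetical helper (not part of OpenCGA): print ONE index command that
# covers every file given as an argument. Nothing is executed here.
# Note: file names containing spaces are not quoted in the printed command.
build_index_cmd() {
  if [ "$#" -eq 0 ]; then
    echo "build_index_cmd: no input files given" >&2
    return 1
  fi
  printf 'opencga.sh variant index --file'
  for f in "$@"; do
    printf ' %s' "$f"
  done
  printf '\n'
}
```

For example, build_index_cmd /data/vcfs/*.vcf.gz (the directory is a placeholder) prints one command listing every matching file, instead of one command per file.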
To load files in parallel, you need to connect the daemon to a queue manager. Currently, SGE and AzureBatch are supported (the latter is not useful outside Azure).
Alternatively, you can run opencga-analysis.sh instead of opencga.sh, which will run the index directly. This command is quite similar to its counterpart in opencga.sh, but instead of adding a new job to the queue, it will run the job directly. Also:
--outdir: folder where it will write the intermediate files. This needs to be a filesystem directory.
--path: where to store the intermediate files. This step is optional. You can skip it and remove the intermediate files.
If you have a different, unsupported queue system, like SLURM or LSF, you can queue the opencga-analysis.sh command yourself. Be aware that by doing this, you will manage the parallelism level. It is better to first try providing a batch of files in the same command.
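For the SLURM case, a minimal dry-run sketch of what "managing the parallelism level" could look like is below. It is built on assumptions: that opencga-analysis.sh accepts "variant index --file ... --outdir ..." in the same shape as the opencga.sh command (the reply only says the two are quite similar), and that the output directory is a placeholder. The function name print_sbatch_batches is hypothetical. It prints one sbatch line per batch of files and submits nothing.

```shell
# Hypothetical helper: split the input files into batches of BATCH_SIZE and
# print one "sbatch --wrap" line per batch (dry run; nothing is submitted).
# ASSUMPTION: the opencga-analysis.sh arguments mirror the opencga.sh form.
# Usage: print_sbatch_batches BATCH_SIZE OUTDIR file1 file2 ...
print_sbatch_batches() {
  batch_size=$1
  outdir=$2
  shift 2
  batch=""
  count=0
  for f in "$@"; do
    batch="$batch $f"
    count=$((count + 1))
    if [ "$count" -eq "$batch_size" ]; then
      printf 'sbatch --wrap "opencga-analysis.sh variant index --file%s --outdir %s"\n' "$batch" "$outdir"
      batch=""
      count=0
    fi
  done
  # Flush the final, possibly smaller, batch.
  if [ "$count" -gt 0 ]; then
    printf 'sbatch --wrap "opencga-analysis.sh variant index --file%s --outdir %s"\n' "$batch" "$outdir"
  fi
}
```

The batch size sets how many files each job loads, so the number of printed lines is the number of jobs that may end up running side by side. Whether several concurrent loads behave well depends on the storage backend, so it is worth starting with a small number of jobs.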
You can also check other load options, like "--merge-mode", that could help you improve performance. See Indexing Genomic Variants - Merge mode
Hi @j-coll,
Thank you for your quick response on this.
I was submitting one command per file, and will now try with multiple files as you suggested.
Thanks once again!
Hi Team,
Does OpenCGA support loading multiple files at a time in parallel? Currently, we tried submitting multiple files at a time, but they are loaded sequentially.