ncbi / ngs

NGS Language Bindings

Concurrent access to a ReadCollection object #10

Closed by pditommaso 8 years ago

pditommaso commented 8 years ago

Is the ReadCollection Java API thread safe? I would like to call the getReadRange method on the same collection object from different threads in order to parallelise the download of reads.

kwrodarmer commented 8 years ago

The general answer is yes. In particular, you'd want to obtain iterators, or a Reference and get iterators from that. The iterators are not thread safe, of course, but are designed to be used one-per-thread. So an approach of sharing an iterator across several threads will not work, but if you split a ReadCollection into several ranges and get an iterator per range, that will work.
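A minimal sketch of the range-per-thread approach described above, in plain Java. The names `RangeSplit`, `Range`, `RangeTask`, `split` and `processAll` are hypothetical helpers, not part of NGS. The sketch assumes read rows are numbered from 1 and that each task would obtain its own iterator inside its thread, for example through `getReadRange` on the shared ReadCollection; check the exact signature against the NGS Javadoc before relying on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RangeSplit {
    /** A contiguous block of reads: first row (assumed 1-based) and row count. */
    static final class Range {
        final long first;
        final long count;
        Range(long first, long count) { this.first = first; this.count = count; }
    }

    /** Work done on one range; each call runs in a single thread. */
    interface RangeTask {
        long process(Range range) throws Exception;
    }

    /** Split rows 1..total into at most `parts` contiguous, near-equal ranges. */
    static List<Range> split(long total, int parts) {
        List<Range> ranges = new ArrayList<>();
        if (total <= 0 || parts <= 0) return ranges;
        long base = total / parts;
        long extra = total % parts;   // the first `extra` ranges get one more row
        long first = 1;
        for (int i = 0; i < parts; i++) {
            long count = base + (i < extra ? 1 : 0);
            if (count == 0) break;    // more parts than rows
            ranges.add(new Range(first, count));
            first += count;
        }
        return ranges;
    }

    /** Run `task` once per range on a fixed thread pool and sum the results. */
    static long processAll(long total, int threads, RangeTask task) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (Range r : split(total, threads)) {
                // In a real program the task body would create its own iterator
                // here (one per thread) and never share it with other threads.
                Callable<Long> job = () -> task.process(r);
                futures.add(pool.submit(job));
            }
            long sum = 0;
            for (Future<Long> f : futures) sum += f.get();
            return sum;
        } finally {
            pool.shutdown();
        }
    }
}
```

With the total taken from the collection's read count, `processAll(total, 4, r -> ...)` would hand each of four threads its own range; the lambda is where a per-range iterator would be opened and consumed.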

pditommaso commented 8 years ago

OK. It makes sense. Thanks.

One more question: I've noticed that reads are cached in a folder in the user's home directory. Is it possible to specify the location of the cache directory, or possibly to disable it?

kwrodarmer commented 8 years ago

Yes. See documentation on configuring VDB (the engine underneath NGS) using the sra-toolkit:

https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration

kwrodarmer commented 8 years ago

You can set the cache location via configuration. By default (i.e. in absence of any formal input from the user), it chooses $HOME/ncbi/public for open-access data, and $HOME/ncbi/dbGaP-xxxx where the x's indicate a particular project to which you have access.

Turning off caching altogether makes sense if either a) you never re-read data, or b) you have a relatively fast internet connection. In the latter case, you're essentially using NCBI's storage as a networked drive.

pditommaso commented 8 years ago

Nice, thanks. However, that documentation does not mention the API. Is it possible to configure those options using the Java API?

kwrodarmer commented 8 years ago

The answer will require a little bit of background...

The NGS API is intended to be vendor neutral and supports simultaneous loading of distinct engines. The NCBI-NGS engine that plugs in underneath is just one of the possible engines that could exist.

As such, we cannot add NCBI-specific features or configuration to the API. What we can do, however, is what we are doing right now, which is to create NCBI-specific extensions to NGS. The configuration class you mention is under development right now, since we need it, too. We can keep you updated on progress.

By the way - are you able to comment on what you're doing with the NGS API? We're always curious to hear!

pditommaso commented 8 years ago

That's good news. Yes, the idea is to integrate the NGS API into the Nextflow framework so that a pipeline script can download/access reads by specifying one or more accession numbers.

You can read more at this link: https://github.com/nextflow-io/nextflow/issues/89

kwrodarmer commented 8 years ago

Thanks for the info. I just posted a response there as well.

pditommaso commented 8 years ago

I'm closing this thread and will open a separate issue for the cache configuration options API.