peterjc / galaxy_blast

Galaxy wrappers for NCBI BLAST+ and related BLAST tools.
Data manager for the BLAST database *.loc files? #22

Open peterjc opened 10 years ago

peterjc commented 10 years ago

Can we use the new Galaxy Data Manager framework to make it easier to manage the BLAST databases configured via *.loc files? https://wiki.galaxyproject.org/Admin/Tools/DataManagers/

peterjc commented 10 years ago

Related to this, writing tests with automatically installed *.loc databases would be useful (see also issue #3). This would be needed to test rpsblast, rpstblastn and deltablast, which require a domain database.

ddooley commented 10 years ago

Looks like an attractive way to go! Thanks for introducing me to this.

peterjc commented 10 years ago

Daniel Blankenberg has done some related work on this here, CC @jj-umn http://testtoolshed.g2.bx.psu.edu/view/blankenberg/data_manager_example_blastdb_ncbi_update_blastdb

ddooley commented 10 years ago

Thanks for the reference! Something like this data manager approach is a needed plank in what we want to do. Not sure if I mentioned this, but we're seeing a plethora of specific curated gene databases (having "primary target" sequences), e.g. http://www.cpndb.ca's CPN60 Chaperonin database or hpa.org.uk's Legionella mip database, which are very useful for distinguishing bacterial clades but which currently have no data connection to Galaxy. So we'd like to develop a Galaxy tool that acts as a (more or less generic) gateway to each of these reference databases: a way to create, describe (using ontology), and periodically synchronize with 3rd party sources those reference databases needed in an institution's Galaxy install. (Of course one challenge is that some of these databases have only a web query form online, rather than direct web URLs to their file(s), but that's another battle.)

Damion



mike8115 commented 10 years ago

I'd be interested to see the NCBI BLAST wrappers use the data tables so that it can be used with data managers. I've been planning on getting a data manager work for the NCBI databases, but I realized that it's not going to work if the BLAST wrappers are still using ...

peterjc commented 10 years ago

See also the recent paper on the Galaxy Data Managers, Blankenberg et al (2014) Wrangling Galaxy's reference data http://dx.doi.org/10.1093/bioinformatics/btu119 and associated wiki pages https://wiki.galaxyproject.org/Admin/Tools/DataManagers

Paging @jj-umn - are you still planning to look at this (data managers for BLAST+), or should we try to get Daniel Blankenberg more directly involved?

jj-umn commented 10 years ago


It is still on my long list of TODOs, but I just thought it was something that should be done. I'm fine with whomever can get to it first.

James E. Johnson, Minnesota Supercomputing Institute, University of Minnesota

mike8115 commented 10 years ago

I could try and lend a hand with this if everyone is busy with other tasks.

jj-umn commented 10 years ago


Sounds good.

I'm presuming you've found: https://wiki.galaxyproject.org/Admin/Tools/DataManagers And there are a few data managers in the testtoolshed.

JJ

James E. Johnson, Minnesota Supercomputing Institute, University of Minnesota

mike8115 commented 10 years ago

Sorry for the wait, I did use the wiki pages on Data Managers. I've got it working in my local instance of Galaxy and my group's development instance and pushed the changes to my fork of galaxy_blast. Is there anything I should look at or do before making a pull request here?

peterjc commented 10 years ago

I have a tentative plan for where to put the data managers in the repository - https://github.com/peterjc/galaxy_blast/commit/d24a5e6a96f17721d076859ac9975ffc6c012e83 - it seems clearer to me not to put them under the existing tools folder.

@mike8115 - I'll comment on your fork's commit about this and other folder placement issues... it might be best to use a named branch rather than working on your master branch.

Also given you've started from Daniel's http://testtoolshed.g2.bx.psu.edu/view/blankenberg/data_manager_example_blastdb_ncbi_update_blastdb I'd like to explicitly get his permission to include his work here with proper attribution. Update email sent http://lists.bx.psu.edu/pipermail/galaxy-dev/2014-April/018987.html

blankenberg commented 10 years ago

Hi all, this sounds great. Let me know what I can do to help. Thanks.

peterjc commented 10 years ago

Thanks Dan :) So, first of all have you any objection to your NCBI update example being added to this repository under the MIT licence?

Assuming that's fine, would you prefer to do this yourself as a pull request (subject to debating folder names etc), or let me prepare a commit with @blankenberg as the author (which I can do on a feature branch for your approval, before applying to the master branch)?

e.g. I don't follow why you have some files under data_manager_blastdb/ and others under data_manager_blastdb/data_manager/. See also https://github.com/peterjc/galaxy_blast/commit/d24a5e6a96f17721d076859ac9975ffc6c012e83 - I want to be able to add sister folders in future for any other BLAST related data managers, for example perhaps data_managers/pgsc_blastdb/ for setting up and fetching BLAST databases based on the Potato Genome Sequencing Consortium's releases, or data_managers/human_blastdb/ for the human genome?

Once that's done, we can rebase/reapply @mike8115's work which will clearly show his changes.

We'll also need to discuss where the BLAST data manager(s) will live on the Tool Shed, which could be under the IUC account. That could mean deprecating the NCBI BLAST data manager example in favour of a new location?

blankenberg commented 10 years ago

I have no objection to an MIT license.

Whatever is easier/more convenient for you.

As far as paths are concerned, the data_manager_blastdb is just the repository root, with data_manager being a sub-directory containing the tool script and xml. You should be able to have whatever directory structure that will fit your use case: i.e. the data manager stuff doesn't require any special-cased named directory. Any number, including none, of subdirectories is fine, as long as the relative paths point to the needed files. [The reason for why I did it in that particular way is just aesthetics -- to mirror a 'tools/data_manager' directory, with the additional configurations in the parent directory] So doing what you propose will work well, but we want to avoid having several (near) exact copies of any required tool scripts due to the sub-directories -- without looking at exactly the differences between getting a blastdb release from PGSC or human from ncbi, it is hard to claim which is best, but one could have a single Data Manager tool with e.g., a dropdown to select the source location.

Having the new, more useful, data manager under the IUC would be great. I have no problem deprecating the existing NCBI BLAST data manager when it is ready.

mike8115 commented 10 years ago

If either of you are busy with other tasks, I could make the changes and submit the pull request. Since my group is interested in having this feature available in Galaxy, I could dedicate the time to finish it. But if you have the time, I suppose you would be able to make the necessary changes to bring the tool up to IUC standards a lot faster than I would.

Splitting the data manager would create a reasonably different script in this case. Presently, I use BLAST's update_blastdb.pl script to retrieve nucleotide and protein databases. Protein domain databases are retrieved by using the ftplib module since they are not available via update_blastdb.pl. I feel that most data managers would be inherently similar, given that the input and output mainly differs by the name and source location, but having one data manager tool controlling too many databases could get messy visually.
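
The two retrieval routes described above could be sketched roughly as follows. This is only an illustration, not the data manager's actual script: the function name `retrieval_plan` is hypothetical, and the FTP host, directory and `_LE.tar.gz` file naming for the domain databases are assumptions that should be checked against the current NCBI layout. The sketch only describes what would be fetched and does not touch the network.

```python
# Assumed NCBI FTP location for the pre-formatted domain databases;
# verify against the current NCBI site before relying on it.
CDD_HOST = "ftp.ncbi.nih.gov"
CDD_DIR = "/pub/mmdb/cdd/little_endian"


def retrieval_plan(db_name, db_kind):
    """Describe how a database would be fetched, without downloading anything."""
    if db_kind in ("nucleotide", "protein"):
        # update_blastdb.pl ships with BLAST+; --decompress unpacks the archives.
        return {
            "method": "update_blastdb.pl",
            "command": ["update_blastdb.pl", "--decompress", db_name],
        }
    if db_kind == "domain":
        # These archives would be pulled with ftplib.FTP(...).retrbinary(...).
        return {
            "method": "ftplib",
            "host": CDD_HOST,
            "remote_path": "%s/%s_LE.tar.gz" % (CDD_DIR, db_name),
        }
    raise ValueError("unknown database kind: %r" % db_kind)


nucleotide_plan = retrieval_plan("vector", "nucleotide")
domain_plan = retrieval_plan("Cog", "domain")
```

Keeping the choice of route in one small function is one way a single tool could serve several sources without duplicating the download code.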

ddooley commented 10 years ago

I'll just throw this out there if any of you are interested ... it's a tangent on the subject of recreating search results, namely, being able to specify a date or version of a database to recreate the search for. There's a scenario where we'd bring up a slightly customized version of the blast search that has an extra input for entering a date, and behind the scenes we determine which version of a database we'd need, and recall that version to search it. So the first question I have is: have others been getting requests to provide this kind of solution?

I tested out git to see if it could quickly bring back versions of a largish nucleotide fasta file. Git's diff algorithm completely flops unless you format each fasta entry as a single line (tab delimited) merged from a multi-line entry. But it does handle, say, 20 versions of a 640 MB file pretty well, being able to recreate a version within 8 seconds or so. I've tested it up to about 2 GB, where it takes about 30 seconds to retrieve a version. Seems like the formula is roughly 50 MB / second.

Another git test however flopped on a protein database formatted in the same fashion. What happens is git decides to delete every row of an old version file and then insert every line of a new version. Very fussy about the composition of lines in its diff algorithm.

In the end I scripted a python diff that does the same thing in a low-tech fashion, and it handles any size of fasta file, creating a database of versions that's about 1.1x bigger than the latest version, at about the same rate as git. So I'm thinking of that now as the mainstay for the scheme.
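
For anyone wanting to experiment with the idea, here is a minimal sketch of the flattening step (one tab-delimited line per FASTA record) plus a naive record-level diff between two versions. It is not the script mentioned above; the function names are made up for illustration, and the set-based comparison assumes each version fits in memory.

```python
def flatten_fasta(lines):
    """Yield one tab-delimited line per FASTA record: header<TAB>sequence."""
    header, chunks = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header + "\t" + "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header + "\t" + "".join(chunks)


def diff_versions(old_records, new_records):
    """Return (deleted, inserted) record lines between two flattened versions."""
    old_set, new_set = set(old_records), set(new_records)
    deleted = [r for r in old_records if r not in new_set]
    inserted = [r for r in new_records if r not in old_set]
    return deleted, inserted


old = list(flatten_fasta([">seq1 demo", "ACGT", "ACGT", ">seq2", "GGCC"]))
new = list(flatten_fasta([">seq1 demo", "ACGT", "ACGT", ">seq3", "TTAA"]))
deleted, inserted = diff_versions(old, new)
```

Storing only the deleted and inserted records per release, alongside the latest full file, is one way a version store could stay close to the size of the newest version.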

In terms of blast/fasta database management I've been considering another approach that sits outside of galaxy, namely the http://biomaj.genouest.org/ biomaj software, which focuses on regular scheduled downloading. Are any of you familiar with it? Our comrades at the National Microbiology Lab in Winnipeg have been using it and have been satisfied. Apparently it can trigger hooks both before and after file download to customize synchronization/generation of the file-based databases from fasta or other file downloads.

I can imagine the same thing done under the Galaxy hood too - a scheduled download + processing hooks? Currently we're also using update_blastdb.pl which is ok but we agree lacks a gui to enable easy management and monitoring of data sources.

Feedback appreciated,

d.



peterjc commented 10 years ago

The original blastdb.loc, which was written by the Galaxy team, recommended simply having multiple date-stamped copies of BLAST databases (each with a line in the *.loc file) for full reproducibility. However, that means 'wasting' a lot of disk space for large regularly updated databases like NT/NR, so we (and I suspect most administrators) simply have one regularly updated copy of NT/NR (using update_blastdb.pl or another mechanism). This does break full reproducibility.

Perhaps ideally the NCBI BLAST database manager could support either approach, or is that too complicated?

peterjc commented 10 years ago

I've started integrating Dan's data manager into this repository on the branch https://github.com/peterjc/galaxy_blast/tree/data_manager - the initial commit https://github.com/peterjc/galaxy_blast/commit/21d7cff3e8dca13c4bd0f716a79a9f00c59f0b5c just checked in the files from Dan's Test Tool Shed version (adjusting the folder structure), the later commits are just minor tidying.

@blankenberg - does this look OK to you? If so, I will apply that to the master branch. If not, we can tweak things.

@mike8115 - once that is done, we'll look at rebasing your work on top of this.

mike8115 commented 10 years ago

Currently, the data manager tool uses the recommended method, which @peterjc mentioned, of retaining multiple copies of a BLAST database. For something like @ddooley's diff script, it is something worthwhile to look at. It would reduce the disk space requirement and maintain reproducibility, but I wouldn't know how to implement that. To accommodate that approach, we would have to rewrite the script to handle the fasta files, rather than the pre-formatted databases. Beyond that, I'm not sure what to do about that.

In regards to having a scheduled download, from the tool's perspective, it only runs on demand. Having it run routinely is something beyond the data manager tool's control, but I presume it would be possible to do that externally via API?

@peterjc - Sounds great. Let me know if anything needs to be done on my end.

blankenberg commented 10 years ago

@peterjc Looks good to me. (But haven't had a chance to test it yet)

@mike8115 e.g., A cron job could be set up to run a script against the API to automatically run the data manager tool to fetch new database versions.
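
As a rough sketch of what such a script might do, the snippet below builds (but does not send) a POST request to Galaxy's `/api/tools` endpoint. Everything specific here is an assumption: the server URL, API key and history ID are placeholders, the tool ID and the `blastdb_name` input name are hypothetical, and the exact payload a given Galaxy release expects for a data manager tool should be checked against its API documentation.

```python
import json
import urllib.parse
import urllib.request


def build_run_request(galaxy_url, api_key, tool_id, history_id, inputs):
    """Build (but do not send) a POST request asking Galaxy to run a tool."""
    query = urllib.parse.urlencode({"key": api_key})
    url = galaxy_url.rstrip("/") + "/api/tools?" + query
    body = json.dumps(
        {"tool_id": tool_id, "history_id": history_id, "inputs": inputs}
    )
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "http://localhost:8080/",      # hypothetical Galaxy instance
    "MY_API_KEY",                  # placeholder admin API key
    "ncbi_blastdb",                # hypothetical data manager tool ID
    "HISTORY_ID",                  # placeholder history ID
    {"blastdb_name": "vector"},    # hypothetical input parameter name
)
# A scheduled script would then submit it with urllib.request.urlopen(request).
```

A crontab entry such as `0 3 * * 0 python3 /path/to/run_data_manager.py` (path hypothetical) would then trigger the fetch weekly.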

peterjc commented 10 years ago

@blankenberg - thanks, I've pushed that to the master branch now. I've not tested it yet either - hopefully adding it to the TravisCI setup will be straightforward. I'll also want to add installation instructions, and the tar command for preparing a ToolShed upload.

peterjc commented 10 years ago

@mike8115 I've tried to rebase/merge your work on the branch https://github.com/peterjc/galaxy_blast/tree/data_manager - see https://github.com/peterjc/galaxy_blast/commit/42cb87541c823948f89ced269981b94e83b5fb78

Can you have a look at this, particularly the unit tests - was there a reason for dropping Dan's original test?

peterjc commented 10 years ago

Looking at @mike8115's work, he uses the Data Table approach - https://wiki.galaxyproject.org/Admin/Tools/Data%20Tables - to switch the BLAST+ wrappers use of the *.loc files from:

    <param name="database" type="select" label="Nucleotide BLAST database">
        <options from_file="blastdb.loc">
            <column name="value" index="0"/>
            <column name="name" index="1"/>
            <column name="path" index="2"/>
        </options>
    </param>

To the shorter:

    <param name="database" type="select" label="Nucleotide BLAST database">
        <options from_data_table="blastdb" />
    </param>

The column information is instead defined via tool-data/tool_data_table_conf.xml.sample:

    <table name="blastdb" comment_char="#">
        <columns>value, name, path</columns>
        <file path="tool-data/blastdb.loc" />
    </table>
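
To make the column mapping concrete, here is a small sketch of how a tab-separated *.loc file with these three columns could be read. It only illustrates the file layout and is not Galaxy's own parser; the function name and the sample row are invented.

```python
def parse_loc(text, columns=("value", "name", "path"), comment_char="#"):
    """Parse tab-separated *.loc content into a list of dicts keyed by column."""
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(comment_char):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < len(columns):
            continue  # skip malformed lines rather than guess
        entries.append(dict(zip(columns, fields)))
    return entries


# Invented sample content: one comment line and one database entry.
sample = (
    "# value, name, path (tab separated)\n"
    "vector_2014_04_10\tUniVec 2014-04-10\t/data/blastdb/vector\n"
)
entries = parse_loc(sample)
```

Here `entries[0]["path"]` would be the database path prefix that a BLAST+ wrapper passes to its `-db` option.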

However, this new XML file is a potential dependency headache - which ToolShed repository would it belong to - probably a new (common) dependency like the BLAST datatypes?

My instinct right now is not to use the Data Table approach at all - aside from the dependency problem, this would actually increase the number of lines of XML in our code base since right now our *.loc file columns are defined once in ncbi_macros.xml.

I've started a thread on the galaxy-dev list to discuss this issue: http://lists.bx.psu.edu/pipermail/galaxy-dev/2014-April/019023.html / http://dev.list.galaxyproject.org/Data-Tables-and-loc-files-Using-named-columns-versus-from-data-table-tc4664149.html

mike8115 commented 10 years ago

@peterjc I dropped the original test because the script generated unique ID's by generating a hash value for all the directories and folders. I changed that to use the date instead since the databases update on a daily basis anyway. I suppose I could have modified his test to use regular expressions instead of a strict match.

If we decide to move away from data tables, I'm not too sure how else to get tools to use new databases downloaded by the data manager.

peterjc commented 10 years ago

Updating Dan's test to use regular expressions would be nice, but on the other hand the est database is a big one to download just for running a test!

Ah. Assuming we must use the Data Table interface for the Data Manager, won't that ultimately update the blastdb*.loc files which the BLAST+ wrappers can continue to use as before?

jj-umn commented 10 years ago

I'll let Dan correct me if I'm wrong, but I think the entries from various blastdb.loc files would be merged in memory but not persisted to the single file: blastdb.loc

blankenberg commented 10 years ago

@jj-umn is correct. Data managers use namespacing on the .loc files that are installed from a toolshed, i.e. the entries created by the data manager tool will not end up in tool-data/blastdb.loc.
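
A toy model of that behaviour, purely for illustration: rows from several sources are combined into one in-memory table and nothing is written back to any *.loc file. This is not Galaxy's implementation; the function name, the `origin` field and the rule that the first entry wins on a duplicate `value` are all assumptions.

```python
def merge_tables(sources):
    """Merge rows from (origin, rows) pairs in memory, keyed by 'value'.

    No file is modified; on a duplicate 'value' the first entry seen wins.
    """
    merged = {}
    for origin, rows in sources:
        for row in rows:
            merged.setdefault(row["value"], dict(row, origin=origin))
    return list(merged.values())


wrapper_rows = [{"value": "nt", "name": "NT", "path": "/data/blastdb/nt"}]
manager_rows = [
    {"value": "vector_2014_04_10", "name": "UniVec", "path": "/dm/vector"},
    {"value": "nt", "name": "NT (duplicate)", "path": "/dm/nt"},
]
table = merge_tables([("wrapper", wrapper_rows), ("data_manager", manager_rows)])
```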

peterjc commented 10 years ago

More detailed reply from @blankenberg on the mailing list:

Having a standalone repository that just contained the tool data table and .loc file
that could be a dependency of other repositories would be a good way to go here.
Unfortunately, this isn’t supported right now. I’ve opened a trello card for this:
https://trello.com/c/VZxV08Qt

However, even though you currently need to include the tool data table definition
and .loc sample in each repository in order for the tool to be valid, it is still
a best practice to use tool data tables.

See http://lists.bx.psu.edu/pipermail/galaxy-dev/2014-April/019027.html - and the Trello Issue https://trello.com/c/VZxV08Qt which says:

Currently a repository with a tool that requires a tool data table must have that
tool data table included within its own repository. This causes duplication of this
files in each repository that needs them.

We could allow a repository having (just?) the data table definition and the
.loc.sample to be a dependency of other repositories.

Bonus points if we were to allow optional namespacing of the table name based
upon its repos (since there is currently a possibility of name collisions).

It looks like we could have identical copies of the tool-data/tool_data_table_conf.xml.sample and the tool-data/*.loc.sample files included in multiple ToolShed repositories. They should be almost static so version clashes should not be a problem?

mike8115 commented 10 years ago

I would say that tool_data_table_conf.xml.sample and *.loc.sample need to be consistent across all tools, but I haven't tested to see what happens if there are inconsistencies in the tool_data_table_conf.xml.sample file (i.e. having two data tables of the same name, but with different column definitions). If the Trello card that Dan opened gets fulfilled, it would definitely clear up the dependency issues.

@peterjc Aside from removing Dan's initial test with the EST database, I don't see other concerns. https://github.com/mike8115/galaxy_blast/commit/98326fa75516f7d8fe8a270de5d76fb02aa182e6 contains my last edits based on your current repository.

peterjc commented 10 years ago

@mike8115 - why did you change the tool ID from data_manager_blast_db to ncbi_blast_plus_update_db? If we're making substantial changes to Dan's wrapper there doesn't seem to be any harm changing it, but how about ncbi_blastdb which I think is short and sweet (and matches the folder name). In fact, we might as well rename the XML file at the same time?

Another thought: Does it make sense to use the last modified date of the *.tar.gz files on the NCBI FTP site rather than the download date?
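
One possible way to do this, sketched under assumptions: the FTP `MDTM` command returns a file's modification time as `213 YYYYMMDDHHMMSS`, which Python's `ftplib` exposes through `FTP.sendcmd`. The helper names and the identifier format below are made up, and whether the NCBI server answers `MDTM` for these files would need checking.

```python
from datetime import datetime


def parse_mdtm(response):
    """Parse an FTP MDTM reply such as '213 20140410123456' into a datetime."""
    code, _, stamp = response.strip().partition(" ")
    if code != "213":
        raise ValueError("unexpected MDTM reply: %r" % response)
    # Ignore any fractional seconds that may follow the 14-digit timestamp.
    return datetime.strptime(stamp[:14], "%Y%m%d%H%M%S")


def version_id(db_name, modified):
    """Build a date-stamped identifier from the remote modification date."""
    return "%s_%s" % (db_name, modified.strftime("%Y_%m_%d"))


# With a live connection the reply would come from something like
# ftplib.FTP(host).sendcmd("MDTM " + remote_path).
stamp = parse_mdtm("213 20140410123456")
identifier = version_id("vector", stamp)
```

Using the remote date means that fetching the same archive twice would yield the same identifier, unlike an ID based on the download date.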

Also, for now at least, I will not change ncbi_macros.xml to use the Data Tables (i.e. using `

mike8115 commented 10 years ago

I didn't notice the ID difference. Changing it to ncbi_blastdb is fine with me and it would make sense to change the XML file to match.

Changing the script to do that should be quick. https://github.com/mike8115/galaxy_blast/commit/0785fea8c18eec3ae41ae453fc02354b5a608005 holds all of those changes.

With the data tables, the .loc files from the NCBI BLAST wrappers won't be modified as far as the data manager is concerned. New databases from the data manager are added to the .loc file in the data manager's files. The data table is simply a list of known .loc files that a tool can access to receive entries that it could otherwise not get. Just keep in mind that until the wrappers are set to look at other .loc files, it won't benefit from the data manager unless users manually add the entries from the data manager's .loc file to the wrappers' .loc file.

peterjc commented 10 years ago

@mike8115 I've asked Dan about the pre-existing ID inconsistency, see comments on https://github.com/peterjc/galaxy_blast/commit/21d7cff3e8dca13c4bd0f716a79a9f00c59f0b5c

And regarding the *.loc files, I appreciate that until we switch the BLAST+ wrappers to using the Data Table approach they won't see any databases added via the Data Manager. I want to keep the wrapper on the master branch stable & ready to release while we work on the Data Manager (which thus far I have not been able to test).

peterjc commented 10 years ago

@mike8115 I'm having trouble getting your tests to pass, both locally and via TravisCI. Work in progress here: https://github.com/peterjc/galaxy_blast/commits/data_manager2

For the protein database tests, the patent amino acids database isn't too large so it might be the best choice (pataa.tar.gz is about 250MB). Likewise, for the protein domain databases, Cog is the smallest at about 160MB.

However, for the nucleotide databases, the patnt database is big (three and a bit chunks, so about 3GB). How long does it take to download for you? It would be good to have a smallish multi-file database but even est_human is about 1GB to download (est_human.00.tar.gz and est_human.01.tar.gz). How about we use the tiny half a megabyte vector.tar.gz instead?

mike8115 commented 10 years ago

Hi Peter,

Thanks for looking into this. Sorry about that mismatch in the tool xml file. It was a sloppy error on my part. I agree with your choices for the databases. I really didn't consider the size of the tests when I added them. Do you want me to rewrite the tests?

peterjc commented 10 years ago

@mike8115 Great - can you work from that data_manager2 branch to modify the tests? Also have I missed anything in the installation instructions for the README file?

mike8115 commented 10 years ago

@peterjc I've copied your branch into my repo and made the changes there https://github.com/mike8115/galaxy_blast/commit/17d760a3e5e07dac4c00758cbca706900199d431. From my machine, the tests completed in 80s. The README looks great, but I've removed some repeated lines in the manual installation section. Otherwise no changes are needed.

Also, there are still some outstanding changes mentioned in https://github.com/peterjc/galaxy_blast/issues/22#issuecomment-40212483 that have yet to be applied in your branch.

abretaud commented 9 years ago

Hi, I am writing some code to allow biomaj software (http://biomaj.genouest.org discussed above) to automatically update databank list using the data manager. It works fine, but the only thing that is missing for blast tools is to update ncbi_macros.xml to use the Data Tables (i.e. using <option from_data_table="..."/>). Is there any progress on this? It would be great if the modification was applied to the stable wrappers as it is the last small bit of code to make all the data table/manager stuff work for blast tools. I have been playing a lot with data managers lately and it seems to work quite well now. Anthony

peterjc commented 9 years ago

I've filed #52 for the specific sub-issue of using the new Data Tables approach to defining the columns in the *.loc files for our BLAST databases.

ddooley commented 9 years ago

Ah, I'd like to hear about this when you are finished, Anthony. I'm just trying out Biomaj now (as well as finishing a versioned database recall system for Galaxy, but it doesn't use data managers).

Damion



abretaud commented 9 years ago

@ddooley Yep, no problem! The code will be available on GitHub. I'm currently testing it and fixing some bugs; it should be ready soon. I will tell you when I am done.

peterjc commented 6 years ago

In discussion with @blankenberg and @bimbam23 in Portland at GCCBOSC 2018, this example would be useful to look at - it can fetch from pre-defined sources, a URL, or a history entry:

https://github.com/galaxyproject/tools-iuc/tree/master/data_managers/data_manager_fetch_genome_dbkeys_all_fasta

It would be nice to be able to pull in a BLAST database from your history, pull in a FASTA file from your history or a URL and build the database with makeblastdb, or download a zipped database from a remote server.
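
For the build-from-FASTA case, the core step would be a makeblastdb call along these lines. This is only a sketch: the helper name and the file names are invented, and a real data manager would also need to register the resulting path in the data table.

```python
import shlex


def makeblastdb_command(fasta_path, dbtype, out_prefix, title):
    """Return the makeblastdb argument list for building a database from FASTA."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return [
        "makeblastdb",
        "-in", fasta_path,
        "-dbtype", dbtype,
        "-out", out_prefix,
        "-title", title,
    ]


cmd = makeblastdb_command("genome.fasta", "nucl", "blastdb/genome", "Example genome")
# Shell-quoted form, handy for logging what would be run.
printable = " ".join(shlex.quote(arg) for arg in cmd)
# To run it for real: subprocess.check_call(cmd), with BLAST+ on the PATH.
```

Passing the arguments as a list rather than a shell string avoids quoting problems with titles or paths that contain spaces.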

chambm commented 2 years ago

Has anybody started implementing this yet?