oda-hub / oda-bot


Galaxy: allow tool_data #49

Open dsavchenko opened 7 months ago

dsavchenko commented 7 months ago

Currently, only scripts and XML files end up in the tool. Some static data may be needed, like a small, static public dataset in the HESS tool or model grids in the Phosphoros photo-z tool. It could live either in the repo (via LFS) or be downloaded from a given URL upon tool creation.

dsavchenko commented 7 months ago

Also, it will be necessary to add test data to the tool, e.g. to enable testing tools that take data input.

dsavchenko commented 7 months ago

For the test data: https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/remote_test_data_location.xml

As we (almost) assume that the input files are in the repo (so that notebooks run from top to bottom), their raw paths can be used. If a file is not there, some user intervention will be required anyway, e.g. adding some config.
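For context, a test using the mechanism from the linked example looks roughly like this (the parameter names and URL below are hypothetical; the `location` attribute requires Galaxy >= 23.1):

```xml
<!-- hypothetical tool fragment: with Galaxy >= 23.1, the test input is
     fetched from the given URL and staged by Galaxy, so no copy has to
     live in the tool's test-data/ directory -->
<tests>
    <test>
        <param name="input" location="https://raw.githubusercontent.com/some-org/some-repo/main/tests/data/input.fits"/>
        <output name="output" file="expected_output.txt"/>
    </test>
</tests>
```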

dsavchenko commented 7 months ago

> For the test data: https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/remote_test_data_location.xml

@bgruening could you give advice? I'm trying to use remote test data for our tools, as in the cited example. While testing locally with planemo, the tests fail. Moreover, even this functional test fails for me. I'm simply calling `planemo test --galaxy_root /path/to/galaxy /path/to/tool.xml`.
All errors in the report look like `Input staging problem: Test input file (hello.txt) cannot be found.`

Am I missing something simple here?

bgruening commented 7 months ago

Have you checked that you have a recent planemo/Galaxy version? Remote test files are a new feature.

dsavchenko commented 7 months ago

> Have you checked that you have a recent planemo/Galaxy version? Remote test files are a new feature.

I did check, of course, that the Galaxy release I use is at least 23.1 (where this feature was introduced), but I was indeed using an older version of planemo. I thought planemo's role in running tool tests was more limited, and that Galaxy itself takes care of staging the test data. With the latest planemo, the tests work. Thank you for pointing this out!


Getting back to the first point in this issue: if we have a tool that always uses the same static dataset (not big data), which is always referenced and used as a whole, what is the best (or most "galactic") way to deal with such a situation?

This is more of an extended note for myself, but I appreciate any comments and thoughts...

Also, regardless of the implementation, if this small dataset has some kind of persistent identifier (DOI, etc.), it needs to be expressed somewhere.

bgruening commented 7 months ago

> > Have you checked that you have a recent planemo/Galaxy version? Remote test files are a new feature.

> I did check, of course, that the Galaxy release I use is at least 23.1 (where this feature was introduced), but I was indeed using an older version of planemo. I thought planemo's role in running tool tests was more limited, and that Galaxy itself takes care of staging the test data. With the latest planemo, the tests work. Thank you for pointing this out!

Cool, glad it works now!

> Getting back to the first point in this issue: if we have a tool that always uses the same static dataset (not big data), which is always referenced and used as a whole, what is the best (or most "galactic") way to deal with such a situation?

> * storing it as tool_data inside the tool. This could be an option for small data. Say we have some static "reference" data that the script relies on, e.g. telescope filter parameters. Datasets on the order of 100 MB seem too big to be put in the toolshed repo. Also, ideologically, it seems wrong to duplicate and embed data that is already available remotely.

In the IUC, which is our best-practice recommendation, we recommend keeping test data below 1 MB. For everything bigger, I would recommend using external resources. However, you define the rules for the Astro-Galaxy community. And if the test data is static, you can really just go all in and use only remote test data.

> There are two reasons remote test data is maybe not good (that I'm aware of):

> * the workaround is to download a file inside the job. Some of our prototypes do this. Not ideal at all: it increases network traffic, the job runner environment may be network-restricted, etc.

Once the Galaxy dataset caching is enabled by default the increased network traffic might not be such a big problem, but the restricted network access is.
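For illustration, the in-job download workaround mentioned above looks roughly like this in a tool's command block (the URL and script name are placeholders, not from any real tool); as noted, it breaks wherever the job runner has no network access:

```xml
<!-- hypothetical tool fragment: the dataset is fetched at job runtime -->
<command detect_errors="exit_code"><![CDATA[
    curl -sSL -o dataset.tar.gz 'https://data.example.org/release-1.0.tar.gz' &&
    tar -xzf dataset.tar.gz &&
    python '$__tool_directory__/analyse.py' --data ./release-1.0 --output '$output'
]]></command>
```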

> * the approach of a data-manager tool. It seems to me the way to go, though it's much more complex overall: it requires additional components, and it requires some dummy hidden parameter to refer to the data (am I right?). The biggest concern is that it requires some (manual?) admin intervention to actually run the data-fetching tool.

Not sure I understand how you would like to use DM with tool tests. Or is this here about some reference/model data?

DMs are used to populate location (.loc) files automatically, so that an admin has less work. Let's forget about DMs for the moment and just talk about .loc files.

A location file is a simple tabular file that a tool can read and populate a select box. We use this to distribute large reference/models to all users of a Galaxy instance. So that those models don't need to be downloaded by every user. It also makes it easy to discover those models etc...

In tool tests you can also test location files... however, here we also usually recommend using tiny versions of the models etc.
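To make this concrete, the pieces might fit together as in the sketch below (the table name, columns, and paths are invented for illustration):

```xml
<!-- tool_data_table_conf.xml: declare a data table backed by a .loc file -->
<table name="astro_models" comment_char="#">
    <columns>value, name, path</columns>
    <file path="tool-data/astro_models.loc" />
</table>

<!-- tool-data/astro_models.loc is plain tab-separated text, e.g.:
     filters_v1    Telescope filter parameters v1    /data/astro_models/filters_v1 -->

<!-- in the tool XML, a select box is populated from that table; the
     chosen path is then available in the command as ${models.fields.path} -->
<param name="models" type="select" label="Model data">
    <options from_data_table="astro_models"/>
</param>
```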

> * something similar to how the remote test data works, e.g. a hidden data parameter which always refers to the remote dataset; Galaxy then takes care of staging it for the job. Using a deferred dataset, but in some way that doesn't bother the user to create it? Overall, it's not clear whether this is even achievable.

I assume I'm missing something here. You can use remote-data for all data params. Why would you need a hidden param?

> This is more of an extended note for myself, but I appreciate any comments and thoughts...

> Also, regardless of the implementation, if this small dataset has some kind of persistent identifier (DOI, etc.), it needs to be expressed somewhere.

Galaxy does support some special protocol-schemas: https://github.com/galaxyproject/galaxy/blob/06eada1c2ce5694d92aaa0bbf312258ff66398e2/client/src/utils/upload-payload.js#L2

I guess we could add a DOI resolver; that would be useful for many use cases, I think. The catch is... if the DOI points to some tarball and not a single file, I don't know what Galaxy should do with that.

dsavchenko commented 7 months ago

> Not sure I understand how you would like to use DM with tool tests. Or is this here about some reference/model data?

Yes, all of this second part is not about the test data (I think we are happy enough with the remote-test-data approach, or with putting test data inside the tool if it's small). This is about the data that's actually needed to run a tool. I wasn't clear enough about this, sorry.

Some of it is model or reference data in your terms; some is a real observational dataset. But in both cases at hand, we don't expect this data to vary. The model data is fixed for a given tool release, and the observational data is sometimes a comparably small (~100 MB -- 1 GB) data release in the form of one archive, used as a whole. Given that both are static, there is no need for the user to select what to use. I'm looking for the best way to handle such a situation without statically including the data in the tool release (which could only be an option for some small model data).

> A location file is a simple tabular file that a tool can read and populate a select box. We use this to distribute large reference/models to all users of a Galaxy instance. So that those models don't need to be downloaded by every user. It also makes it easy to discover those models etc...

That's why I started to look into it: the same data is used by all users of the specific tool. The difference is that there isn't really a need for any select box, which is why I thought of a hidden parameter.

All this is the opposite of the use-cases where we have a "big data" problem.

bgruening commented 7 months ago

Oh interesting use-case. How do you distribute those data currently (to the user)?

What we can do, without any changes to Galaxy, is have a hidden select box that is linked to a location file. The location file has only one entry; the hidden select box has only one entry, which is always taken into account.

(The admin can set up this tool by modifying this location file. This step can then be optimized by using a DM, or simply by distributing your data via CVMFS (this is how we distribute our reference data).)
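A hedged sketch of that single-entry mechanism (the table name, file paths, and script are assumptions; whether the select box is actually hidden, as suggested above, or simply left visible with its one entry, is a detail to check against the Galaxy tool XSD):

```xml
<!-- the .loc file behind "hess_dr1" would contain exactly one row, e.g.:
     dr1    HESS public data release 1    /data/hess/dr1 -->
<param name="ref_data" type="select" label="Reference data">
    <options from_data_table="hess_dr1"/>
</param>

<!-- the single entry is then always used in the command: -->
<command><![CDATA[
    python '$__tool_directory__/run.py' --data '${ref_data.fields.path}' --output '$output'
]]></command>
```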

dsavchenko commented 7 months ago

Yes, thanks, that's exactly the mechanism I imagined when talking about DMs. As for the observational data: in the use-cases I have in mind it's not our data; it's just available for download via HTTPS in the form of an archive.

But does it require some action from an admin if we add such a tool to the usegalaxy.eu instance? Or is there a CD action to fetch the data / run the DM tool?

bgruening commented 7 months ago

Yes, we have this: https://github.com/galaxyproject/idc

But for you, this might be overkill. Most of our reference data needs more than a download; we actually need to build indices on genomes and so on. If you just need a download, maybe we can simply host all of your data on CVMFS, the admins mount this data in, and we're done.