qiime2 / galaxy-tools

Official QIIME 2 tools for Galaxy
BSD 3-Clause "New" or "Revised" License

Use specific data formats #36

Open bernt-matthias opened 1 year ago

bernt-matthias commented 1 year ago

I just started to explore the QIIME 2 Galaxy tools. Starting, naturally, with the import tool, I noticed that the unspecific format="data" is often used, e.g.

https://github.com/qiime2/galaxy-tools/blob/4456c16e2ebebbf1c18b23be1f2b794be560b7d5/tools/suite_qiime2_core__tools/qiime2_core__tools__import.xml#L604

This should be avoided, in particular if corresponding datatypes exist in Galaxy. In this specific example, format="fastq.gz" seems appropriate, but there are also fastqsanger.gz and fastqillumina.gz if a specific Phred encoding is required.

ebolyen commented 1 year ago

To accomplish this, we would need some kind of mapping from our formats to Galaxy formats. And since both of these frameworks allow extension, we'll probably always need "data" as an escape hatch. That said, there may be room to use EDAM to figure out mappings where they exist. I had imagined something along those lines a long time ago, but we've never really gotten around to it.
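A minimal sketch of such a mapping, assuming EDAM format identifiers serve as the shared key and "data" remains the fallback. The table entries and the function name `galaxy_format_for` are illustrative assumptions, not an agreed mapping or an existing API in either framework.

```python
# Hypothetical lookup tables; the entries are illustrative only.
QIIME2_TO_EDAM = {
    "FastqGzFormat": "format_1930",   # assumed EDAM id for FASTQ
    "DNAFASTAFormat": "format_1929",  # assumed EDAM id for FASTA
}

EDAM_TO_GALAXY = {
    "format_1930": "fastq.gz",
    "format_1929": "fasta",
}

# The unspecific Galaxy format, kept as the escape hatch.
FALLBACK = "data"


def galaxy_format_for(qiime2_format: str) -> str:
    """Return a Galaxy format for a QIIME 2 format name, or "data" if unmapped."""
    edam_id = QIIME2_TO_EDAM.get(qiime2_format)
    return EDAM_TO_GALAXY.get(edam_id, FALLBACK)


print(galaxy_format_for("FastqGzFormat"))     # mapped via EDAM
print(galaxy_format_for("SomePluginFormat"))  # falls back to "data"
```

One EDAM identifier may correspond to several Galaxy datatypes (for example compressed and uncompressed variants), so a real mapping would presumably need additional rules beyond this one-to-one lookup.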

From there it ought to be possible to observe that a Galaxy collection containing only elements of a certain type is compatible with our file collection, and thus to constrain the collections available for import. (Do Galaxy collections have an observable format?)

bernt-matthias commented 1 year ago

It seems that EDAM annotations are present for the Galaxy datatypes. Is there a list of QIIME 2 datatypes somewhere, maybe with EDAM annotations?

In general, collections can contain datasets of different types. On the tool side, one can use the format attribute of param for data_collection inputs as well, but I'm not sure whether this checks all elements or only the first. We could check, and work on solutions to change this if necessary. Maybe an additional validator could also be used. Or one simply documents that users are required to use only uniform collections.
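The uniformity requirement can be sketched independently of Galaxy as a simple check over the elements' extensions. The function name `is_uniform` and the list-of-extensions input are assumptions for illustration; this is not how Galaxy validates collections, and it uses exact string matching rather than Galaxy's datatype subclass relationships.

```python
from typing import Iterable


def is_uniform(element_extensions: Iterable[str], required: str) -> bool:
    """Return True if the collection is non-empty and every element has the required extension."""
    extensions = list(element_extensions)
    # An empty collection is treated as non-uniform (a choice made for this sketch).
    return bool(extensions) and all(ext == required for ext in extensions)


# Hypothetical collections described only by their element extensions.
print(is_uniform(["fastq.gz", "fastq.gz"], "fastq.gz"))
print(is_uniform(["fastq.gz", "fasta"], "fastq.gz"))
```

Checking every element rather than only the first is the behaviour a validator would need in order to reject mixed collections.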

I find the discussion on automatically generated tool wrappers quite enlightening, because it often sheds light on shortcomings of Galaxy (or its tool framework).

As a further comment on collections: they are a nice way to generate parallelism.

bernt-matthias commented 3 months ago

ping @ebolyen .. seems that we had the same ideas already earlier :)

As just discussed: I will try to produce a figure (or a hierarchical data file, e.g. YAML/JSON) of Galaxy's datatype hierarchy annotated with the edam_format and edam_data entries. Then we can think about a mapping, maybe with some help from @matuskalas.

bernt-matthias commented 3 months ago

Created a little script over here. A few datatypes derive from more than one class, so I used only the first base class in the MRO.

Result can be found here.

If needed you can probably run it using

export PYTHONPATH=$(pwd)/lib/
python hierarchy.py

Some additional Python modules are probably needed.
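The approach described above (one parent per datatype, taken as the first base class in the MRO) could look roughly like the following sketch. It is not the linked script: the classes are toy stand-ins rather than Galaxy's datatype classes, the EDAM values are placeholders, and `all_subclasses` and `build_tree` are hypothetical helper names.

```python
import json


# Toy stand-ins for datatype classes; names and EDAM values are placeholders.
class Data:
    edam_format = "format_root"
    edam_data = "data_root"


class Text(Data):
    edam_format = "format_text"


class Binary(Data):
    edam_format = "format_binary"


class Tabular(Text):
    edam_data = "data_table"


class Hybrid(Binary, Text):  # derives from more than one class
    pass


def all_subclasses(cls):
    """Yield every direct and indirect subclass of cls exactly once."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for sub in current.__subclasses__():
            if sub not in seen:
                seen.add(sub)
                stack.append(sub)
                yield sub


def build_tree(root):
    """Build a nested dict of the hierarchy, one parent per class."""
    children = {}
    for cls in all_subclasses(root):
        # __mro__[0] is the class itself, so __mro__[1] is its first base class.
        parent = cls.__mro__[1]
        children.setdefault(parent, []).append(cls)

    def node(cls):
        kids = sorted(children.get(cls, []), key=lambda c: c.__name__)
        return {
            "name": cls.__name__,
            "edam_format": getattr(cls, "edam_format", None),
            "edam_data": getattr(cls, "edam_data", None),
            "children": [node(c) for c in kids],
        }

    return node(root)


tree = build_tree(Data)
print(json.dumps(tree, indent=2))
```

In this sketch `Hybrid` is listed under `Binary` only, which is the simplification mentioned above. A class whose first base lies outside the root's subtree (for example a mixin that does not derive from `Data`) would be dropped from the output, so a real script would need to handle that case.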