qiime2 / galaxy-tools

Official QIIME 2 tools for Galaxy
BSD 3-Clause "New" or "Revised" License

Reorganize/improve import tool #37

Open bernt-matthias opened 2 years ago

bernt-matthias commented 2 years ago

I have a hard time figuring out how to import data into qiime2 tools using the import tool. I guess the most frequently used data is demultiplexed fastq.gz (possibly with a sample data tsv file), e.g. https://data.qiime2.org/2022.8/tutorials/importing/casava-18-single-end-demultiplexed.zip. I failed to find the corresponding option in the import tool.

To get me started with exploring downstream tools it would be nice if someone could tell me for now how I could import data like the above (is there already a Galaxy specific tutorial that I did not notice so far?).

I guess the main problem is that the mapping between Galaxy concepts and qiime2 concepts needs a bit of improvement (e.g. Galaxy data types and collection types are not used yet). But it's probably also because I'm inexperienced with qiime2. At the moment I'm just guessing that the goal of the import is to create a single qza dataset from all fastq files? I'm also missing information in the help (like the definition of what a manifest is).

Since the tool is auto generated I'm unsure if this is easily possible. An alternative would be to handcraft an import tool covering the most frequently used types of input data that has a tight integration of the Galaxy concepts.

I imagine a tool that takes as input either

with format fastq.gz plus (in addition simple data inputs with multiple="true" might be useful [because some users don't seem to like collections for some reason])

The tool then automatically knows about the phred encoding due to the specific Galaxy fastq.gz sub-datatypes.

ebolyen commented 2 years ago

Hey @bernt-matthias!

Yeah there's definitely some mild impedance here, this section of our tutorial should go over the "easy" way to do this: https://docs.qiime2.org/jupyterbooks/cancer-microbiome-intervention-tutorial/020-tutorial-upstream/030-importing.html

But generally speaking, QIIME 2 doesn't have a notion of "collections" per se. Instead we are indeed trying to place all of those fastq.gz files into a single QZA (we've found this to be pretty user-friendly). To get the data into that QZA, we expect a Galaxy collection and then use a regular expression on the element IDs to figure out which is forward vs. reverse. This is the same regular expression that we use to validate that the user has given us a directory containing the appropriate files (we're quite file-oriented).

There's really no equivalent concept of paired data in QIIME 2, as it's all defined by the format, which is expecting some directory structure. Instead we rely on the semantic type to indicate paired-ness, since many tools will use the default Casava layout. In principle, you should be able to upload a directory of raw reads from the sequencing instrument and place them in a collection (not paired, just a boring collection) and then probably add the file-extension of .fastq.gz if the upload stripped the file extension already. From there we go through some real pain to find the element IDs and reconstruct a temporary directory of the right shape for import to QZA.
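The directory reconstruction described above could be sketched roughly as follows. This is not the actual q2galaxy code: the function name `stage_collection` and the `elements` mapping (element identifier to dataset path) are hypothetical, and the pattern is the one quoted in the CasavaOneEightSingleLanePerSampleDirFmt error message later in this thread, with capture groups added for illustration.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Pattern from the CasavaOneEightSingleLanePerSampleDirFmt error message in this
# thread; the two capture groups (sample prefix, read number) are added here.
CASAVA_SINGLE_LANE = re.compile(
    r'(.+)_.+_L[0-9][0-9][0-9]_R([12])_001\.fastq\.gz')


def stage_collection(elements, pattern=CASAVA_SINGLE_LANE):
    """Copy collection elements into a temporary directory shaped like the format.

    `elements` is a hypothetical dict mapping each element identifier to the
    path of the underlying dataset file (e.g. a Galaxy-managed .dat file).
    Returns the staging directory and a dict of identifier -> read direction.
    """
    staging = Path(tempfile.mkdtemp(prefix='import-sketch-'))
    directions = {}
    for identifier, source in elements.items():
        match = pattern.fullmatch(identifier)
        if match is None:
            # Mirrors the "Unrecognized file" failure mode seen in this thread.
            raise ValueError(f'Unrecognized file ({identifier})')
        directions[identifier] = 'forward' if match.group(2) == '1' else 'reverse'
        # The element identifier becomes the file name inside the directory.
        shutil.copy(source, staging / identifier)
    return staging, directions
```

The point of the sketch is that the element identifier, not the name of the file on disk, determines whether a dataset is accepted and which read direction it is assigned.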

ebolyen commented 2 years ago

Also I should mention that the Manifest style formats you mention for this particular type were a hack for importing which can basically never work in Galaxy, as they expect real filepaths to exist.

I have a rather informal proposal for modifying directory formats to better suit Galaxy as well. Perhaps there is a way to indicate paired-ness in this realm, which we could then automatically map to Galaxy's paired collections.

bernt-matthias commented 2 years ago

Thanks for the clarifications and in particular for the link.

bernt-matthias commented 2 years ago

Hi @ebolyen is there some documentation on the expected file names for the different input types (which might be added to the Galaxy tool help)?

I am (or rather, a colleague is) currently struggling to import data. I'm using Type of data to import: SampleData[PairedEndSequencesWithQuality]

With QIIME 2 file format to import from: CasavaOneEightLanelessPerSampleDirFmt

Unexpected error importing data:
Unrecognized file (/work/songalax/galaxy-dev/database/jobs_directory/020/20999/working/q2galaxy-importb4s4e8x1/metadata-hs-t1.txt) for CasavaOneEightLanelessPerSampleDirFmt.

With QIIME 2 file format to import from: CasavaOneEightSingleLanePerSampleDirFmt

Unexpected error importing data:
Missing one or more files for CasavaOneEightSingleLanePerSampleDirFmt: '.+_.+_L[0-9][0-9][0-9]_R[12]_001\\.fastq\\.gz'

The latter is kind of clear from the error message since the regex does not match our file names: ids.txt

Could you give us some advice which import format we should choose, or if we should rename our data?

ebolyen commented 2 years ago

Hey @bernt-matthias,

Sorry for not getting back to you. For user-support the forum is much more closely observed.

Regarding the error. Yeah that's definitely an unhelpful error. Your IDs look ok, although I see

qiime2 metadata tabulate on data 142: visualization.qzv
metadata-hs-t1.txt

in your list, which I presume isn't actually in the collection.

I would try setting the append an extension option to fastq.gz if the IDs in your collection are something like:

29_4_S83_R1_001

as QIIME2 is trying to match the entire collection element identifier to the directory regex.
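A quick way to see why the extension matters is to test identifiers against the patterns directly. In the sketch below the single-lane regex is copied from the error message earlier in this thread; the laneless regex is an assumption (the same pattern without the lane field) and is not confirmed here. The helper `matching_formats` is hypothetical, and it uses `fullmatch` on the assumption that the whole identifier has to match, as described above.

```python
import re

# Copied from the CasavaOneEightSingleLanePerSampleDirFmt error message above.
SINGLE_LANE = re.compile(r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz')
# Assumed laneless counterpart (no lane field); not confirmed in this thread.
LANELESS = re.compile(r'.+_.+_R[12]_001\.fastq\.gz')


def matching_formats(identifier, extension=''):
    """Return labels of the patterns that the whole identifier (+ extension) matches."""
    name = identifier + extension
    patterns = (('single-lane', SINGLE_LANE), ('laneless', LANELESS))
    return [label for label, pattern in patterns if pattern.fullmatch(name)]


# Without an extension the identifier matches neither pattern.
print(matching_formats('29_4_S83_R1_001'))
# With the extension appended, only the (assumed) laneless pattern matches.
print(matching_formats('29_4_S83_R1_001', '.fastq.gz'))
# A stray metadata file in the collection matches nothing.
print(matching_formats('metadata-hs-t1.txt'))
```

Under these assumptions, an identifier such as `29_4_S83_R1_001` would need both the appended extension and a laneless format, since it has no `L001`-style lane field.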

bernt-matthias commented 1 year ago

> Sorry for not getting back to you.

No worries :)

> For user-support the forum is much more closely observed.

Wondering if you want to add a link to the forum to the tool's help section?

> Your IDs look ok, although I see

Oh, yes. That is probably it.

bernt-matthias commented 1 year ago

I just read this again:

> Also I should mention that the Manifest style formats you mention for this particular type were a hack for importing which can basically never work in Galaxy, as they expect real filepaths to exist.

Would a relative path work?

ebolyen commented 1 year ago

Unfortunately no, you would need to have an absolute path to the /some/galaxy/managed/path/001.dat file which you happen to know is a fastq.gz. If you can predict those paths then it would work... presuming the data was in fact on the same host as the job, which is also not likely to be true.

I'm working on something right now that may clean this up, but no particular ETA. Until then, using the directory formats is your best bet as you have control over the element identifiers which can be made to match the expected relative path of the directory format (as tedious as that is).

bernt-matthias commented 1 year ago

> Unfortunately no, you would need to have an absolute path to the /some/galaxy/managed/path/001.dat file which you happen to know is a fastq.gz. If you can predict those paths then it would work... presuming the data was in fact on the same host as the job, which is also not likely to be true.

Indeed, this assumption does not hold in all Galaxy installations.

> I'm working on something right now that may clean this up, but no particular ETA.

+1

> Until then, using the directory formats is your best bet as you have control over the element identifiers which can be made to match the expected relative path of the directory format (as tedious as that is).

Thanks

bgruening commented 1 year ago

Hi guys! Now that we have the tools on EU we get this problem as well :)

Is there any workaround yet?

bernt-matthias commented 1 year ago

The workaround seems to be to not use the manifest for importing. For now we have to educate users to use the manifest (which is just a metadata table, right?) to construct a collection, and then use that collection for the import.
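As a rough illustration of that workaround, the sketch below derives Casava-style element identifiers from a paired-end manifest table. Everything here is an assumption made for illustration: the column names (`sample-id`, `forward-absolute-filepath`, `reverse-absolute-filepath`), the helper name `casava_identifiers`, and the placeholder `S<n>` and `L001` fields, which are only there so the names satisfy the single-lane regex quoted earlier in this thread.

```python
import csv
import io
from pathlib import PurePosixPath


def casava_identifiers(manifest_text):
    """Map each file named in a paired-end manifest to a Casava-style identifier.

    The manifest is assumed to be tab-separated with the columns 'sample-id',
    'forward-absolute-filepath' and 'reverse-absolute-filepath'.
    """
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter='\t')
    renames = {}
    for index, row in enumerate(reader, start=1):
        columns = ((1, 'forward-absolute-filepath'),
                   (2, 'reverse-absolute-filepath'))
        for read, column in columns:
            original = PurePosixPath(row[column]).name
            # S<index> and L001 are placeholders chosen to fit the regex.
            renames[original] = (
                f"{row['sample-id']}_S{index}_L001_R{read}_001.fastq.gz")
    return renames
```

The resulting mapping could then be used to rename the elements of a plain (unpaired) Galaxy collection before running the import tool with a Casava directory format.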