First working version of mzml2isa/nmrml2isa with isa-tab datatype

pcm32 commented 6 years ago

This PR introduces the use of isa-tab datatype as the main result format for mzml2isa. Currently only missing to add the raw data on the isa-tab archive created, I wonder if we should actually do this, as currently the result of the tool were only the I, S, A files. Adding data means we duplicate the data on the server (unless that we do some hard linking I guess).

@djcomlab @proccaserra @sneumann @pkrog @ilveroluca what do you think?

proccaserra commented 6 years ago

@pcm32 From the PhenoMeNal point of view, workflow execution would require the raw data or mz|nmr.xml files to be available. Duplicating these datafiles may not be that popular for hosting institutions. It depends if hosting site can serve data files separately from ISA metadata files and send them if / when requested as a whole or as a subset based on metadata filtering. @khaug @pcm32 @ilveroluca @RJMW what are your preferred options?

djcomlab commented 6 years ago

@pcm32 Well, I guess it depends how much space the filesystem can handle. The ISA-slicer that produces a new data collection only works by copying files across history items. Which I agree is not particularly efficient.

But as @proccaserra points out, workflow-wise we need the data files there. We don't yet have a tool that only downloads the data files separately from the ISA-Tabs, or even a tool that downloads only a subselection of data files.

pkrog commented 6 years ago

If hard linking is possible on a shared file system, I think this is a good solution with Galaxy. The files are never modified, only deleted, so it should work.

pcm32 commented 6 years ago

Ok, to keep the original functionality, I will add the raw data files addition as an option, then the user can still use mzml2isa as it was originally intended by its authors: a way of generating initial I, S, A files from mzML files, without having to download the raw data inside the ISA folder once created.

pcm32 commented 6 years ago

I will do the same for nmrml2isa, hopefully no one is on that.

pcm32 commented 6 years ago

Ok, I have added support to add the raw data, being optional and by default not enabled to keep the original tool behaviour. My only doubt at this point is that if the user creates the zip file with data with files within a directory, then this directory but not the raw data files are left in the root directory of the ISA archive. One way to get around this would be to run unzip -j apparently, which disregards all directory structure inside the zip file, and just gets the files all on the working directory. However, if the user has some relevant directory structure, it would be lost. What are the best practices in terms of ISA archive creation @djcomlab @proccaserra? Or should we just warn the user how the zip file needs to be created? or provide an option for this (to strip or not) the directory structure?

RJMW commented 6 years ago

@tomnl FYI

@pcm32 Can we add these changes to https://github.com/ISA-tools/mzml2isa-galaxy as well? This to avoid two different gallaxy wrappers for mzml2isa (and nmrml2isa).

pcm32 commented 6 years ago

Of course @RJMW! but I suggest that that happens once there is a merged pull request for Galaxy with the ISA datatype, otherwise this tool won't work on a non-phenomenal Galaxy setup.

pcm32 commented 6 years ago

Maybe we can leave a branch on mzml2isa-galaxy with the changes ready, and then they merge it when the datatype is generally available. I would love to hear @Tomnl's opinion on the raw files inside the ISA archive, and how we should manage the directory structure available inside the zip.

Tomnl commented 6 years ago

hi @pcm32, sorry just catching up now.

The 'include_raw_data_in_output' seams like a good option to have.

Regarding the input zip file:

Or should we just warn the user how the zip file needs to be created? or provide an option for this (to strip or not) the directory structure?

Either of these two options seem fine to me.

In case it is useful to know for reference:

When using a zip file as input, mzml2isa CLI finds all mzML files recursively in the zip file and uses all of the mzML files to create the ISA files.
When using a path to a non-compressed directory as input, mzml2isa CLI will NOT work recursively on the chosen folder.

Good work Pablo!

pcm32 commented 6 years ago

Ok, so basically what you're saying @Tomnl is that mzml2isa disregards the directory structure when using a zip, right? which means I can do the same on the unzip part by using -j. Sounds sorted to me then, thanks @Tomnl, this really simplifies things!!! We should probably clarify that to users though in the help part.

pcm32 commented 6 years ago

By disregard I meant really that mzml2isa CLI doesn't transfer the directory complexity into the ISA archive, but just quotes the mzML files as being in the root directory. Is this the case @Tomnl?

Tomnl commented 6 years ago

By disregard I meant really that mzml2isa CLI doesn't transfer the directory complexity into the ISA archive, but just quotes the mzML files as being in the root directory. Is this the case @Tomnl?

That's correct. Which might cause some problems I suppose if a specific directory structure is required, although my experience has been that all data files are in the same directory when using ISA file formats.

Another thing to consider is vendor instrument RAW files (e.g. thermo, waters etc). Are these required as well?

pcm32 commented 6 years ago

Excellent, thanks @Tomnl, I think that this is a reasonable working assumption. This is ready for reviewing.

pcm32 commented 6 years ago

Yes, on minikube and on openstack deployments!... and now on AWS as well!

pcm32 commented 6 years ago

Thanks for rev @djcomlab!

phnmnl / container-galaxy-k8s-runtime

First working version of mzml2isa/nmrml2isa with isa-tab datatype #145