Closed pcm32 closed 6 years ago
@pcm32 From the PhenoMeNal point of view, workflow execution would require the raw data or mz|nmr.xml files to be available. Duplicating these datafiles may not be that popular for hosting institutions. It depends if hosting site can serve data files separately from ISA metadata files and send them if / when requested as a whole or as a subset based on metadata filtering. @khaug @pcm32 @ilveroluca @RJMW what are your preferred options?
@pcm32 Well, I guess it depends how much space the filesystem can handle. The ISA-slicer that produces a new data collection only works by copying files across history items. Which I agree is not particularly efficient.
But as @proccaserra points out, workflow-wise we need the data files there. We don't yet have a tool that only downloads the data files separately from the ISA-Tabs, or even a tool that downloads only a subselection of data files.
If hard linking is possible on a shared file system, I think this is a good solution with Galaxy. The files are never modified, only deleted, so it should work.
Ok, to keep the original functionality, I will add the raw data files addition as an option, then the user can still use mzml2isa as it was originally intended by its authors: a way of generating initial I, S, A files from mzML files, without having to download the raw data inside the ISA folder once created.
I will do the same for nmrml2isa, hopefully no one is on that.
Ok, I have added support to add the raw data, being optional and by default not enabled to keep the original tool behaviour. My only doubt at this point is that if the user creates the zip file with data with files within a directory, then this directory but not the raw data files are left in the root directory of the ISA archive. One way to get around this would be to run unzip -j
apparently, which disregards all directory structure inside the zip file, and just gets the files all on the working directory. However, if the user has some relevant directory structure, it would be lost. What are the best practices in terms of ISA archive creation @djcomlab @proccaserra? Or should we just warn the user how the zip file needs to be created? or provide an option for this (to strip or not) the directory structure?
@tomnl FYI
@pcm32 Can we add these changes to https://github.com/ISA-tools/mzml2isa-galaxy as well? This to avoid two different gallaxy wrappers for mzml2isa (and nmrml2isa).
Of course @RJMW! but I suggest that that happens once there is a merged pull request for Galaxy with the ISA datatype, otherwise this tool won't work on a non-phenomenal Galaxy setup.
Maybe we can leave a branch on mzml2isa-galaxy with the changes ready, and then they merge it when the datatype is generally available. I would love to hear @Tomnl's opinion on the raw files inside the ISA archive, and how we should manage the directory structure available inside the zip.
hi @pcm32, sorry just catching up now.
The 'include_raw_data_in_output' seams like a good option to have.
Regarding the input zip file:
Or should we just warn the user how the zip file needs to be created? or provide an option for this (to strip or not) the directory structure?
Either of these two options seem fine to me.
In case it is useful to know for reference:
Good work Pablo!
Ok, so basically what you're saying @Tomnl is that mzml2isa disregards the directory structure when using a zip, right? which means I can do the same on the unzip part by using -j
. Sounds sorted to me then, thanks @Tomnl, this really simplifies things!!! We should probably clarify that to users though in the help part.
By disregard I meant really that mzml2isa CLI doesn't transfer the directory complexity into the ISA archive, but just quotes the mzML files as being in the root directory. Is this the case @Tomnl?
By disregard I meant really that mzml2isa CLI doesn't transfer the directory complexity into the ISA archive, but just quotes the mzML files as being in the root directory. Is this the case @Tomnl?
That's correct. Which might cause some problems I suppose if a specific directory structure is required, although my experience has been that all data files are in the same directory when using ISA file formats.
Another thing to consider is vendor instrument RAW files (e.g. thermo, waters etc). Are these required as well?
Excellent, thanks @Tomnl, I think that this is a reasonable working assumption. This is ready for reviewing.
Yes, on minikube and on openstack deployments!... and now on AWS as well!
Thanks for rev @djcomlab!
This PR introduces the use of isa-tab datatype as the main result format for mzml2isa. Currently only missing to add the raw data on the isa-tab archive created, I wonder if we should actually do this, as currently the result of the tool were only the I, S, A files. Adding data means we duplicate the data on the server (unless that we do some hard linking I guess).
@djcomlab @proccaserra @sneumann @pkrog @ilveroluca what do you think?