phenomecentre / nPYc-Toolbox

The nPYc-Toolbox defines objects for representing, and implements functions to manipulate and display, metabolic profiling datasets.
MIT License

added import for XCMSonline, MZmine, MS-DIAL and nPYc exported data #86

Closed misch91 closed 2 years ago

misch91 commented 2 years ago

Example datasets can be found here: https://drive.google.com/drive/folders/1ITl4pvgvDmTZOv0fytplMd4nXMVtAqqX?usp=sharing

XCMSonline
- By default, XCMSonline provides two result tables (.xlsx) as output: one annotated, one not. The annotated table contains extra columns at the end, named “isotopes”, “adduct” and “pcgroup”. To let the import function handle both tables, I adjust endIndex and the adduct metadata in an if-clause.
- Unfortunately, the generated xlsx files of untargeted datasets can be very large. In my experience they are usually 30-60 MB per file, and they can be larger still if the peak-picking settings are made more sensitive. This causes problems with the pandas.read_excel() function, which cannot read xlsx files in chunks and runs into memory problems. With my datasets it failed entirely and returned an empty dataframe. I tried all 4 engines without success. Browsing Stack Overflow suggests this is a general issue and that the easiest way out is to convert the original file to csv format. As a consequence, nPYc users must do the xlsx -> csv conversion manually beforehand (I added advice to the function info, but this definitely needs to be addressed in the documentation & tutorial). I know it is a small drawback, but I hope we can expect future users to manage this step.
- Another XCMSonline issue concerns the metadata file. Apparently, XCMSonline changes all Sample File Names to lower case during processing. It took me a while to realize this after my self-made metadata csv file failed to match using the addSampleInfo() function (which seems to be case-sensitive). I see two options here: a) the documentation/tutorial points this out, or b) we generalize it for all data types so that addSampleInfo() automatically applies a to_lower_case() function. What’s your opinion on this?
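
To make the two XCMSonline points above concrete, here is a minimal, library-agnostic sketch. It is not the code from this PR: the helper names `split_header` and `match_sample_names` are hypothetical, and the example column names other than “isotopes”, “adduct” and “pcgroup” are only illustrative.

```python
# Hypothetical helpers for illustration only; not part of the nPYc-Toolbox API.
ANNOTATION_COLUMNS = ("isotopes", "adduct", "pcgroup")


def split_header(columns):
    """Separate the optional XCMSonline annotation columns from the rest of the header."""
    annotation = [c for c in columns if c in ANNOTATION_COLUMNS]
    remaining = [c for c in columns if c not in ANNOTATION_COLUMNS]
    return remaining, annotation


def match_sample_names(table_names, metadata_names):
    """Map each result-table sample name to the metadata name that equals it ignoring case.

    Names without a case-insensitive match map to None.
    """
    lookup = {name.lower(): name for name in metadata_names}
    return {name: lookup.get(name.lower()) for name in table_names}


# Illustrative header of an annotated table (sample and feature columns are made up).
remaining, annotation = split_header(["mz", "rt", "sample_a", "isotopes", "adduct", "pcgroup"])
is_annotated = len(annotation) > 0

# XCMSonline lower-cases file names, so "qc_01" should still find "QC_01".
matches = match_sample_names(["qc_01", "blank_02"], ["QC_01", "Study_03"])
```

Matching on a lower-cased key while keeping the original spelling as the value leaves the user's metadata untouched, which may be preferable to rewriting the names in place.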

MS-DIAL
- The software offers many options to choose from when exporting data (similar to Progenesis QI); the “Raw peak area” option is the one we should aim for (Raw peak height might also be interesting for some users). The output txt file comes with a lot of information (see example files), ranging from statistical parameters and identifications to MS/MS peak info. The good thing for us is that these columns are consistent even if no MS/MS data was acquired. I applied Occam’s razor and picked only the RT, m/z and Area (or height) values.
- Interestingly, file type (nPYc equivalent: AssayRole/Sample Type), Injection Order (Run Order) and Batch ID (Correction Batch) are also included in the export if the user supplied them beforehand. I tried to implement extraction of these metadata as well, but I honestly have no idea how to check whether the code block works (please give it a try). For some users this may already be enough, so I already set “self.sampleMetadata['Metadata Available'] = True”; others may need to provide more metadata, for example for dilution. I expect that the remaining metadata can be added later with the usual addSampleInfo() function. Would existing metadata be overwritten in that case?
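
The “pick only RT, m/z and Area” step described above could look roughly like the following sketch. It uses only the standard library; the function name `select_columns` and the column labels in the example are assumptions for illustration, not the real MS-DIAL header or the code in this PR.

```python
import csv
import io


def select_columns(text, keep, delimiter="\t"):
    """Return only the requested columns from a delimited text table.

    Raises KeyError if a requested column is absent, so a changed export
    layout is noticed immediately instead of silently producing gaps.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [name for name in keep if name not in header]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return [{name: row[name] for name in keep} for row in reader]


# Made-up miniature export; real MS-DIAL files contain many more columns.
example = "RT\tm/z\tArea\tComment\n1.5\t100.1\t2000\tx\n2.5\t200.2\t3000\ty\n"
rows = select_columns(example, ["RT", "m/z", "Area"])
```

Values come back as strings here; converting them to float would be a separate step.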

MZmine
- MZmine output looks a bit confusing at first sight (see example file) but can easily be filtered down to what we need. Peak width, RTmin/max and MZmin/max are provided, but separately for each sample, so I take their mean across samples. Because gap filling was applied, there was a value in every case; I do not know what happens without gap filling. I assume MZmine outputs “null” in that case, and I check for this manually. Values are imported directly as float.
- Btw: how does XCMS calculate RTmin and RTmax? Is it the true minimum/maximum RT measured in any sample, or a mean/median of all RT min/max values?
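
The per-sample averaging described for MZmine can be sketched as below. This is an illustration rather than the PR's implementation: `mean_ignoring_null` is a hypothetical name, and treating the string “null” as the missing-value marker follows the assumption stated above, which has not been verified against MZmine output without gap filling.

```python
from statistics import mean


def mean_ignoring_null(values, null_token="null"):
    """Average per-sample values for one feature, skipping missing entries.

    Returns None when no sample provides a usable value.
    """
    numeric = [float(v) for v in values if v not in (None, "", null_token)]
    return mean(numeric) if numeric else None


# One feature's RT minimum as reported for three samples (made-up numbers).
rt_min = mean_ignoring_null(["1.0", "3.0", "null"])   # 2.0
no_data = mean_ignoring_null(["null", "null"])        # None
```

Skipping missing entries instead of counting them as zero keeps a feature's mean from being pulled down by samples in which the peak was not detected.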

nPYc reimport
- I tried to avoid hardcoding as much as possible here, because the export csv style strongly depends on the software that was used to preprocess the dataset in the first place. The featureMetadata block therefore contains many ifs. As for the sampleMetadata block, I am not 100% sure it is useful for the other modules. Please have a look.