kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Datasets collection for pyOpenMS files #1544

Closed Kastakin closed 1 year ago

Kastakin commented 2 years ago

Description

A series of DataSets to work with LC-MS and GC-MS data using the pyOpenMS library.

Context

In the world of biology-related analysis, omics represents a very important field in which ML and data science play a crucial role. Much of the data pre-processing and subsequent analysis can be carried out with different tools, among which is OpenMS, an open-source set of algorithms and routines that can be accessed from Python thanks to the pyOpenMS bindings.

I started using Kedro to standardize the data pipelines in our laboratories, and I've created some custom datasets to interface with the files used to store the mass spectrometry data (mainly the universally adopted .mzML format and the OpenMS-specific .featureXML).

Possible Implementation

pyOpenMS provides classes to load and store files on disk. I had success wrapping these with fsspec as suggested in the docs, and I've checked for thread safety, PartitionedDataSet/IncrementalDataSet compatibility, and versioning.

Main Challenges

I would like to have a shot at implementing this dataset type if it's something that could be useful for the project. The main issue is that pyOpenMS, as far as I know, only allows loading and writing files on the local disk via a path string. Is it correct to raise a DataSetError if the protocol is different from file?
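For illustration, the guard being asked about could look something like the sketch below. This is a hypothetical simplification, not Kedro's actual helper (Kedro has its own protocol parsing in `kedro.io.core`); the `check_protocol` name and the stdlib `urlsplit`-based parsing are assumptions for the example:

```python
from urllib.parse import urlsplit


class DataSetError(Exception):
    """Stand-in for kedro.io.core.DataSetError (assumed for this sketch)."""


def check_protocol(filepath: str) -> str:
    """Parse the fsspec-style protocol from a filepath and reject anything
    that is not a local file, as a local-only pyOpenMS dataset might do."""
    protocol = urlsplit(filepath).scheme or "file"
    if protocol != "file":
        raise DataSetError(
            f"Protocol '{protocol}' is not supported: pyOpenMS can only "
            "read and write local path strings."
        )
    return protocol
```

With this, `check_protocol("data/01_raw/run.mzML")` passes, while an `s3://` path raises `DataSetError` at dataset construction time rather than failing later inside pyOpenMS.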

I will try to test my code properly, but I'm a newcomer in this regard, so I'm sorry if I'll need some assistance down the road.

noklam commented 2 years ago

Contributions are very welcome! Please speak up if you need some help.

antonymilne commented 2 years ago

This is really cool - always interesting to hear how Kedro is being used in different fields! Just regarding your question:

The main issue is that pyOpenMS, as far as I know, only allows loading and writing files on the local disk via a path string. Is it correct to raise a DataSetError if the protocol is different from file?

What you said about using fsspec to wrap the pyOpenMS methods sounds like the right approach here. We do this for several datasets, e.g. https://github.com/kedro-org/kedro/blob/main/kedro/extras/datasets/json/json_dataset.py#L132-L142. So it doesn't matter if pyOpenMS natively supports reading/writing on e.g. s3.
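One way to make a local-path-only reader work with any fsspec protocol is to stream the remote bytes to a temporary local file first, then hand that local path to the reader. A minimal sketch of that idea, where `fs` and `loader` are placeholders for an fsspec filesystem and a pyOpenMS reader (e.g. `MzMLFile().load`), and the function name is invented for this example, not Kedro's actual implementation:

```python
import shutil
import tempfile
from pathlib import Path


def load_via_local_copy(fs, remote_path: str, loader):
    """Copy a (possibly remote) file to a local temp path, then call a
    loader that only accepts local path strings.

    `fs` is assumed to expose an fsspec-style open(path, mode) method.
    """
    suffix = Path(remote_path).suffix  # keep .mzML etc. for format detection
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        with fs.open(remote_path, "rb") as src:
            shutil.copyfileobj(src, tmp)
        tmp_path = tmp.name
    try:
        return loader(tmp_path)
    finally:
        Path(tmp_path).unlink()  # clean up the temporary copy
```

A symmetric `save` would write to a temp file with the pyOpenMS store method and then upload it through `fs.open(remote_path, "wb")`.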

Kastakin commented 1 year ago

I fell behind on this due to being busy with work... I'll try to get back on track for Hacktober!

noklam commented 1 year ago

@Kastakin Great to see you back!

Kastakin commented 1 year ago

I am working on an implementation for this dataset; unfortunately, the pyOpenMS library is lagging behind with its releases. The version currently available on PyPI does not support Python 3.10. I have it working in my project, but I had to resort to downloading the nightly build of the library from their CI/CD GitHub Actions with a script.

I think it might be a better idea to put this issue on hold and to wait for the proper release of the next version that should include Python 3.10 support.

AhdraMeraliQB commented 1 year ago

Closing this issue as it looks like pyopenms is still not available for 3.10 on PyPI, but we'd still love to have your contribution once that becomes available. Feel free to migrate this issue over to kedro-datasets in the kedro-plugins repository, the new home for our datasets.