Define an API - Githubissues

fdiblen commented 4 years ago

We need to identify the project-specific parts and the generic parts of the software.

Separate generic code from specific code.
Separate data handling from algorithmic code.
Discuss with Florian first, before the actual refactoring (i.e. separating generic and specific code).

Specific here means specific to either mass spectra or genes. It should be entered into a library.

jspaaks commented 4 years ago

API: what does a user want to be able to do vs. what can we already do?

import mass spectrometry data
- from local file
- from remote libraries of reference spectra
harmonize mass spectrometry data
- add missing SMILES
- add missing InChI key
- align undefined values: "N/A", "n/a", "", null, etc
- ...
convert spectra to sentences: each peak becomes a word like peak_103.44
search (based on similarity?)
- table lookup, where rows are known spectra, columns are similarity measures; table elements express similarity of a given spectrum to a reference spectrum according to a given similarity measure
calculate `let similarity = (spectrum, reference_spectrum) => { }`
- classical way 1: structural similarity
- classical way 2: Tanimoto (this is the default type for structural similarity)
- classical way 3: cosine
- classical way 4: modified cosine
- new way: based on word2vec-like analysis (~~separation distance in N-space?~~ cosine between spectrum vectors)
generate plots elucidating
- the inner workings of the library
- results generated with the library
train model
output external data
Anything to be added by Florian
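
To make two of the items above concrete (aligning undefined values and converting peaks to words), here is a minimal Python sketch. The names `UNDEFINED_VALUES`, `harmonize_value` and `spectrum_to_sentence`, the two-decimal rounding, and the example metadata are assumptions made up for illustration; none of them is an existing function of the library.

```python
from typing import Optional

# Assumption: the placeholder strings treated as "undefined". The list above
# ends with "etc", so the real set is probably larger.
UNDEFINED_VALUES = {"n/a", ""}


def harmonize_value(value: Optional[str]) -> Optional[str]:
    """Map every flavour of 'undefined' to None; strip whitespace otherwise."""
    if value is None:
        return None
    cleaned = value.strip()
    if cleaned.lower() in UNDEFINED_VALUES:
        return None
    return cleaned


def spectrum_to_sentence(mz_values, n_decimals=2):
    """Turn each peak position (m/z) into a word such as 'peak_103.44'."""
    return [f"peak_{mz:.{n_decimals}f}" for mz in mz_values]


# Made-up example record, only to show the calls.
metadata = {"smiles": "N/A", "inchikey": "", "name": " caffeine "}
harmonized = {key: harmonize_value(val) for key, val in metadata.items()}
sentence = spectrum_to_sentence([103.44, 210.0, 57.123])

print(harmonized)
print(sentence)
```

In this sketch peak intensities are ignored and only the peak positions end up in the words; whether and how intensities should be used is left open here.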

HannoSpreeuw commented 4 years ago

I would go for a slightly different description of the general purpose of the library: it is not about discovering data, but about classifying microbes, by comparing their mass spectra with mass spectra of known microbes. This is how you can identify a microbe.

There is no such thing as generated data from a single microbe in Spec2Vec. The input for Spec2Vec always consists of mass spectra from a known and an unknown microbe.

@florian-huber please correct me if I am wrong.

fdiblen commented 4 years ago

@HannoSpreeuw Let's focus on the API design in this issue and keep the description discussion for the team meeting.

jspaaks commented 4 years ago

I updated my summary above so it doesn't talk about the general purpose of the library anymore, so we can focus on the actual API

HannoSpreeuw commented 4 years ago

Essential for the API is to know if the input consists of one or two datasets (sequences). I will try to figure that out.

fdiblen commented 4 years ago

There is a flowchart made by @florian-huber: code_flowchart.pdf

CunliangGeng commented 4 years ago

> calculate `let similarity = (spectrum, reference_spectrum) => { }`
> - classical way 1: structural similarity
> - classical way 2: Tanimoto (this is the default type for structural similarity)
> - classical way 3: cosine
> - classical way 4: modified cosine
> - new way: based on word2vec-like analysis (~~separation distance in N-space?~~ cosine between spectrum vectors)

These functions are in two scripts: MS_similarity_classical.py and similarity_measure.py.

MS_similarity_classical.py is used for comparison with the new similarity measures. Is this script useful for other users? Should we keep it? Should we combine it with similarity_measure.py?

fdiblen commented 4 years ago

If they are identical, it makes sense to remove one of them and do import from the other.

CunliangGeng commented 4 years ago

> If they are identical, it makes sense to remove one of them and do import from the other.

No, they are totally different. MS_similarity_classical.py seems MS-specific and implements classical ways 1-4; similarity_measure.py should be generic and implements the new way.
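
To illustrate what "generic" could mean here, below is a minimal sketch of a cosine measure that only sees numeric vectors, plus the similarity lookup table from the summary earlier in this thread (rows are reference spectra, columns are similarity measures). The function names, the table layout as nested dicts, and the example vectors are hypothetical; this is not the content of similarity_measure.py.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(vector_a) != len(vector_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: all-zero vectors score 0
    return dot / (norm_a * norm_b)


def similarity_table(query, references, measures):
    """Build {reference name: {measure name: score}} for one query vector."""
    return {
        ref_name: {
            measure_name: measure(query, ref_vector)
            for measure_name, measure in measures.items()
        }
        for ref_name, ref_vector in references.items()
    }


# Made-up vectors standing in for spectrum vectors.
references = {"ref_1": [1.0, 0.0, 0.0], "ref_2": [1.0, 1.0, 0.0]}
measures = {"cosine": cosine_similarity}

table = similarity_table([1.0, 0.0, 0.0], references, measures)
best_match = max(table, key=lambda name: table[name]["cosine"])
print(table)
print(best_match)
```

Because the measures are passed in as plain functions, an MS-specific measure (for example modified cosine) could be added as another column without touching the generic code, assuming it accepts the same two arguments.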

jspaaks commented 4 years ago

With the new layout (https://github.com/matchms/matchms/pull/47), I think this is clear enough for the moment. Naturally, the API will see some smaller changes in the near future, but that's no reason to keep this issue open forever.

Refer to my comments

- https://github.com/matchms/matchms/issues/38#issuecomment-604899137
- https://github.com/matchms/matchms/issues/41#issue-588260184

to see why the module was laid out the way it is, and how users are supposed to use it.

This issue should be closed.

HannoSpreeuw commented 4 years ago

Don't you think we still need a document defining the API?

florian-huber commented 4 years ago

@HannoSpreeuw I added some comments on @jspaaks's suggestions for a new API to #41.

jspaaks commented 4 years ago

Re: https://github.com/matchms/matchms/issues/7#issuecomment-606640308: if we can think of a use case in which such a document would be the solution, yes; otherwise no.

Maybe we could simply update or expand __init__.py?

fdiblen commented 4 years ago

We discussed this during the meeting with Stefan and Hanno and decided to close it. The information in this issue was split into several issues.

matchms / matchms-backup

Define an API #7