levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

[Pitch] Include unimod with the release of pyteomics #82

Closed jspaezp closed 1 year ago

jspaezp commented 1 year ago

I have noticed that when using the unimod interface, the default is to download the database. I would like to propose to include the database with the release.

It would greatly improve the speed in using the functionality.

let me know what you think! best Sebastian

mobiusklein commented 1 year ago

Not that it is a bad thing to do (I do it in other libraries, specifically with Unimod and other controlled vocabularies in psims), but both the in-memory and SQLite3-based Unimod interfaces let you pass the Unimod type object at a path or file-like object containing the database's XML representation. If load-time is a concern, you can store a copy of the database locally and/or packaged with your other code.

In the case of the SQLite3-based interface, this is encouraged since it's all for the persistent on-disk representation.

levitsky commented 1 year ago

I'd like to hear opinions: do we need to better document the possibility of using a local copy of Unimod, or is it not convenient enough?

jspaezp commented 1 year ago

Hey there!

Going back to the docs to see what I missed I realized 2 main things.

I missed the documentation where passing a path is possible to the unimod class instantiation. (in my defense, it is really deep between a lot or other text) here: https://pyteomics.readthedocs.io/en/latest/api/unimod.html#pyteomics.mass.unimod.Unimod

On the other hand, I was using it as part of the proforma parser, which uses the UnimodResolver object, which internally loads an Unimod object that does not take any arguments. https://github.com/levitsky/pyteomics/blob/608ae01e111093b36f8ab22c8d0aefcb74afc01a/pyteomics/proforma.py#L401-L408

so ... Would I need to subclass the resolver, over-write the load_database method so the instance of the unimod object reads from disk? (this feels overly complicated).

On the other hand, I am having a hard time thinking of an use case where one would not want the database to be available in disk instead of downloading it in every instantiation.

So I guess, ... to summarize it ...

  1. what is the rationale behind having it downloaded instead of bundled with the python package?
  2. Docs could be improved by moving the documentation on the init(self, path) section of the Unimod class to a more prevalent location (maybe even add an example), lmk if you would like some help with that.
  3. the proforma modification resolver does not have that option of pointing to a disk location. (feature request??)

let me know what you think! as always thank you very much for the help and reply!

mobiusklein commented 1 year ago

Yes, the ProForma interface doesn't support overriding the location. It uses psims for everything but Unimod. The workaround currently would be to do something like this:

from pyteomics import mass, proforma

unimod = mass.Unimod("path/to/unimod.xml")
proforma.UnimodModification.resolver._database = unimod

If we used the psims implementation of Unimod, everything would route through psims.controlled_vocabulary.obo_cache which you could enable at the start of your script (which if start-up time is a concern and you're using the other ontologies, you might wish to do anyway).

To keep using mass.Unimod, we can add a proforma.set_unimod_path method which will do exactly what that snippet above does.

jspaezp commented 1 year ago

Ohhh I see!! Thank you very much for the code snippet of the workaround!

I do not believe I totally understand what yo mean with the psims implementation of unimod (I am still becoming familiar with the API).

Regardless, Thank you very much for the help so far! Best, Sebastian

levitsky commented 1 year ago

It looks to me like we need a documented method to instantiate the resolver with a local copy (item 3 from Sebastian's comment), as well as better documentation (item 2).

As for item 1, I'm not sure how much sense it makes to not ship a bundled copy of unimod, but I always tended to keep "strict" dependencies to a minimum, to allow the installation to be as slim as possible. This is obviously a matter of trade-off, and we can include a unimod XML file, but it doesn't mean we should discard the functionality allowing to specify an alternative location.

levitsky commented 1 year ago

Thanks to @mobiusklein, #85 implements the API for working with a local copy of Unimod when parsing ProForma: https://pyteomics.readthedocs.io/en/latest/api/proforma.html#cv-disk-caching

Personally I feel that this solves the core issue. @jspaezp what do you think?

jspaezp commented 1 year ago

Yes i believe it does! This is an amazing addition, thank you very much