mapping-commons / sssom-py

Python toolkit for SSSOM mapping format
https://mapping-commons.github.io/sssom-py/index.html#
MIT License
48 stars 10 forks source link

`.clean_prefix_map()`: `remove_unused_builtins` param #537

Open joeflack4 opened 2 months ago

joeflack4 commented 2 months ago

Overview

Given that built-in prefix maps are not required unless they appear somewhere in the mapping set, it would be nice to have a way to give users the ability to create an msdf and an SSSOM TSV that leaves these out.

caufieldjh commented 1 month ago

I would also like to have this option. Otherwise nearly every SSSOM TSV I create is more prefix map than anything else, by lines.

matentzn commented 1 month ago

@caufieldjh you must have small mapping sets then - there are only about 6 built in prefixes.. I do agree though with this issue, and happy to incorporate a parameter like this.

caufieldjh commented 1 month ago

I was observing the same effect @joeflack4 did - when instantiating a MappingSetDataFrame without a Converter, the included prefix map included all ~1.5K prefixes and the object takes some time to create. If I do it more like this, instantiation is instantaneous:

    metadata = get_default_metadata()
    metadata['curie_map'] = {}
    converter = Converter.from_prefix_map(metadata['curie_map'])
    msdf = MappingSetDataFrame(converter=converter, df=df, metadata=metadata)
matentzn commented 1 month ago

@caufieldjh this is a slightly different ask: you are saying that when you instantiate with just a converter, you would like by default to run "clean prefix map".

I think I can see that could be a good idea - but its not the same as the parameter being asked for here.

caufieldjh commented 1 month ago

I still see this as related - the call to clean_prefix_map doesn't have to be a default, as there are certainly cases in which it's useful to start with a prefix map including builtins, add/merge new mappings into the msdf, then do the final cleanup step to remove unused prefixes (working under the assumption that some of those prefixes are now present in the mappings, too).

joeflack4 commented 1 month ago

I guess it is slightly different but related. I think that actually what @caufieldjh is referring to is actually just a bug, because there are 1,547 prefixes included in the output when that happens.

In any case, just linking this related stuff here: