mapping-commons / sssom-py

Python toolkit for SSSOM mapping format
https://mapping-commons.github.io/sssom-py/index.html#
MIT License
48 stars 10 forks source link

`MappingSetDataFrame` issues when no `converter` passed #513

Open joeflack4 opened 3 months ago

joeflack4 commented 3 months ago

Overview

I just noticed several issues with that arise when using MappingSetDataFrame and not passing any Converter along with it.

Sub-issues

Sub-issue details

1. Performance

2. Incorrect curie_map

FYI:

2.1. Missing entries

In my curie_map in my metadata dictionary, I have some prefixes (e.g. icd11.foundation that do not show up in the output file.

2.1.1. Add test

Prepare test w/ input file(s) with one row & Python code used to invoke. Possibly another test substituting the Python code for a command, if that is possible.

2.2. Too many entries

Compare the following results.

  1. n=4. This is the amount of entries I want, and it is the amount that I get when using some custom code I wrote. ordo-icd11.sssom - joes ad hoc way.tsv.zip
  2. n=10. This is the amount of entries I get when first instantiating a Converter and passing that to MappingSetDataFrame. I also mentioned this in #514. Can be addressed by #537. See example: ordo-icd11.sssom - with converter.tsv.zip
  3. n=1,547. This is the amount of entries I get when instantiating a MappingSetDataFrame passing in my metadata but no Converter. See example: ordo-icd11.sssom - no converter.tsv.zip
2.2.1. Add test

3. UX: Should automatically instantiate Converter

I haven't fully thought through possible downsides of this, but I recommend that when MappingSetDataFrame is instantiated using only a df and metadata, but no converter, it should instantiate the converter via curies.Converter.from_prefix_map(metadata['curie_map']).

matentzn commented 3 months ago

Thank you for the issue, responses will be a bit piecemeal:

In my curie_map in my metadata dictionary, I have some prefixes (e.g. icd11.foundation that do not show up in the output file.

Is this after msdf.clean()? If so, is there at least 1 entity in the data frame that has that prefix?

n=10. This is the amount of entries I get when first instantiating a Converter and passing that to MappingSetDataFrame.

Are you sure the remaining 6 are not the build in sssom prefixes?

No converter: msdf = MappingSetDataFrame(df=df, metadata=metadata): Takes 33.206355 seconds

Wow - long time!

joeflack4 commented 3 months ago

i. Missing prefix

Joe:

In my curie_map in my metadata dictionary, I have some prefixes (e.g. icd11.foundation that do not show up in the output file.

Nico:

Is this after msdf.clean()? If so, is there at least 1 entity in the data frame that has that prefix?

This happens whether or not I call msdf.clean_prefix_map() or msdf.clean_context(). And yep, every row in the data frame has the prefix.

ii. Built-in SSSOM prefixes

Yep, they are. My preference is to not have them there, but I can understand why we'd want to include them; maybe it is better.

How about this. Can we consider adding a remove_unused_builtins param to msdf.clean_prefix_map() for people like me? Really low priority anyway.

matentzn commented 3 months ago

This happens whether or not I call msdf.clean_prefix_map() or msdf.clean_context(). And yep, every row in the data frame has the prefix.

i. Bug.. We need a minimal test for this. Can you prepare one (input file(s) with one row, command used to invoke)?

How about this. Can we consider adding a remove_unused_builtins param to msdf.clean_prefix_map() for people like me? Really low priority anyway.

ii. Can you make a Documentation request in https://github.com/mapping-commons/sssom/issues (model repo), asking for clarification if built-in prefixes are required to appear in the metadata or just recommended? If it ends up as recommended, I agree with you, we will add your flag.

joeflack4 commented 3 months ago

(i) Will do! FYI though, I'm not running a command. This is the Python API where this is a problem. Though perhaps this problem can also arise in the CLI.

(ii) Will do!

(iii) Instantiating Converter correctly In #514 you wrote that the Converter should already be being instantiated automatically by MappingSetDataFrame when it is not explicitly passed. Perhaps that is the case, but something appears to be wrong with that instantiation. My thought (sub-issue (3) in the OP) is that this should happen early on in the class's construction, and that it should do curies.Converter.from_prefix_map(metadata['curie_map']).

matentzn commented 3 months ago

Look at the default values: https://github.com/mapping-commons/sssom-py/blob/master/src/sssom/util.py#L96

By default, it is constructed with a converter! At least if I understand the dataclass syntax correctly.

joeflack4 commented 3 months ago

I don't know enough about dataclasses to be a huge help here, but...

We know that it is not instantiating and/or using the instantiated Converter correctly because:

  1. The results (e.g. serialized curie_map) are different when in one case you pass the Converter, and in another case you don't. Results in both (i) entries missing, and (ii) entries added. I updated the OP such that we can write a test for both of these cases.
  2. It takes longer to create the msdf when you don't pass the converter. Precisely 4,150,794.375 times longer in my 1 recent case.

https://github.com/mapping-commons/sssom-py/blob/01965f4f0018ef45bd3ececc0f9015fdeec4a849/src/sssom/context.py#L27-L30

I'd suggest possibly:

Add param:

curie_map: Dict = None

Update:

return curies.Converter.from_prefix_map(metadata['curie_map']) if curie_map \
    else curies.chain([_get_built_in_prefix_map(), _get_default_converter()])

Delete: @lru_cache(1) Though you could move else curies.chain([_get_built_in_prefix_map(), _get_default_converter()] into a separate function and cache that, but it should be pretty fast so I don't know that the cache is very useful, and IDK if it might create problems. Probably not, but if the performance gains are nominal, then I'd still suggest dropping it.

Then it looks like this needs to be updated to pass metadata['curie_map'] if 'curie_map' in metadata. IDK how but I'd imagine ChatGPT does.

matentzn commented 3 months ago

Unfortunately we dont have Harshad for a while now, so we will have to sort this ourselves.

I am not enthusiastic about any of your proposed solutions, and the problem space is too large for me to reason over in the short time I have per issue.

Do you agree that by far the most critical issue you mention here is:

In my curie_map in my metadata dictionary, I have some prefixes (e.g. icd11.foundation that do not show up in the output file.

? Can you add a test to https://github.com/mapping-commons/sssom-py/blob/master/tests/test_utils.py that demonstrates the problem?

joeflack4 commented 3 months ago

I understand this isn't something you can tackle deeply right now.

We could also do:

convs = [_get_built_in_prefix_map(), _get_default_converter()]
if curie_map:
    convs = [curies.Converter.from_prefix_map(metadata['curie_map'])] + convs
return curies.chain(convs)

Yep, adding a test is in the OP tasks / on my schedule.

joeflack4 commented 2 months ago

Nico wrote:

Can you make a Documentation request...

Done!

joeflack4 commented 2 months ago

Built-in SSSOM prefixes

@matentzn Given that these built-ins are not required, do you mind if I make an issue for this suggestion?:

Can we consider adding a remove_unused_builtins param to msdf.clean_prefix_map() for people like me? Really low priority anyway.

edit: Created the issue:

matentzn commented 2 months ago

Sure, feel free to make an issue, if the PR is super small, I we can release it fairly quickly.