Open joeflack4 opened 3 months ago
Thank you for the issue, responses will be a bit piecemeal:
In my curie_map in my metadata dictionary, I have some prefixes (e.g. icd11.foundation that do not show up in the output file.
Is this after msdf.clean()? If so, is there at least 1 entity in the data frame that has that prefix?
n=10. This is the amount of entries I get when first instantiating a Converter and passing that to MappingSetDataFrame.
Are you sure the remaining 6 are not the build in sssom prefixes?
No converter: msdf = MappingSetDataFrame(df=df, metadata=metadata): Takes 33.206355 seconds
Wow - long time!
Joe:
In my
curie_map
in mymetadata
dictionary, I have some prefixes (e.g.icd11.foundation
that do not show up in the output file.Nico:
Is this after msdf.clean()? If so, is there at least 1 entity in the data frame that has that prefix?
This happens whether or not I call msdf.clean_prefix_map()
or msdf.clean_context()
. And yep, every row in the data frame has the prefix.
Yep, they are. My preference is to not have them there, but I can understand why we'd want to include them; maybe it is better.
How about this. Can we consider adding a remove_unused_builtins
param to msdf.clean_prefix_map()
for people like me? Really low priority anyway.
This happens whether or not I call msdf.clean_prefix_map() or msdf.clean_context(). And yep, every row in the data frame has the prefix.
i. Bug.. We need a minimal test for this. Can you prepare one (input file(s) with one row, command used to invoke)?
How about this. Can we consider adding a remove_unused_builtins param to msdf.clean_prefix_map() for people like me? Really low priority anyway.
ii. Can you make a Documentation request in https://github.com/mapping-commons/sssom/issues (model repo), asking for clarification if built-in prefixes are required to appear in the metadata or just recommended? If it ends up as recommended, I agree with you, we will add your flag.
(i) Will do! FYI though, I'm not running a command. This is the Python API where this is a problem. Though perhaps this problem can also arise in the CLI.
(ii) Will do!
(iii) Instantiating Converter
correctly
In #514 you wrote that the Converter
should already be being instantiated automatically by MappingSetDataFrame
when it is not explicitly passed. Perhaps that is the case, but something appears to be wrong with that instantiation. My thought (sub-issue (3) in the OP) is that this should happen early on in the class's construction, and that it should do curies.Converter.from_prefix_map(metadata['curie_map'])
.
Look at the default values: https://github.com/mapping-commons/sssom-py/blob/master/src/sssom/util.py#L96
By default, it is constructed with a converter! At least if I understand the dataclass syntax correctly.
I don't know enough about dataclasses to be a huge help here, but...
We know that it is not instantiating and/or using the instantiated Converter
correctly because:
curie_map
) are different when in one case you pass the Converter
, and in another case you don't.
Results in both (i) entries missing, and (ii) entries added. I updated the OP such that we can write a test for both of these cases.msdf
when you don't pass the converter.
Precisely 4,150,794.375 times longer in my 1 recent case.I'd suggest possibly:
Add param:
curie_map: Dict = None
Update:
return curies.Converter.from_prefix_map(metadata['curie_map']) if curie_map \
else curies.chain([_get_built_in_prefix_map(), _get_default_converter()])
Delete: @lru_cache(1)
Though you could move else curies.chain([_get_built_in_prefix_map(), _get_default_converter()]
into a separate function and cache that, but it should be pretty fast so I don't know that the cache is very useful, and IDK if it might create problems. Probably not, but if the performance gains are nominal, then I'd still suggest dropping it.
Then it looks like this needs to be updated to pass metadata['curie_map']
if 'curie_map' in metadata
. IDK how but I'd imagine ChatGPT does.
Unfortunately we dont have Harshad for a while now, so we will have to sort this ourselves.
I am not enthusiastic about any of your proposed solutions, and the problem space is too large for me to reason over in the short time I have per issue.
Do you agree that by far the most critical issue you mention here is:
In my curie_map in my metadata dictionary, I have some prefixes (e.g. icd11.foundation that do not show up in the output file.
? Can you add a test to https://github.com/mapping-commons/sssom-py/blob/master/tests/test_utils.py that demonstrates the problem?
I understand this isn't something you can tackle deeply right now.
We could also do:
convs = [_get_built_in_prefix_map(), _get_default_converter()]
if curie_map:
convs = [curies.Converter.from_prefix_map(metadata['curie_map'])] + convs
return curies.chain(convs)
Yep, adding a test is in the OP tasks / on my schedule.
Can you make a Documentation request...
Done!
@matentzn Given that these built-ins are not required, do you mind if I make an issue for this suggestion?:
Can we consider adding a
remove_unused_builtins
param tomsdf.clean_prefix_map()
for people like me? Really low priority anyway.
edit: Created the issue:
Sure, feel free to make an issue, if the PR is super small, I we can release it fairly quickly.
Overview
I just noticed several issues with that arise when using
MappingSetDataFrame
and not passing anyConverter
along with it.Sub-issues
curie_map
Converter
Sub-issue details
1. Performance
msdf = MappingSetDataFrame(converter=converter, df=df, metadata=metadata)
: Takes 0:00:00.000008 secondsmsdf = MappingSetDataFrame(df=df, metadata=metadata)
: Takes 33.206355 seconds2. Incorrect
curie_map
FYI:
metadata
: icd11.sssom-metadata.yml.zip2.1. Missing entries
In my
curie_map
in mymetadata
dictionary, I have some prefixes (e.g.icd11.foundation
that do not show up in the output file.2.1.1. Add test
Prepare test w/ input file(s) with one row & Python code used to invoke. Possibly another test substituting the Python code for a command, if that is possible.
2.2. Too many entries
Compare the following results.
Converter
and passing that toMappingSetDataFrame
. I also mentioned this in #514. Can be addressed by #537. See example: ordo-icd11.sssom - with converter.tsv.zipMappingSetDataFrame
passing in mymetadata
but noConverter
. See example: ordo-icd11.sssom - no converter.tsv.zip2.2.1. Add test
3. UX: Should automatically instantiate
Converter
I haven't fully thought through possible downsides of this, but I recommend that when
MappingSetDataFrame
is instantiated using only adf
andmetadata
, but noconverter
, it should instantiate the converter viacuries.Converter.from_prefix_map(metadata['curie_map'])
.