mapping-commons / sssom-py

Python toolkit for SSSOM mapping format
https://mapping-commons.github.io/sssom-py/index.html#
MIT License
49 stars 12 forks source link

What to do when having a uri-prefix clash? #473

Closed matentzn closed 10 months ago

matentzn commented 10 months ago

Note: If you remove either one of the two prefixes, it works

cthoyt commented 10 months ago

Okay this is a good demo. Do you think it will be possible to come up with a consistent set of rules for this kind of merge where in your higher priority EPM, you have two disparate records that a lower priority EPM in the chain would merge together?

matentzn commented 10 months ago

Lets first capture this: This situation causes a genuine clash. The user is providing conflicting information because no matter how you play it, either the uri_prefix or curie_prefix will end up not conforming to the users intended preference. There are three strategies I can see here:

  1. We drop the clashing entry from the epm when the clash is detected and let the two clashing entry enter the union separately
  2. We apply first the first prefix, then the second prefix (sequentially) by the users, and spit out a warning that the preferred uri/curie prefix combo ended up X after there being a clash
  3. We provide a detailed warning saying exactly why the clash occurred (all the clashing records) and how to resolve the clash.

All have some good arguments for and against, the main thing against 1 is that it is nearly guaranteed that the split is unintentional (in fact, the two prefixes provided are probably OLD pre-EPM ingest code whose intention was to be an EPM). So I kinda don't like that 1. Two is ok, but only if we can agree on a deterministic ordering (there could be many prefix combinations participating in a clash). Something like: First apply all prefixes to the the EPM. Then all uri prefixes.

All in all, I think the correct thing in this case would be that

sssom-py CLI fails (solution 3)

matentzn commented 10 months ago

@hrshdhgd lets make sure that the test fails (assertRaises) and maybe wrap the exception in something more progressive that helps users actually dealing with the clash.

Also we should make sure that the raised exception mentions the clash. I don't know how hard/easy that is, but for now I would just make an issue in the curies repo asking for a messages that "involves all prefixes that are part of the clash", not just the last one.

cthoyt commented 10 months ago

sorry all I have not had time to write an excellent tutorial on this yet. I would like you to let curies do the hard work of documenting this to make sure that it's absolutely bulletproof. Contributions welcome if you want to get started solving this issue upstream in a PR there.

matentzn commented 10 months ago

I would like you to let curies do the hard work of documenting this to make sure that it's absolutely bulletproof.

Amazing! Thank you!

I don't want to sound ungrateful but I think its better @hrshdhgd focuses on getting the 0.4 release of sssom py out the door now; what is there is, after all, quite close to good enough for the release. Hope that is ok?

hrshdhgd commented 10 months ago

@matentzn : Test should pass now.