AnnData automatically converts int index to str, therefore matching some Gene synonyms

felix0097 commented 3 weeks ago

Report

ln.Curator.from_anndata matches wrong gene symbols. For an AnnData object where the var index are just integers, the Curator matches SCARNA10 and EGR1. Which is a bit odd as the gene names are just integers in this case.

That's the output of curator.validate:

• saving validated records of 'var_index'
✓ added 2 records from public with Gene.symbol for var_index: 'SCARNA10', 'EGR1'
• mapping var_index on Gene.symbol
!    998 terms are not validated: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
False

Here's an example on how to reproduce the bug:

!lamin init --storage ./lamin-intro --schema bionty


import anndata
import numpy as np
import pandas as pd
import lamindb as ln
import bionty as bt

adata = anndata.AnnData(
    X=np.ones((10, 1000)),
    var=pd.DataFrame(index=range(1000))
)

curator = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    organism="human",
)
curator.validate()

Version information

anndata 0.10.9 bionty 0.52.0 lamindb 0.76.15 numpy 2.1.2 pandas 2.2.3 session_info 1.0.0

annotated_types 0.7.0 anyio NA appdirs 1.4.4 appnope 0.1.4 arrow 1.3.0 asgiref 3.8.1 asttokens NA attr 24.2.0 attrs 24.2.0 babel 2.16.0 botocore 1.35.51 certifi 2024.08.30 chardet 5.2.0 charset_normalizer 3.4.0 click 8.1.7 comm 0.2.2 cython_runtime NA dateutil 2.9.0.post0 debugpy 1.8.7 decorator 5.1.1 deprecation 2.1.0 dj_database_url NA django 5.1.2 dotenv NA exceptiongroup 1.2.2 executing 2.1.0 fastjsonschema NA fastobo 0.12.3 filelock 3.16.1 fqdn NA fsspec 2024.10.0 gotrue 2.8.1 graphlib NA h11 0.14.0 h2 4.1.0 h5py 3.12.1 hpack 4.0.0 httpcore 1.0.6 httpx 0.27.2 hyperframe 6.0.1 idna 3.10 ipykernel 6.29.5 isoduration NA jedi 0.19.1 jinja2 3.1.4 jmespath 1.0.1 json5 0.9.25 jsonpointer 3.0.0 jsonschema 4.23.0 jsonschema_specifications NA jupyter_events 0.10.0 jupyter_server 2.14.2 jupyterlab_server 2.27.3 lamin_utils 0.13.7 lamindb_setup 0.80.0 lnschema_core 0.76.1 markupsafe 3.0.2 natsort 8.4.0 nbformat 5.10.4 overrides NA packaging 24.1 parso 0.8.4 pexpect 4.9.0 platformdirs 4.3.6 postgrest 0.13.2 prometheus_client NA prompt_toolkit 3.0.48 pronto 2.5.5 psutil 6.1.0 psycopg2 2.9.10 (dt dec pq3 ext lo64) ptyprocess 0.7.0 pure_eval 0.2.3 pyarrow 18.0.0 pydantic 2.9.2 pydantic_core 2.23.4 pydantic_settings 2.6.0 pydev_ipython NA pydevconsole NA pydevd 3.1.0 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.18.0 pythonjsonlogger NA pytz 2024.2 realtime 1.0.6 referencing NA requests 2.32.3 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 rich NA rpds NA scipy 1.14.1 send2trash NA six 1.16.0 sniffio 1.3.1 sqlparse 0.5.1 stack_data 0.6.3 storage3 0.5.5 strenum 0.4.15 supabase 2.2.1 supafunc NA tornado 6.4.1 traitlets 5.14.3 typing_extensions NA upath 0.2.5 uri_template NA urllib3 2.2.3 wcwidth 0.2.13 webcolors 24.8.0 websocket 1.8.0 websockets 12.0 yaml 6.0.2 zmq 26.2.0 zoneinfo NA

IPython 8.29.0 jupyter_client 8.6.3 jupyter_core 5.7.2 jupyterlab 4.2.5

Python 3.10.15 (main, Oct 3 2024, 02:33:33) [Clang 14.0.6 ] macOS-10.16-x86_64-i386-64bit

Session information updated at 2024-10-31 10:38

Zethson commented 2 weeks ago

Okay so unfortunately that's a synonyms issue:

EGR1 has the synonym 225
SCARNA10 has the synonym 52

@felix0097 We can consider dropping those synonyms because I really don't know who'd ever want to use numbers as gene names. @sunnyosun is dropping them the best idea? WDYT?

falexwolf commented 2 weeks ago

Why not deal with this based on type comparison?

Re-phrase: Why don't we just stop casting integers to strings? It seems dangerous and unnecessary.

Zethson commented 2 weeks ago

Ah yeah, that's a fair point. I'll have a look. Thanks!

In any case, it's weird to have numbers in there in the first place.

falexwolf commented 2 weeks ago

Ah yeah, that's a fair point. I'll have a look. Thanks!

Great! I hope that's really simple. It's a fundamental issue with the code that I think should be fixed & tested soon.

In any case, it's weird to have numbers in there in the first place.

Yes, but we can't do anything about it. Our expectation unfortunately needs to be that there is a high amount of noise/nonsense in public sources.

Hence: our implementation needs to be as robust as it can be in the presence of noise rather than attempting to "clean ontologies from noise". Cleaning ontologies might be adequate in some particular cases but I don't believe it's adequate here. There might be numbers in other synonym fields of ontologies as well.

Zethson commented 2 weeks ago

Here's the issue:

adata = anndata.AnnData(
    X=np.ones((10, 1000)),
    var=pd.DataFrame(index=range(1000))
)

gives us a str (categorical) index. This is documented behavior (https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.html):

Indexing into an AnnData object can be performed by relative position with numeric indices (like pandas’ iloc()), or by labels (like loc()). To avoid ambiguity with numeric indexing into observations or variables, indexes of the AnnData object are converted to strings by the constructor.

I can perform type checks and see whether the input is "number like" but it's of course not perfect...

Zethson commented 2 weeks ago

I talked a lot with Phil about this. A perfect technical solution seems out of reach but maybe we can improve the situation at least. Ideas:

Doing nothing and considering this a weird edge case. Phil says that type checking could become easier when AnnData switches to UUIDs.
Attempting to guess whether the input identifiers are "number like" and then type checking.
Removing these weird numbers from the gene ontology. I highly doubt that other ontologies have this issue or that these numbers are even useful. Who uses such gene symbol names ?!?
Being really opinionated and always converting gene symbols to ensembl ids and then validating these. We could do that internally or even be super opinionated and modify the input AnnData object then.

felix0097 commented 2 weeks ago

Yes, but we can't do anything about it. Our expectation unfortunately needs to be that there is a high amount of noise/nonsense in public sources.

Hence: our implementation needs to be as robust as it can be in the presence of noise rather than attempting to "clean ontologies from noise". Cleaning ontologies might be adequate in some particular cases but I don't believe it's adequate here. There might be numbers in other synonym fields of ontologies as well.

Agree those ontologies are often noisy and it's pretty much impossible for us to clean them (that would probably be a huge community effort). However, I would say that it can be dangerous to just plainly trust them then, e.g. auto-matching to ontology terms without having a human in the loop or at least without giving the option to check them manually (again this is very hard if you have 40k genes to check). I only realized it in that case, cause I was expecting no matches.

Some odd matches, might lead to some weird behavior upstream, which is pretty much impossible to debug or at least very hard.

falexwolf commented 2 weeks ago

Proposed solution: Why does .validate() validate against synonyms? I think it shouldn't unless the user explicitly asks for it (even though I think that even then this is dangerous; I'd rather not make this possible). Hence, how about removing it and we'll no longer have this problem. I remember having multiple discussions with @sunnyosun about this but can't recall why we would consider a synonym "valid".

Background notes:

Indexing into an AnnData object can be performed by relative position with numeric indices (like pandas’ iloc()), or by labels (like loc()). To avoid ambiguity with numeric indexing into observations or variables, indexes of the AnnData object are converted to strings by the constructor.

Darn, I forgot about this. For many cases, the magical [] that relies on casting any index to str makes things easier for users. But yeah, one should have not done the casting and rather added .iloc for those cases in which people pass integer indexes. Isaac was very unhappy about it.

easier when AnnData switches to UUIDs

Interesting. My last discussion with Isaac about this dates 3 years back I guess. Back then I'd have just removed the casting and added .iloc. But well.

sunnyosun commented 2 weeks ago

These synonyms are not noise, and I believe ensembl is one of the most curated ontologies as everyone uses them. Also, other ontology sources we use are reasonably well-curated.

For instance: 225 is an actual synonym of EGR1:

The current Curator does require a human in the loop and it's not possible to .save_artifact without being validated, so I don't think there's a danger as long as users aren't blindly running .add_new_from().

sunnyosun commented 2 weeks ago

Proposed solution: Why does .validate() validate against synonyms?

Because this is a high-level function meant to require minimal work from users, if we ditch synonym mapping, this curation process will get longer and require more work.

We had no synonyms mapping at the beginning and at some point changed it because the scrna guide was too long/complicated.

falexwolf commented 2 weeks ago

Ok, so, we're back to the big box that clarifies what "validated" means. We need this box and you should add it ASAP @sunnyosun @Zethson to this page: https://docs.lamin.ai/curate

I'm then almost certain that most users and we as data engineers don't want to consider a synonym to be a validated value.

This is the whole point of .standardize(). If you don't call .standardize(), your values will not validate when they contain synonyms.

falexwolf commented 2 weeks ago

We had no synonyms mapping at the beginning and at some point changed it because the scrna guide was too long/complicated.

I think we need a solution that ascertains a meaningful definition of what a "validated dataset" is. As I said, I'm almost entirely sure that a validated dataset should not contain synonyms of anything.

In the second step we need to make the curation process bearable and concise. Calling .standardize() under the hood is something one can consider, but also seems somewhat dangerous.

falexwolf commented 2 weeks ago

We had no synonyms mapping at the beginning and at some point changed it because the scrna guide was too long/complicated.

I wasn't aware of this change.

As it looks to me right now, this is a great danger for the integrity of the lakehouse and should be reverted ASAP.

Apologies if I'm misunderstanding things.

falexwolf commented 2 weeks ago

We already have an "Example" box here: https://docs.lamin.ai/curate

Imagine the "Experiment 1" label has a synonym "Experiment_1".

Do we want to consider a "Experiment_1" valid value under ULabel.name? I don't think so.

Of course we can call it valid under ULabel.synonyms but that's not what people should be using for curating data. To Curator classes, we pass constraints along the lines of Gene.ensembl_gene_id and not constraints that involve synonyms.

laminlabs / lamindb