felix0097 opened this issue 1 week ago
Okay, so unfortunately that's a synonyms issue: `EGR1` has the synonym `225`, and `SCARNA10` has the synonym `52`.
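To make the collision concrete, here is a minimal sketch in plain pandas (no lamindb involved; the two-row synonym table is made up for illustration) of how numeric synonyms end up matching a str-cast integer index:

```python
import pandas as pd

# hypothetical excerpt of a synonym table
synonyms = pd.DataFrame(
    {"symbol": ["EGR1", "SCARNA10"], "synonym": ["225", "52"]}
)

# AnnData casts range(1000) to the strings "0" ... "999"
var_index = pd.Index([str(i) for i in range(1000)])

# a synonym-aware matcher will therefore report these accidental hits
hits = synonyms[synonyms["synonym"].isin(var_index)]
print(hits)  # both "225" -> EGR1 and "52" -> SCARNA10 match
```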
@felix0097 We can consider dropping those synonyms because I really don't know who'd ever want to use numbers as gene names. @sunnyosun, is dropping them the best idea? WDYT?
Why not deal with this based on type comparison?
Re-phrase: Why don't we just stop casting integers to strings? It seems dangerous and unnecessary.
Ah yeah, that's a fair point. I'll have a look. Thanks!
In any case, it's weird to have numbers in there in the first place.
> Ah yeah, that's a fair point. I'll have a look. Thanks!

Great! I hope that's really simple. It's a fundamental issue with the code that I think should be fixed & tested soon.
> In any case, it's weird to have numbers in there in the first place.
Yes, but we can't do anything about it. Our expectation unfortunately needs to be that there is a high amount of noise/nonsense in public sources.
Hence: our implementation needs to be as robust as it can be in the presence of noise rather than attempting to "clean ontologies from noise". Cleaning ontologies might be adequate in some particular cases but I don't believe it's adequate here. There might be numbers in other synonym fields of ontologies as well.
Here's the issue:
```python
import anndata
import numpy as np
import pandas as pd

adata = anndata.AnnData(
    X=np.ones((10, 1000)),
    var=pd.DataFrame(index=range(1000)),
)
# adata.var.index is now Index(['0', '1', ..., '999'], dtype='object')
```
gives us a str (categorical) index. This is documented behavior (https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.html):
> Indexing into an AnnData object can be performed by relative position with numeric indices (like pandas’ `iloc()`), or by labels (like `loc()`). To avoid ambiguity with numeric indexing into observations or variables, indexes of the AnnData object are converted to strings by the constructor.
I can perform type checks and see whether the input is "number-like", but that's of course not perfect...
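As a rough illustration, here is a minimal "number-like" check on an index of strings; `looks_numeric` is a hypothetical helper, not part of any lamindb API:

```python
import pandas as pd

def looks_numeric(index: pd.Index) -> bool:
    """Hypothetical helper: True if every label parses as an integer."""
    return all(str(label).lstrip("-").isdigit() for label in index)

# the str-cast index from the example above would be flagged ...
assert looks_numeric(pd.Index([str(i) for i in range(1000)]))
# ... while a genuine gene-symbol index would not
assert not looks_numeric(pd.Index(["EGR1", "SCARNA10"]))
```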
I talked a lot with Phil about this. A perfect technical solution seems out of reach but maybe we can improve the situation at least. Ideas:
> Yes, but we can't do anything about it. Our expectation unfortunately needs to be that there is a high amount of noise/nonsense in public sources.
>
> Hence: our implementation needs to be as robust as it can be in the presence of noise rather than attempting to "clean ontologies from noise". Cleaning ontologies might be adequate in some particular cases but I don't believe it's adequate here. There might be numbers in other synonym fields of ontologies as well.
Agreed, those ontologies are often noisy and it's pretty much impossible for us to clean them (that would probably be a huge community effort). However, I would say it's dangerous to plainly trust them then, e.g. auto-matching to ontology terms without a human in the loop, or at least without the option to check matches manually (again, very hard if you have 40k genes to check). I only realized it in this case because I was expecting no matches.

A few odd matches might lead to weird behavior downstream, which is very hard, if not impossible, to debug.
Proposed solution: Why does `.validate()` validate against synonyms? I think it shouldn't unless the user explicitly asks for it (even though even then it's dangerous; I'd rather not make it possible). Hence, how about removing it, and we'll no longer have this problem. I remember having multiple discussions with @sunnyosun about this but can't recall why we would consider a synonym "valid".
Background notes:
> Indexing into an AnnData object can be performed by relative position with numeric indices (like pandas’ `iloc()`), or by labels (like `loc()`). To avoid ambiguity with numeric indexing into observations or variables, indexes of the AnnData object are converted to strings by the constructor.
Darn, I forgot about this. For many cases, the magical `[]` that relies on casting any index to str makes things easier for users. But yeah, one shouldn't have done the casting and rather added `.iloc` for those cases in which people pass integer indexes. Isaac was very unhappy about it.
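For context, a small sketch of the ambiguity that the str-casting sidesteps, following the behavior described in the anndata docs quoted above:

```python
import anndata
import numpy as np
import pandas as pd

adata = anndata.AnnData(
    X=np.ones((10, 1000)),
    var=pd.DataFrame(index=range(1000)),
)

# with the integer-derived index cast to str, both of these resolve:
sub_pos = adata[:, 5]    # relative position, like pandas .iloc
sub_lab = adata[:, "5"]  # label lookup, like pandas .loc
# without the cast, adata[:, 5] would be ambiguous between position and label
```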
> easier when AnnData switches to UUIDs
Interesting. My last discussion with Isaac about this dates back 3 years, I guess. Back then I'd have just removed the casting and added `.iloc`. But well.
These synonyms are not noise, and I believe Ensembl is one of the most curated ontologies, as everyone uses it. Also, the other ontology sources we use are reasonably well curated.
For instance, `225` is an actual synonym of `EGR1`.
The current Curator does require a human in the loop, and it's not possible to `.save_artifact()` without being validated, so I don't think there's a danger as long as users aren't blindly running `.add_new_from()`.
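A rough sketch of that guarded flow, assuming the Curator methods named in this thread (`validate`, `add_new_from`, `save_artifact`); the `"var_index"` key and the `description` argument are assumptions:

```python
# continuing from a `curator` created via ln.Curator.from_anndata(...)
# as in the report below
if not curator.validate():
    # a human inspects the non-validated values here; only register new
    # records after explicit manual review, never blindly
    curator.add_new_from("var_index")

artifact = curator.save_artifact(description="curated AnnData")
```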
> Proposed solution: Why does `.validate()` validate against synonyms?
Because this is a high-level function meant to require minimal work from users. If we ditch synonym mapping, the curation process will get longer and require more work.
We had no synonym mapping at the beginning and at some point changed it because the scrna guide was too long/complicated.
Ok, so we're back to the big box that clarifies what "validated" means. We need this box, and you should add it ASAP, @sunnyosun @Zethson, to this page: https://docs.lamin.ai/curate
I'm then almost certain that most users, and we as data engineers, don't want to consider a synonym to be a validated value.

This is the whole point of `.standardize()`. If you don't call `.standardize()`, your values will not validate when they contain synonyms.
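A hedged sketch of that two-step flow with bionty's Gene registry; the exact keyword arguments (`field=`, `organism=`) are assumptions based on the registry methods named in this thread:

```python
import bionty as bt

values = ["EGR1", "225"]  # "225" is a synonym of EGR1, not a symbol

# without standardization, the synonym does not validate against symbols
bt.Gene.validate(values, field=bt.Gene.symbol, organism="human")
# expected: [True, False]

# mapping synonyms onto standardized symbols first makes validation pass
standardized = bt.Gene.standardize(values, organism="human")
bt.Gene.validate(standardized, field=bt.Gene.symbol, organism="human")
# expected: [True, True]
```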
> We had no synonym mapping at the beginning and at some point changed it because the scrna guide was too long/complicated.
I think we need a solution that ascertains a meaningful definition of what a "validated dataset" is. As I said, I'm almost entirely sure that a validated dataset should not contain synonyms of anything.
In a second step, we need to make the curation process bearable and concise. Calling `.standardize()` under the hood is something one can consider, but it also seems somewhat dangerous.
> We had no synonym mapping at the beginning and at some point changed it because the scrna guide was too long/complicated.
I wasn't aware of this change.
As it looks to me right now, this is a great danger for the integrity of the lakehouse and should be reverted ASAP.
Apologies if I'm misunderstanding things.
We already have an "Example" box here: https://docs.lamin.ai/curate
Imagine the "Experiment 1"
label has a synonym "Experiment_1"
.
Do we want to consider a "Experiment_1"
valid value under ULabel.name
? I don't think so.
Of course we can call it valid under ULabel.synonyms
but that's not what people should be using for curating data. To Curator
classes, we pass constraints along the lines of Gene.ensembl_gene_id
and not constraints that involve synonyms.
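To illustrate with lamindb's `ULabel` registry (a hedged sketch; the label/synonym setup is hypothetical and the method behavior is assumed from the registry API named in this thread):

```python
import lamindb as ln

label = ln.ULabel(name="Experiment 1").save()
label.add_synonym("Experiment_1")

# the synonym should not count as a valid value under ULabel.name ...
ln.ULabel.validate(["Experiment_1"], field=ln.ULabel.name)  # expected: [False]
# ... but standardize() can map it onto the proper name first
ln.ULabel.standardize(["Experiment_1"])  # expected: ["Experiment 1"]
```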
Report
`ln.Curator.from_anndata` matches wrong gene symbols. For an AnnData object where the var index contains just integers, the Curator matches `SCARNA10` and `EGR1`, which is a bit odd as the gene names are just integers in this case.

That's the output of `curator.validate()`:

Here's an example of how to reproduce the bug:
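The reproduction snippet itself isn't preserved above, so here is a hedged reconstruction from this thread; the `var_index=bt.Gene.symbol` and `organism="human"` arguments are assumptions rather than details taken from the report:

```python
import anndata
import bionty as bt
import lamindb as ln
import numpy as np
import pandas as pd

# var index of pure integers; AnnData casts it to the strings "0" ... "999"
adata = anndata.AnnData(
    X=np.ones((10, 1000)),
    var=pd.DataFrame(index=range(1000)),
)

curator = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    organism="human",
)
curator.validate()  # unexpectedly reports "225" -> EGR1 and "52" -> SCARNA10
```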
Version information
```text
anndata      0.10.9
bionty       0.52.0
lamindb      0.76.15
numpy        2.1.2
pandas       2.2.3
session_info 1.0.0
```
annotated_types 0.7.0 anyio NA appdirs 1.4.4 appnope 0.1.4 arrow 1.3.0 asgiref 3.8.1 asttokens NA attr 24.2.0 attrs 24.2.0 babel 2.16.0 botocore 1.35.51 certifi 2024.08.30 chardet 5.2.0 charset_normalizer 3.4.0 click 8.1.7 comm 0.2.2 cython_runtime NA dateutil 2.9.0.post0 debugpy 1.8.7 decorator 5.1.1 deprecation 2.1.0 dj_database_url NA django 5.1.2 dotenv NA exceptiongroup 1.2.2 executing 2.1.0 fastjsonschema NA fastobo 0.12.3 filelock 3.16.1 fqdn NA fsspec 2024.10.0 gotrue 2.8.1 graphlib NA h11 0.14.0 h2 4.1.0 h5py 3.12.1 hpack 4.0.0 httpcore 1.0.6 httpx 0.27.2 hyperframe 6.0.1 idna 3.10 ipykernel 6.29.5 isoduration NA jedi 0.19.1 jinja2 3.1.4 jmespath 1.0.1 json5 0.9.25 jsonpointer 3.0.0 jsonschema 4.23.0 jsonschema_specifications NA jupyter_events 0.10.0 jupyter_server 2.14.2 jupyterlab_server 2.27.3 lamin_utils 0.13.7 lamindb_setup 0.80.0 lnschema_core 0.76.1 markupsafe 3.0.2 natsort 8.4.0 nbformat 5.10.4 overrides NA packaging 24.1 parso 0.8.4 pexpect 4.9.0 platformdirs 4.3.6 postgrest 0.13.2 prometheus_client NA prompt_toolkit 3.0.48 pronto 2.5.5 psutil 6.1.0 psycopg2 2.9.10 (dt dec pq3 ext lo64) ptyprocess 0.7.0 pure_eval 0.2.3 pyarrow 18.0.0 pydantic 2.9.2 pydantic_core 2.23.4 pydantic_settings 2.6.0 pydev_ipython NA pydevconsole NA pydevd 3.1.0 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.18.0 pythonjsonlogger NA pytz 2024.2 realtime 1.0.6 referencing NA requests 2.32.3 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 rich NA rpds NA scipy 1.14.1 send2trash NA six 1.16.0 sniffio 1.3.1 sqlparse 0.5.1 stack_data 0.6.3 storage3 0.5.5 strenum 0.4.15 supabase 2.2.1 supafunc NA tornado 6.4.1 traitlets 5.14.3 typing_extensions NA upath 0.2.5 uri_template NA urllib3 2.2.3 wcwidth 0.2.13 webcolors 24.8.0 websocket 1.8.0 websockets 12.0 yaml 6.0.2 zmq 26.2.0 zoneinfo NA
IPython 8.29.0 jupyter_client 8.6.3 jupyter_core 5.7.2 jupyterlab 4.2.5
Python 3.10.15 (main, Oct 3 2024, 02:33:33) [Clang 14.0.6 ] macOS-10.16-x86_64-i386-64bit
Session information updated at 2024-10-31 10:38