monarch-initiative / koza

Data transformation framework for LinkML data models
https://koza.monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Potentially undesired behavior with "in" filter in row_filter.py #145

Closed DnlRKorn closed 3 months ago

DnlRKorn commented 3 months ago

For row filters, we have an "in" filter that covers the case "filter if the column's value is within a list defined in the YAML".

https://github.com/monarch-initiative/koza/blob/7087ee6e1bbd347a8f792fefeca477a58288ed18/src/koza/utils/row_filter.py#L55-L56

An example of this behavior occurs with the go_annotation ingest. https://github.com/monarch-initiative/monarch-ingest/blob/72adbffd168d5f59cce83306917d2aeffd0b2602/src/monarch_ingest/ingests/go/annotation.yaml#L27-L49

However, this fails on the following row in 9606.go_annotations.gaf:

RNAcentral URS0000316FA5_9606 URS0000316FA5_9606 acts_upstream_of_negative_effect GO:0051607 PMID:26222045 IDA P Homo sapiens (human) hsa-miR-26b-5p miRNA taxon:9606|taxon:12814 20190709 ARUK-UCL occurs_in(CL:2000001)

This is because the taxon column's value, "taxon:9606|taxon:12814", doesn't exactly match any of the options defined in go/annotation.yaml.
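
The mismatch can be reproduced in isolation. This is only a sketch: it assumes the filter boils down to a plain Python membership test, and the filter values shown are illustrative stand-ins, not the actual list in go/annotation.yaml.

```python
# Hypothetical subset of taxon values an "in" filter might list.
filter_values = ["taxon:9606", "taxon:10090"]

# Taxon field from the failing GAF row (two taxa joined by a pipe).
column_value = "taxon:9606|taxon:12814"

# Exact membership: the combined string is not one of the listed values.
exact_match = column_value in filter_values

# Splitting on the pipe shows that the human taxon is present in the field.
tokens = column_value.split("|")
token_match = any(token in filter_values for token in tokens)

print(exact_match, token_match)  # False True
```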

DnlRKorn commented 3 months ago

This issue would be resolved by the following change; however, I need to explore how it would affect the other "in" filters we have defined in Monarch-Ingest.

    def inlist(self, column_value, filter_values):
        # Match on exact membership, or when any filter value occurs inside the column value.
        filter_in_column = any(filter_value in column_value for filter_value in filter_values)
        return (column_value in filter_values) or filter_in_column
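
One thing to check when exploring the other "in" filters: substring matching would also let a filter value such as "taxon:9606" match a longer identifier like "taxon:96061". A possible alternative, sketched here as a standalone function, compares whole pipe-separated tokens instead. The name `inlist_tokens` and the `delimiter` parameter are hypothetical and not part of koza.

```python
def inlist_tokens(column_value, filter_values, delimiter="|"):
    """Return True if the whole value, or any delimited token in it, is in filter_values.

    Hypothetical helper for illustration only; not part of koza's row filter.
    """
    # Keep the existing exact-match behavior first.
    if column_value in filter_values:
        return True
    # Only strings can be split; other types rely on the exact check above.
    if not isinstance(column_value, str):
        return False
    # Compare whole tokens so partial identifiers do not match.
    return any(token in filter_values for token in column_value.split(delimiter))


multi_taxon = inlist_tokens("taxon:9606|taxon:12814", ["taxon:9606"])  # True
lookalike = inlist_tokens("taxon:96061", ["taxon:9606"])               # False
```

Whether token matching or substring matching is the right choice depends on how the other ingests use "in" filters, so this is a starting point and not a drop-in replacement.
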
DnlRKorn commented 3 months ago

Closed with #146