singnet / das-query-engine

Query engine and pattern matcher
MIT License

Compositional query doesn't work on bioAS subset #231

Closed andre-senna closed 5 months ago

andre-senna commented 5 months ago

@CICS-Oleg

Hey Oleg, can you please provide more details? For instance, how to reproduce the error?

CICS-Oleg commented 5 months ago

@andre-senna Hi! When I try to run the following query on your actual endpoint:

from hyperon_das import DistributedAtomSpace

# host, port and query_params are defined elsewhere (not shown in this comment)
das = DistributedAtomSpace(query_engine='remote', host=host, port=port)
print(f"Connected to DAS at {host}:{port}")
print("(nodes, links) =", das.count_atoms())

# Two patterns that share the variable $gene
query = [
    {'atom_type': 'link', 'type': 'Expression', 'targets': [
        {'atom_type': 'node', 'type': 'Symbol', 'name': 'gene'},
        {'atom_type': 'variable', 'name': '$gene'}]},
    {'atom_type': 'link', 'type': 'Expression', 'targets': [
        {'atom_type': 'node', 'type': 'Symbol', 'name': 'gene_type'},
        {'atom_type': 'link', 'type': 'Expression', 'targets': [
            {'atom_type': 'node', 'type': 'Symbol', 'name': 'gene'},
            {'atom_type': 'variable', 'name': '$gene'}]},
        {'atom_type': 'node', 'type': 'Symbol', 'name': 'protein_coding'}]},
]

for mapping, subgraph in das.query(query, query_params):
    print(type(mapping))

I get an empty list.
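For readers unfamiliar with this kind of query, here is a minimal in-memory sketch of what a list of patterns sharing a variable is expected to do, assuming the list is treated as a conjunction (which is what "compositional query" in the issue title suggests). The functions `match` and `query_all`, the tuple encoding of expressions, and the sample gene IDs are all hypothetical and are not part of the DAS API.

```python
def match(pattern, atom, bindings):
    """Return bindings extended so that pattern matches atom, or None."""
    if isinstance(pattern, str) and pattern.startswith("$"):
        bound = bindings.get(pattern)
        if bound is None:
            # Unbound variable: bind it to whatever sits at this position.
            return {**bindings, pattern: atom}
        return bindings if bound == atom else None
    if isinstance(pattern, tuple):
        if not isinstance(atom, tuple) or len(pattern) != len(atom):
            return None
        for sub_pattern, sub_atom in zip(pattern, atom):
            bindings = match(sub_pattern, sub_atom, bindings)
            if bindings is None:
                return None
        return bindings
    # Plain symbol: must be identical.
    return bindings if pattern == atom else None


def query_all(patterns, atoms):
    """Conjunction: every pattern must match under one consistent binding."""
    results = [{}]
    for pattern in patterns:
        results = [
            extended
            for bindings in results
            for atom in atoms
            if (extended := match(pattern, atom, bindings)) is not None
        ]
    return results


# Hypothetical toy data; expressions are nested tuples of symbol names.
atoms = [
    ("gene", "ENSG1"),
    ("gene", "ENSG2"),
    ("gene_type", ("gene", "ENSG1"), "protein_coding"),
    ("gene_type", ("gene", "ENSG2"), "lncRNA"),
]
patterns = [
    ("gene", "$gene"),
    ("gene_type", ("gene", "$gene"), "protein_coding"),
]

print(query_all(patterns, atoms))
# Same query when the gene_type expressions are missing from the store.
print(query_all(patterns, atoms[:2]))
```

In this toy model, the conjunction is empty as soon as the expressions needed by any one pattern are absent from the store, even if the other pattern has matches.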

CICS-Oleg commented 5 months ago

Sorry, I forgot to provide an example when I created the issue.

CICS-Oleg commented 5 months ago

I have the same issue with queries of this type:

[
    {'atom_type': 'link', 'type': 'Expression', 'targets': [
        {'atom_type': 'node', 'type': 'Symbol', 'name': 'gene_name'},
        {'atom_type': 'link', 'type': 'Expression', 'targets': [
            {'atom_type': 'node', 'type': 'Symbol', 'name': 'gene'},
            {'atom_type': 'variable', 'name': '$ens'}]},
        {'atom_type': 'node', 'type': 'Symbol', 'name': 'IRX3'}]},
    {'atom_type': 'link', 'type': 'Expression', 'targets': [
        {'atom_type': 'node', 'type': 'Symbol', 'name': 'genes_pathways'},
        {'atom_type': 'link', 'type': 'Expression', 'targets': [
            {'atom_type': 'node', 'type': 'Symbol', 'name': 'gene'},
            {'atom_type': 'variable', 'name': '$ens'}]},
        {'atom_type': 'variable', 'name': '$p'}]},
]

andre-senna commented 5 months ago

@CICS-Oleg @Necr0x0Der

The problem is that the file containing the expressions you are searching for failed to parse and was never loaded. I didn't notice it before. Anyway, the parser failed because of expressions like this:

(synonyms (gene ENSG00000278267) (microRNA_6859-1 hsa-mir-6859-1 HGNC:50039 microRNA_mir-6859-1 MIR6859-1))

These have a : in the middle of a symbol name. Since : has a special meaning, I forbade its use as part of symbol names in the parser.

I think it's really, really weird to have all those bizarre symbol names, like this one and others such as DEAD/H_\(Asp-Glu-Ala-Asp/His\)_box_polypeptide_11_like_1, which is present in the same file. There are others even more bizarre, with full paragraphs of text. I believe all those symbols should be literals instead, enclosed in double quotes " ".

So it seems to me like it's a problem with the dataset. What do you think?

Necr0x0Der commented 5 months ago

@andre-senna, we also had issues with these files, and asked to replace ( with \( so that brackets for expressions could be distinguished from brackets in names/symbols. : should be less problematic, because it has a special meaning only when it is a separate symbol. In particular, MeTTa doesn't prevent you from introducing custom symbols such as ::. It depends on your parser, but : can indeed be used as a part of other symbols or tokens.

Double quotes " " have a special meaning in MeTTa. If something is enclosed in " ", it is converted into a grounded atom (of String type), which is fundamentally different from a symbol atom. In particular, A and "A" cannot be matched. Also, grounded atoms containing strings are always of String type. You cannot have, say, "A" of Gene type. Right now, the interpreter atomspace doesn't index grounded atoms, so we didn't recommend using strings (" ") for values that may be used as keys for retrieval.

While I agree that long descriptions would be better represented as grounded strings (like "DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 11 like 1"), something like HGNC:50039 looks more like a symbol. As a temporary solution, you can ask the author of these files to use " ", of course. However, we have interoperability issues with grounded atoms. For example, 2 is a grounded atom in MeTTa, but it is also represented as a symbol in DAS. I'm not sure if there is a good workaround for this without type information, so we may need to discuss all these issues together.
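To illustrate the distinction described above, here is a toy Python model, not MeTTa or DAS code. The classes `Symbol`, `GroundedString` and `ToyAtomIndex` are hypothetical names invented for this sketch. It only shows two points from the comment: a symbol and a quoted string with the same text are different atoms, and an index that skips grounded atoms cannot use strings as retrieval keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """A plain symbol atom, e.g. HGNC:50039."""
    name: str


@dataclass(frozen=True)
class GroundedString:
    """A grounded atom wrapping a string, e.g. "A" (always of String type)."""
    value: str


class ToyAtomIndex:
    """Indexes facts by the symbols they contain; grounded atoms are skipped."""

    def __init__(self):
        self._by_symbol = {}

    def add(self, fact):
        for atom in fact:
            # Assumption of this sketch: only symbol atoms become index keys.
            if isinstance(atom, Symbol):
                self._by_symbol.setdefault(atom, []).append(fact)

    def lookup(self, atom):
        return self._by_symbol.get(atom, [])


index = ToyAtomIndex()
id_fact = (Symbol("synonyms"), Symbol("HGNC:50039"))
text_fact = (Symbol("description"), GroundedString("some long description"))
index.add(id_fact)
index.add(text_fact)

print(Symbol("A") == GroundedString("A"))        # the two kinds never compare equal
print(index.lookup(Symbol("HGNC:50039")))        # retrievable by symbol
print(index.lookup(GroundedString("some long description")))  # not indexed
```

Under these assumptions, identifiers that serve as lookup keys are better kept as symbols, while free-text descriptions can be strings.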

andre-senna commented 5 months ago

@Necr0x0Der @CICS-Oleg

> so we may need to discuss all these issues together.

Yes, I agree.

So I'll change the parser to accept HGNC:50039 as a valid symbol name. I believe it's an easy change.
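As an illustration only, and not the actual DAS parser change, the following sketch shows one way a tokenizer for MeTTa-like expressions could accept : inside symbol names while still treating \( and \) as escaped parentheses and double-quoted text as a separate token kind. The names `TOKEN_RE` and `tokenize` are hypothetical.

```python
import re

# One alternative per token kind; whitespace between tokens is skipped below.
TOKEN_RE = re.compile(
    r'''
    (?P<open>\()                      # start of an expression
  | (?P<close>\))                     # end of an expression
  | (?P<string>"(?:[^"\\]|\\.)*")     # double-quoted literal
  | (?P<symbol>(?:\\.|[^\s()"\\])+)   # symbol; a backslash escapes the next character
    ''',
    re.VERBOSE,
)


def tokenize(text):
    """Split text into (kind, value) pairs; ':' is an ordinary symbol character."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


expr = ("(synonyms (gene ENSG00000278267) "
        "(microRNA_6859-1 hsa-mir-6859-1 HGNC:50039 microRNA_mir-6859-1 MIR6859-1))")
for kind, value in tokenize(expr):
    print(kind, value)
```

In this sketch a standalone : is also emitted as a symbol token, so any special meaning would be applied by a later parsing stage rather than by the tokenizer.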

Thanks!

andre-senna commented 5 months ago

@CICS-Oleg @Necr0x0Der

This is fixed and available in das-cli version 0.2.9 or above.