opensearch-project / opensearch-net

OpenSearch .NET Client
Apache License 2.0
107 stars 49 forks source link

[FEATURE] - Performing complex relational queries is merely impossible. #446

Open DumboJetEngine opened 11 months ago

DumboJetEngine commented 11 months ago

Is your feature request related to a problem?

While trying to translate some SQL relational queries into OpenSearch queries using nested documents and the OpenSearch.Client, I got into MANY problems due to the fact that I had to gather all nested fields under a single nested query. I wrote a question here, but I am not sure if is clear enough.

Essentially, when the SQL query is complex and involves many JOINed tables, the nested queries in OpenSearch become too complex to write by hand.

For example, say you have this SQL query:

SELECT *
FROM Document d
INNER JOIN NestedType1 t1 ON t1.Id = d.Id  -- nested document 1
INNER JOIN NestedType2 t2 ON t2.Id = d.Id  -- nested document 2
WHERE (
    (t1.Key = 1 OR t2.Value = 7) AND (t1.Key2 = 5 OR t2.Value = 9 OR d.Id = 7)
)

To translate that into an OpenSearch query, after you create the nested documents for t1 & t2, you have to gather all t1 filters under a single nested query, all the t2 filters under another nested query, and all non-nested field filters outside the nested queries.

By doing some manual boolean transformations one could shape the query like that:

(t1.Key = 1 || t2.Value = 7) && (t1.Key2 = 5 || t2.Value = 9 || d.Id = 7)

    ===>

(t1.Key = 1 && t1.Key2= 5) ||
(t1.Key = 1 && t2.Value = 9) ||
(t1.Key = 1 && d.Id = 7) ||
(t2.Value = 7 && t1.Key = 5) ||
(t2.Value = 7 && t2.Value = 9) ||  -- not possible
(t2.Value = 7 && d.Id = 7) 

    ===>

( NestedQuery(t1.Key = 1 && t1.Key2= 5) ) ||
( NestedQuery(t1.Key = 1) && NestedQuery(t2.Value = 9) ) ||
( NestedQuery((t1.Key = 1)) && d.Id = 7 ) ||
( NestedQuery(t2.Value = 7) && NestedQuery(t1.Key = 5) ) ||
( NestedQuery(t2.Value = 7) && d.Id = 7) 

This gets very easily out of hand for more complex queries, and even more for dynamic queries where you cannot statically analyze and simplify the queries beforehand.

What solution would you like?

A way to preprocess queries, in order to gather all nested fields under a single nested query would be great, since this cannot be done manually most of the times. In the example above, it would be great if the mentioned query "simplification" and gathering of all:

The way I am imagining the implementation is something like this, which uses C# Expression trees in the .Relational() method:

var response = openSearchClient.Search<Document>(s => s
  .Index(documentsIndexName)
    .Query(q => q
      .Bool(bq => bq
        .Filter(mq =>
        {
          return mq.Relational(rq =>
            return  (  rq.Term(d => d.t1.First().Key, 1)  ||  rq.Term(d => d.t2.First().Value, 7)  ) &&
                (  rq.Term(d => d.t1.First().Key2, 5)  ||  rq.Term(d => d.t2.First().Value, 9) || rq.Term(d => d.id, 7) );
          );
        })
      )
    )
  );

I guess a similar feature must already exist in the SQL plugin for OS, since that problem would also be present there and had to be solved somehow, although the SQL plugin has some restrictions from what I can recall.

What alternatives have you considered?

DumboJetEngine commented 11 months ago

Correcting myself on this statement :

I guess a similar feature must already exist in the SQL plugin for OS, since that problem would also be present there and had to be solved somehow

Turns out, the SQL plugin has the same (or similar) issues (check bullet #4): https://opensearch.org/docs/latest/search-plugins/sql/sql/complex/

Quoting the page :

In a WHERE statement, don’t combine trees that contain multiple indexes.

So, using the SQL plugin does not seem to be a viable solution at all.

DumboJetEngine commented 11 months ago

And on a second review, using C# expressions does not seem like a good idea (too many limitations). Probably, preprocessing the query before sending it to O.S. would be better. And ask for it like that :

var response = openSearchClient.Search<Document>(s => s
  .Index(documentsIndexName)
    .Query(q => q
      .Bool(bq => bq
        .Filter(rq =>
        {
            return  (  rq.Term(d => d.t1.First().Key, 1)  ||  rq.Term(d => d.t2.First().Value, 7)  ) &&
                (  rq.Term(d => d.t1.First().Key2, 5)  ||  rq.Term(d => d.t2.First().Value, 9) || rq.Term(d => d.id, 7) );
        })
      )
    )
    .RelationalMode(true)
  );

But I am not 100% sure how one could exclude specific nested queries from being "relationalized". Perhaps with an extra parameter that could be specified on each nested query.

DumboJetEngine commented 11 months ago

I have started experimenting on an implementation for this. I am not sure what will be the end result, but it is interesting for me already...

DumboJetEngine commented 11 months ago

I have a working implementation of this, that also adds some extra tiny features. It appears to work for the queries I tested it on, but I will need to test it some more.

One thing that I had to do was to add a way of cloning IQuery/QueryBase instances, and I have created an extension method that handles each type of query individually, but I would guess that the best thing would be to make IQuery expose a .Clone() method. This, however, will make the code diverge more from the original ElasticSearch code (but maybe that is OK?) and it will also not keep the changes contained in a small portion of the code, but they will be all over the place. If you want me to add the .Clone() method, let me know.

I am planning to add some tests as well, but I have various issues with Visual Studio not running the tests.

There were also some "challenges" with preserving the scoring behavior with Filter vs Must, but I think this is of smaller importance when you try to port SQL queries to O.S. .

This is how it works now:

var queryChangeLogStringBuilder = new StringBuilder(4 * 1024);
var queryResponse = openSearchClient.Search<Book>(s => s
  .Index(documentsIndexName)
    .Query(q => q
      .Bool(bq => bq
        .Filter(rq =>
        {
            var query = (
                    (rq.Term(d => d.Tags.First().Name, "cats") || rq.Term(d => d.Price, 11))
                    &&
                    rq.Term(d => d.Tags.First().Name, "dogs")
                )
                .Name("cats-and-dogs")
                .Strict(true)
                .TransformToRelationalEquivalent<Book>(queryChangeLogStringBuilder);

            return query;
        })
      )
    )
  );

[OpenSearchType(IdProperty = "Id")]
public class Book
{
    public int Id { get; set; }
    public string? Title { get; set; }
    public int Pages { get; set; }
    public decimal Price { get; set; }

    [Nested]
    public Tag[] Tags { get; set; } = new Tag[0];

    [Nested]
    public Meta[] Meta { get; set; } = new Meta[0];

}

public class Tag
{
    public string? Name { get; set; }
    public int Weight { get; set; }
}

public class Meta
{
    public string? Name { get; set; }
    public string? Value { get; set; }
}
DumboJetEngine commented 10 months ago

@Xtansia Is it OK if I send, in one commit and one PR, the implementation for this issue ( #446 ), #380, and #375 ?

Xtansia commented 10 months ago

@DumboJetEngine for ease of review and change tracking purpose, it would best that they're separate PRs please

DumboJetEngine commented 10 months ago

@Xtansia OK. Done. If you can review and merge the two small PRs I have just sent, so that I can resolve any conflict for the 3rd, it would be great.