kofrasa / mingo

MongoDB query language for in-memory objects
MIT License
952 stars 83 forks source link

collation support #100

Closed johnnymaxwell closed 5 years ago

johnnymaxwell commented 5 years ago

Standard sorting returns upper case words before any lower case ones. For example,

db.zort.find({},{_id:0}).sort( {name: 1} ); { "name" : "Bob" } { "name" : "Tom" } { "name" : "alice" } { "name" : "peggy" }

MongoDB supports the collation operation. For example,

db.zort.find({},{_id:0}).collation({locale: "en" }).sort( {name: 1} ); { "name" : "alice" } { "name" : "Bob" } { "name" : "peggy" } { "name" : "Tom" }

stefanholzapfel commented 5 years ago

That really is a show stopper. Any updates on this?

Most important is the possibility to change the "strength" attribute on the collation. That would lead to case-insensitive sorting.

Thank you!

johnnymaxwell commented 5 years ago

I have an example of collation support coded up and commented in a new mingo.js file. I am not sure how to extract the pieces into the separate lib files and rebuild the distribution though. I'm sure someone out there can do this in their sleep. I can attach the mingo.js file next time if so.

kofrasa commented 5 years ago

Thanks for the work @johnnymaxwell. I wish I could get to this soon but I haven't had the time. Since you already have something down, I am happy to point you in the right direction.

In mingo the collection operations are implement by operators from the aggregation pipeline and used by the Cursor object.

MongoDB allows specifying some options for the aggregation pipeline pipeline which mingo has not implemented yet. This includes an optional collation option defined as;

collation: {
   locale: <string>,
   caseLevel: <boolean>,
   caseFirst: <string>,
   strength: <int>,
   numericOrdering: <boolean>,
   alternate: <string>,
   maxVariable: <string>,
   backwards: <boolean>
}

I have added general support for aggregation pipeline options in b87ce4f1. The collation setting is available but currently a no-op.

example:

// will sort as it currently does.
mingo.find(collection, {}, {_id:0}).collation({locale: "en" }).sort( {name: 1} )

To make it effective the $sort operator needs to be updated. If you have an implementation, you can patch that function by using the collation setting obtained from opt['collation'].

It will be great if your PR could address the full requirement as specified here https://docs.mongodb.com/manual/reference/method/db.collection.aggregate/ nonetheless just handling the "locale" would already be a lot of progress.

stefanholzapfel commented 5 years ago

I guess for most people the "strength" and "caseLevel" parameter is even more important. (https://docs.mongodb.com/manual/reference/collation/)

In my MongoDB instance I sort using the "en" locale even though I use german special characters. Since it's unicode it seems to sort even non-english special characters correctly. But the sort in Mongo DB is by default case-sensitive (who's idea was that? :))

kofrasa commented 5 years ago

More about collation option fields here https://docs.mongodb.com/manual/reference/collation/#collation-document-fields

johnnymaxwell commented 5 years ago

But the sort in Mongo DB is by default case-sensitive (who's idea was that? :))

Yes indeed? Someone was engineering minded, not end-user minded. Oh well...

Thank you two for your guidance. In my proof of concept I mapped MongoDB strength to JavaScript sensitivity using https://docs.mongodb.com/manual/reference/collation/#collation-document-fields and https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare respectively. For now sensitivity 1 => 'base', 2 => 'accent', 3 => 'case', and default => 'base'. I added $collation: $collation to pipelineOperators which seems to be what kofrasa wants anyway per above. You are initially constrained by what JavaScript provides. The proposed change isolates the pipeline nuances per the link kofrasa noted into a single function. The next step is to adjust for the broader aggregation pipeline semantics. I am English speaking so any nuances would be informative. Thursday Eastern US time I'll give the PR a try. Thanks for your interest. I've written multiple enterprise libraries used by the Fortune 500 and had 2 tech buyouts. Mingo is an awesome achievement kofrasa! Many thanks for your effort from the MongoDB community.

kofrasa commented 5 years ago

Thanks @johnnymaxwell, I really appreciate that others find the project useful.

I think you are on the right track with regards to using string.localCompare. The options available seem to match most of the requirements in MongoDB. $collation is not an operator by itself and so a separate function is not required per se unless to make things cleaner.

Currently, sortBy is where the actual sorting happens. I think it will be neat to pass a comparator function from $sort which is collation aware to sortBy which will be used for comparing. The function is not exposed so the last ctx argument could be removed in favour of the comparator.

johnnymaxwell commented 5 years ago

Thank you for the feedback. I followed the current execution flow pattern AFAIK. At a high level I have the new pipelineOperator,

function $collation(collection, collationSpec) {
   this.collationSpec = collationSpec; // Remember this for this Qquery instance in $sort
    return collection;
}

It may be called from Cursor.collation(...). Then in $sort I look for the collationSpec. If found then the brand new collatationSort() function is called. It uses the MongoDB to JavaScript mapping mentioned above. As you noted in your code, nulls are less than everything. To complete the execution flow pattern collationsSort's last line is

return Lazy(sorted); // MUST return an iterator 

Here is the basic find() flow,

let cursor = query.find(collection);
cursor = cursor.collation( {locale: "en" })
cursor = cursor.sort( {last: 1, first: 1} );
var result = cursor.all();

I hope to complete things when I get back next week.

johnnymaxwell commented 5 years ago

I forked the project and started looking at your latest code as of 9 days ago. The collation research was done in a complete mingo.js file of 3+ weeks ago. This was my first attempted PR, so I had challenges mapping the marked changes into the separate source files and even building things.

I saw in your new sort.js file,

export function $sort (collection, sortKeys, opt) {
  if (!isEmpty(sortKeys) && isObject(sortKeys)) {
    opt = opt || {}
    let cmp = compare
    let collationSpec = opt['collation']
    if (isObject(collationSpec)) {
      // TODO: create a new "cmp" based on the collationSpec.
    }
...

It seems that you are allowing a new parameter to the $sort pipeline operator. The execution flow I put together had a $collation operator in the pipeline prior to the $sort operator. That seems to be the standard MongoDB aggregation style. Any arguments from the $collation stage are later used by the $sort stage.

I thought the building up of the indexKeys in mingo did not fit well into the desired collation behavior. As a result, I created a separate function (collationSort) that handles multi-field sorting with collation awareness.

Given these nuances, I do not want to make an autocratic design decision since this is your project that you know so well. Attached is the mingo.js file with the draft changes marked with '// <== JMAX' to start a change and '// JMAX ==>' at the end of the change. There are only 5 main changes. Searching for 'JMAX' will be sufficiently unique. I need some experience in the PR lifecycle. A summary of the changes for your review is below. I hope you and the community find this addition useful to an already very useful project.

function collationSort(collection, collationSpec, sortKeys)

function $sort(collection, sortKeys)

function $collation(collection, collationSpec)

var pipelineOperators = {...}

Cursor now has a collation function.

mingoWithCollationExample.zip

kofrasa commented 5 years ago

Hi @johnnymaxwell, thanks so much for the contribution. The general idea behind your implementation looks ok, but it seems to be doing a lot of work that is already taken care of.

I have not seen $collation as a separate operator in MongoDB aggregation docs so a new operator might be confusing if we expose that in the API but might be ok as an internal function. That was the idea behind the TODO.

I have cleaned up things a little and worked it into the current sort function along with some tests. You can add the tests into your fork to validate whether things are working as they should.

Again thanks a lot for the contribution.

stefanholzapfel commented 5 years ago

Hi guys!

@johnnymaxwell @kofrasa

Thanks very much for your work.

I have tried to use the collation option with the aggregation framework, but as soon as I add the options object (as specified in the mongo docs), it doesn't sort anymore:

const mingoSorter = new (mingo as any).Aggregator(
   [
      { $sort: sorter }
    ],
    {
      collation: {
         locale: 'en',
         strength: 1
      } 
   }
);
result = mingoSorter.run(result);
stefanholzapfel commented 5 years ago

Sorry my bad, I had a problem in my sorting syntax.

Sorting strings seems to work flawless. But when adding the collation, numbers are not sorted anymore? Is that possible or should I further investigate my code? :)

kofrasa commented 5 years ago

Hi @stefanholzapfel.

Please can you provide a working snippet with sample input using runkit, and the expected results?

stefanholzapfel commented 5 years ago

@kofrasa Thanks for the quick response!

Two issues that I can reproduce with runkit:

1.) Performance is very poor when using locale on string sort: https://runkit.com/stefanholzapfel/5c8ff4e6204a0b0011d98e25

2.) Sorting for numbers with locale throws error (a.localeCompare is not a function): https://runkit.com/stefanholzapfel/5c8ff5aba4e5ef0013bd6700

kofrasa commented 5 years ago

I already have a fix for issue 2.

I will have to investigate issue 1 a bit more.

kofrasa commented 5 years ago

There is an error in the first example of runkit.

stefanholzapfel commented 5 years ago

Sorry, I was not aware that my last change in runkit was auto-saved....data reactivity sometimes sucks :)

Should run now.

kofrasa commented 5 years ago

I made an update to compare against native comparison and made two observations.

First, the performance of mingo sort is not great. Not desirable but it is good news as it shows there is room for improvement which I will look at soon.

Second, which is bad news is that localeCompare sucks even when used independently with the native sort function. mingo's implementation is somehow faster from what I see. The bad performance dwarfs any slow downs coming from mingo itself.

See example here. https://runkit.com/embed/1es488dnl6xl

stefanholzapfel commented 5 years ago

Yah it's obviously the localeCompare that is so slow.

The normal mingo performance is more than fine for me.

stefanholzapfel commented 5 years ago

The issue is discussed here: https://stackoverflow.com/questions/14677060/400x-sorting-speedup-by-switching-a-localecompareb-to-ab-1ab10

Last response is maybe worth a try?

kofrasa commented 5 years ago

Thanks for the link. The regex strategy might be worth trying. For now I have pushed 2.3.2 to address the error sorting non-string values.

stefanholzapfel commented 5 years ago

@kofrasa Just wanted to make a pull request, but I have seen you're already working on it in dev branch.

  const collator = new Intl.Collator(spec.locale, localeOpt);

  return function (a, b) {
    // non strings
    if (!isString(a) || !isString(b)) return compare(a, b);

    // only for strings
    //var i = a.localeCompare(b, spec.locale, localeOpt);
    let i = collator.compare(a, b);
    if (i < 0) return -1;
    if (i > 0) return 1;
    return 0;
  };

Declaring the collator in advance improves performance by factor 5 for me. Not brilliant though, but better than nothing.

kofrasa commented 5 years ago

Thanks @stefanholzapfel. Yes I already added that change taken from the StackOverflow article you sent earlier. In my tests the performance improvement was at least 10x. I also shaved off about 50ms from normal sorting. The benchmark uses the code you provided. Both these changes are available in mingo@2.3.3.

MINGO SORT WITH LOCALE: 64.662ms
MINGO SORT WITHOUT LOCALE: 42.947ms
NATIVE SORT WITH LOCALE: 974.475ms
NATIVE SORT WITHOUT LOCALE: 2.483ms
kofrasa commented 5 years ago

The native sort version uses a.localeCompare(b) while mingo uses Intl.Collator

kofrasa commented 5 years ago

https://runkit.com/embed/258yfy63q117