marklogic-community / marklogic-samplestack

A sample implementation of the MarkLogic Reference Architecture
Apache License 2.0

Related Tags Feature #82

Closed grechaw closed 9 years ago

grechaw commented 10 years ago

Semantics requirements have been distilled into the notion of related tags, which will be implemented as part of MS-3.

grechaw commented 10 years ago

Kicking this task as it doesn't seem to be in bugtrack

grechaw commented 9 years ago

Looking to push this feature to 8.0-2 as it seems like a stretch right now.

wooldridge commented 9 years ago

I'm implementing the related-tags feature in the browser layer.

Currently when I retrieve related tags, I get frequencies that vary for the same related tag depending on the origin tag. This doesn't seem right given the spec. For example:

relatedTo: "jquery" gives me _value: "javascript", frequency: 102

relatedTo: "html" gives me _value: "javascript", frequency: 91

The non-tag search criteria are the same in these instances, so I'd expect to get the same frequency value in each case (the frequency value being the number of results that I will get when I select javascript and the page updates). @grechaw, thoughts?

grechaw commented 9 years ago

you are right about the reason why they vary. Let me consider this

search A. "marklogic tag:javaScript"
search B. "marklogic tag:html"

So to get the actual tags displayed for relatedTo, we need to use the tag: search, but the value used should be that of the number of documents having the tag -- not the number of documents in the filtered search.
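
A minimal sketch of the two counts being contrasted here, using a toy in-memory corpus. The data and the function names `filtered_count` and `tag_count` are hypothetical and are not Samplestack code; `DOCS` stands in for the documents that already match the non-tag search criteria.

```python
# Toy corpus (hypothetical data): each document is the set of tags attached to it.
# Assume every document here already matches the non-tag criteria (e.g. qtext).
DOCS = [
    {"jquery", "javascript"},
    {"jquery", "javascript", "html"},
    {"html", "javascript"},
    {"html", "javascript", "css"},
    {"html", "css"},
    {"javascript"},
]


def filtered_count(docs, origin_tag, related_tag):
    """Count documents carrying both the origin tag and the related tag."""
    return sum(1 for tags in docs if origin_tag in tags and related_tag in tags)


def tag_count(docs, related_tag):
    """Count documents carrying the related tag, whatever the origin tag was."""
    return sum(1 for tags in docs if related_tag in tags)
```

With this data, `filtered_count` gives 2 for "javascript" when the origin tag is "jquery" and 3 when it is "html", while `tag_count` gives 5 in both cases, which is the behavior being asked for.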

Got it.

So when I look at the 1.0 wireframes, page 12, I see counts that behave as they are now. Can you point me to spec otherwise? On that page, I see a count for "necks (50)" which when selected is 185, which looks as though the counts differ based on the search history.

yawitz commented 9 years ago

My bad, I think the wireframes are wrong on this detail. The count in the popup should show 185, not 50, since that would be the count of that tag with no tags selected (i.e. the shadow count). Let me think this through to make sure I'm right. (If I am, is this fixable, as per Mike's comment?)

EDIT: Specifically, the text alongside the wireframes describes what Mike calls out, but the picture is incorrect.

grechaw commented 9 years ago

Well, sure it's fixable. It's a quick extra call to the indexes to grab this number, so it's not quite default MarkLogic behavior, but it's a nice feature and shouldn't be hard to implement.

Given it's 'just a number' it seems it shouldn't block browser development. I can get that number to change in the middle tier (and since it will be in a transform, this will fix both node and java)

yawitz commented 9 years ago

Great. I'll update the wireframe image to match.

wooldridge commented 9 years ago

FWIW, here's what my in-progress UI looks like using the tags endpoint. When you click on "javascript" in the related tags list, you execute a search for the tag "javascript" and it returns 1041 documents (which is different from the frequencies in the tags list and the related-tags list). [screenshot: 20150218_relatedtags]

yawitz commented 9 years ago

The number in the list will be different from the number in the popup, since the former is ORed with the current selection, while the latter replaces the current selection. (The numbers should match if there is no tag selection in the sidebar to start.)
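
One way to picture the OR-versus-replace distinction, as a sketch over a toy corpus. The data and the names `count_or` and `count_replace` are hypothetical illustrations, not the actual sidebar or popup code.

```python
# Toy corpus (hypothetical data): each document is the set of tags attached to it.
DOCS = [
    {"jquery", "javascript"},
    {"jquery", "javascript", "html"},
    {"html", "javascript"},
    {"html", "javascript", "css"},
    {"html", "css"},
    {"javascript"},
]


def count_or(docs, selected, candidate):
    """Sidebar-style count: the candidate tag is ORed with the current selection."""
    wanted = set(selected) | {candidate}
    # A document matches if it carries at least one of the wanted tags.
    return sum(1 for tags in docs if tags & wanted)


def count_replace(docs, candidate):
    """Popup-style count: the candidate tag replaces the current selection."""
    return sum(1 for tags in docs if candidate in tags)
```

With "css" selected, `count_or` for "javascript" is 6 while `count_replace` is 5. With an empty selection the two agree at 5, matching the parenthetical above.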

wooldridge commented 9 years ago

Interestingly, even if you don't have the "jquery" tag selected above, you still get a javascript frequency of "91" in the related-tags list currently. I guess I expect in that case to have "1041" as the frequency?

yawitz commented 9 years ago

Well, that sounds wrong. Selecting the first tag using the sidebar should produce the same result as selecting that same tag via the popup.

wooldridge commented 9 years ago

Here's a screen shot for that scenario. [screenshot: 20150218_relatedtags2]

laurelnaiad commented 9 years ago

If you think the wrong data are being returned from the middle-tier (I can't quite tell if that's what you're thinking), then it would be really helpful to post the JSON that is being sent to the server and the JSON that is being returned so that we can look at it raw and discuss the middle-tier behavior at the level at which it is communicating...

wooldridge commented 9 years ago

The wireframe spec needs to be tweaked to show the related-tag count matching the count realized after the related tag is clicked (@yawitz is doing this). The middle tier tags endpoint then needs to be updated to support this (@grechaw is doing this). Once this happens, the UI should just work.

@laurelnaiad, here's the JSON in:

{"search":{"qtext":[""],"timezone":"America/Los_Angeles","start":1,
"pageLength":10,"relatedTo":"html","sort":"frequency"}}

Here's the JSON out:

{"values-response":{
"name":"tags","type":"xs:string","aggregate-result":[{"name":"count","_value":"1025"}],
"metrics":{"values-resolution-time":"PT0.03854S",
"aggregate-resolution-time":"PT0.000777S","total-time":"PT0.500191S"},
"distinct-value":[{"frequency":144,"_value":"xquery"},
{"frequency":111,"_value":"marklogic"},{"frequency":93,"_value":"xml"},
{"frequency":91,"_value":"javascript"},{"frequency":91,"_value":"json"},
{"frequency":53,"_value":"jquery"},{"frequency":30,"_value":"html5"},
{"frequency":26,"_value":"php"},{"frequency":25,"_value":"html"},
{"frequency":24,"_value":"ajax"}]}}
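
For anyone checking these numbers by hand, a small sketch that pulls the per-tag frequencies out of a response shaped like the one above. The payload is abridged from that response, and the helper name `frequencies` is hypothetical.

```python
import json

# Abridged copy of the values-response shown above (metrics and some tags omitted).
RESPONSE = """
{"values-response": {
  "name": "tags", "type": "xs:string",
  "aggregate-result": [{"name": "count", "_value": "1025"}],
  "distinct-value": [
    {"frequency": 144, "_value": "xquery"},
    {"frequency": 91, "_value": "javascript"},
    {"frequency": 25, "_value": "html"}
  ]}}
"""


def frequencies(payload):
    """Map each tag in a values-response JSON string to its reported frequency."""
    body = json.loads(payload)["values-response"]
    return {item["_value"]: item["frequency"] for item in body["distinct-value"]}
```

Here `frequencies(RESPONSE)["javascript"]` is 91, the value under discussion for relatedTo "html".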

grechaw commented 9 years ago

I understand the expected and actual -- it was just hard to look at the data and verify it (for me) before Mike put the UI in front of it.

wooldridge commented 9 years ago

@grechaw great! I'm happy you can bend it to our will.

grechaw commented 9 years ago

The remaining issue that Mike has found is actually somewhat tricky/nasty to implement. I may have to rewrite the extension. The functionality under consideration is not directly supported in MarkLogic, and we need to write a bunch of (fast) parallel queries to get each of these frequencies.
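
A rough sketch of the "one small query per tag, run in parallel" idea, written client-side in Python purely for illustration. The function `count_documents_with_tag` is a hypothetical stand-in for a single fast count query; the actual work described here would happen inside the server-side extension.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy corpus (hypothetical data) standing in for the database.
DOCS = [
    {"jquery", "javascript"},
    {"jquery", "javascript", "html"},
    {"html", "javascript"},
    {"html", "javascript", "css"},
    {"html", "css"},
    {"javascript"},
]


def count_documents_with_tag(tag):
    """Stand-in for one small count query (hypothetical, not a MarkLogic API)."""
    return sum(1 for tags in DOCS if tag in tags)


def related_tag_frequencies(related_tags, max_workers=4):
    """Run one count query per related tag concurrently and collect the results."""
    related_tags = list(related_tags)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as its input.
        counts = list(pool.map(count_documents_with_tag, related_tags))
    return dict(zip(related_tags, counts))
```

The trade-off is one extra query per related tag, which is why the comment stresses that each query has to be fast.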

popzip commented 9 years ago

Depending on how significant this effort is, and how 'unnatural' it is for use of semantics, we may need to revisit the requirements. Is it the counts specifically that's driving the difficulty? Is getting counts during a semantic query like this not our usual recommendation for how to implement semantics? I can do more digging to make sure this aligns with more of a 'typical' semantics use case if I need to. Remember, we designed the requirements based on what we thought would be a good demonstration of functionality, so if it no longer meets that ease of use / simplicity of demonstration we can rework the requirements in this case to fit.

laurelnaiad commented 9 years ago

Our team discussed the possibility of amending the specs to remove the need for up-front knowledge of related tags counts in the context of main search results (i.e. that the hover operation wouldn't reveal anything about whether there are related tags or how many of them there are).

Given that we're aiming for a custom overall search endpoint in order to support both the dynamic bucketing feature and improve overall search performance by consolidating shadow queries, there seems to be a looming rewrite of at least that endpoint. If we want to do something with this, we could lump it into that endpoint, too...

grechaw commented 9 years ago

I figured it out and will have a fix. I am gutting how the feature is implemented, but the new way is better -- it does all the processing within the server-side extension rather than using an extend-then-search approach.

The one thing I'm going to have to look into, oddly enough, is security. I need to make sure that the extension runs with the right permissions with regard to documents, so that the guest user sees counts for documents they can actually see.
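
To make the permissions concern concrete, a toy sketch in which each document lists the roles allowed to read it. The data model and the name `visible_count` are hypothetical; MarkLogic enforces document permissions itself, and this only shows why the same tag can have different counts for different users.

```python
# Toy corpus (hypothetical data): tags plus the roles allowed to read each document.
DOCS = [
    {"tags": {"marklogic"}, "read_roles": {"guest", "contributor"}},
    {"tags": {"marklogic", "xquery"}, "read_roles": {"contributor"}},
    {"tags": {"xquery"}, "read_roles": {"guest", "contributor"}},
]


def visible_count(docs, tag, role):
    """Count documents that carry the tag and are readable by the given role."""
    return sum(
        1 for doc in docs if tag in doc["tags"] and role in doc["read_roles"]
    )
```

In this toy data a guest sees 1 document tagged "marklogic" while a contributor sees 2, so a count computed with contributor permissions would overstate what a guest gets after clicking.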

popzip commented 9 years ago

Okay glad it sounds like you have a good solution. Just keep in mind if you are having to do really strange things that don't seem like 'best practice' let's reconsider whether this is the best way to show off semantics.

grechaw commented 9 years ago

@laurelnaiad that sounds good -- this solution is simplifying related tags in the service layer, so it's independent of whatever work we do for endpoints.

laurelnaiad commented 9 years ago

> if you are having to do really strange things that don't seem like 'best practice' let's reconsider whether this is the best way to show off semantics

I think extensions can often be the best way to get the most out of MarkLogic and to create efficient architectures. I think we should embrace the extensibility within the reference architecture! :smile: :+1:

grechaw commented 9 years ago

OK, I'm in really good shape but I've uncovered a bug (a gap really, as setting permissions of the graphs endpoint was never supported) that means that only samplestack-contributor can see triples. The guest user cannot.

There are workarounds, but it involves some hackitude to set permissions on the triples documents such that the guest can see them. On the other hand, setting permissions on graphs is something I've specified for the 8+ time frame, so there is at least a fix in the offing for this gap.

So, choice one: modify requirements such that only ss-contributor can see related tags. Pro: done. Con: not

Choice two: always call related tags with samplestack-contributor. Pro: easy. Con: counts will be inaccurate for guest, because they will reflect the number that a contributor would see were they to log in. So still a bug.

Choice three: write some extension that will set the permissions on triples. Pro: makes app work as designed. Con: outside the box of MarkLogic offerings, a slightly hackish workaround. Not done yet.

In any case, I'm nearing delivery of this so Mike can try it out.

grechaw commented 9 years ago

Eureka!

I can use a call to eval to solve this trivially, and that's an excellent, MarkLogic sanctioned way to do hacks like this :)

grechaw commented 9 years ago

OK Here's a design question that I'd like a little feedback on.

First, this is a known issue, which has been raised as an RFE before. There's also a known pattern for getting this answer, which involves a little, very fast query for each tag in the result, but one best done close to the server. This is all very good.

So to make this simpler as a bugfix, I'd like to slightly alter the return value. Calls to /v1/tags generally get back a "count" total for all the tags. I'd like to repress this value for the call to related tags - as it would be yet another query, and it's clear from the wireframes that the call to get related tags does not need a total. The frequency in this case is as we've agreed -- the number of documents with the given tag in it.

So here's what I'm getting from the new version:

{
    "values-response": {
        "distinct-value": [
            {
                "_value": "data",
                "frequency": 1
            },
            {
                "_value": "marklogic",
                "frequency": 217
            },
            {
                "_value": "metadata",
                "frequency": 2
            },
            {
                "_value": "mongodb",
                "frequency": 4
            }
        ],
        "name": "relatedTags",
        "type": "xs:string"
    }
}

popzip commented 9 years ago

Does your question for workarounds for permissions still hold? Or did you resolve with an extension or the eval?

laurelnaiad commented 9 years ago

So this reflects the fact that when we ask for relatedTags and give it a limit of, say, 100,000, we'd get back all of the related tags (unless there are more than one hundred thousand of them)? Which count is missing? The count of related tags? Because obviously we know the array length you're giving us....

This does seem to not solve the issue in the "More Tags" dialog (i.e. when we don't ask for tags related to some other tag). We still have no way to know how many pages there are (unless we put the limit at 100,000, which might not be very friendly to the servers). But that issue is already present.

Are these truthful statements? @yawitz @wooldridge am I on the right page?

laurelnaiad commented 9 years ago

> There are workarounds, but it involves some hackitude to set permissions on the triples documents such that the guest can see them.

So the issue is that you need to give guests access to the triples? Why shouldn't a guest be allowed to read all the triples?

grechaw commented 9 years ago

On permissions -- The issue I ran into is in the REST API -- the REST API for semantics provides +no permission-setting mechanism+, and unless we set permissions in an alternate way (see below), then guest sees no triples, and hence no related tags. The guest should definitely be able to see the triples.

I wrote a function as part of dbload, a little piece of server-side JS, which sets the permissions for the triples appropriately using builtins. This little chunk of code is run as an eval. The not-best-practice part of this approach is that I've added a privilege to samplestack contributor that ideally they would not have. It's only to run this workaround that ss-contributor has to have the xdmp:eval and xdbc:eval privileges.

It will take some time before REST supports non-default permissions on /v1/graphs (it's in 8+ planning, presumably 8.0-3). Once that's in place, this part of the code can be taken out.

yawitz commented 9 years ago

Re: https://github.com/marklogic/marklogic-samplestack/issues/82#issuecomment-76276003 (3 comments back), I'm not sure what counts we're talking about here. What we need for the Related Tags UI is the document count for each related tag, not the count of related tags (which, in any case, would probably always be small, right?). This would have nothing to do with the problem of tag count needed for the "more tags" dialog.

grechaw commented 9 years ago

As far as count goes -- I was referring to a value that comes back to the browser for most tags calls -- a total number of documents across all tags. That JSON key, called "aggregate-count", is not in the above payload... otherwise it looks like other calls to /v1/tags.

(what Mitch said otherwise)

wooldridge commented 9 years ago

Just to restate to make sure I know what's going on...

  1. We shouldn't need a tag count returned when we make a relatedTo call to the /v1/tags endpoint (for the related-tags popup).
  2. We still need a tag count returned when we make a forTag call to the /v1/tags endpoint (for the all-tags dialog).

So I believe what @grechaw and @laurelnaiad are saying is correct. And @grechaw is just saying he will do #1, i.e. not show the tag count for the relatedTo call.
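
On the consuming side, a sketch of how a client could tolerate both response shapes: one that carries the aggregate count and one that omits it. The helper name `total_count` is hypothetical, and the field names are taken from the payloads posted earlier in this thread.

```python
def total_count(values_response):
    """Return the aggregate "count" as an int, or None when the response omits it.

    `values_response` is the object found under the "values-response" key.
    """
    for aggregate in values_response.get("aggregate-result", []):
        if aggregate.get("name") == "count":
            return int(aggregate["_value"])
    return None
```

For a response like the earlier tags output this returns 1025. For the new relatedTags output, which has no "aggregate-result" entry, it returns `None`, so the related-tags popup can simply skip showing a total.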

laurelnaiad commented 9 years ago

Curious... why isn't the admin user the one whose credentials are used to populate in dbLoad and thus the one generally to be setting permissions?

Also, why eval if you've made a REST extension to patch the permissions on the graph? Wouldn't that be an http call to something that only the admin user can hit, and which uses an amped function?

Perhaps there are gaps in the api surface area in 8.0 that preclude this or you just don't think those techniques are a good way to teach?

grechaw commented 9 years ago

delivered a patch to Mike to try out.

wooldridge commented 9 years ago

I merged the work by @grechaw with my UI work here: https://github.com/wooldridge/marklogic-samplestack/tree/related-tags

The related-tag numbers now look right (e.g., you see the "xquery (267)" related tag across different source tags, with the same "267" frequency for all).

The response time seems a bit slow in my setup, a couple of seconds to get the related tags for a tag.

@grechaw, please take a look.

grechaw commented 9 years ago

Will do, thanks!

grechaw commented 9 years ago

We noted yesterday that there is still a bug in the implementation that Mike most recently demonstrated. The related tags are not using the search criteria from the filtered/main query. I'm not able to focus on this remaining issue immediately, but it's certainly fixable for 1.1.0

laurelnaiad commented 9 years ago

It looked like the browser was omitting the query text from the tags search when doing related tags...

wooldridge commented 9 years ago

I've updated the related-tags branch so that calls to the tags endpoint for related tags pass all existing search criteria (except for any selected tags). For example, here is a call for tags related to "html" (includes qtext and date range):

{"search":{"qtext":["java"],"start":1,"query":{"and-query":{"queries":[
{"range-constraint-query":{"constraint-name":"lastActivity","value":"2010-12-01T08:00:00.000Z",
"range-operator":"GE"}},{"range-constraint-query":{"constraint-name":"lastActivity",
"value":"2014-04-01T07:00:00.000Z","range-operator":"LT"}}]}},"timezone":"America/Los_Angeles",
"pageLength":100,"relatedTo":"html","sort":"frequency"}}

However, the frequencies that come back don't change with different search criteria. For example, the "json" related tag always has a frequency of 142. I think this is what Charles is referring to, so will pass this bug back to him.
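
The request shape described in this comment can be sketched as follows. The client-side `criteria` dictionary, its `selected_tags` key, and the function name `related_tags_request` are all hypothetical; only the keys inside "search" mirror the payload shown above.

```python
import copy


def related_tags_request(criteria, related_to, page_length=100):
    """Build a related-tags request that keeps every criterion except selected tags."""
    # Copy everything except the (hypothetical) client-side tag selection.
    search = {
        key: copy.deepcopy(value)
        for key, value in criteria.items()
        if key != "selected_tags"
    }
    search.update(
        {
            "start": 1,
            "pageLength": page_length,
            "relatedTo": related_to,
            "sort": "frequency",
        }
    )
    return {"search": search}
```

For example, calling it with criteria holding `qtext` of `["java"]`, a timezone, and a selected tag of "jquery" yields a payload whose "search" object keeps the qtext and timezone, sets "relatedTo" to the requested tag, and carries no tag selection.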

grechaw commented 9 years ago

This is perfect -- the browser sending a query that middle tier doesn't have to munge is good info. Getting the counts right in this scenario will be more straightforward.

laurelnaiad commented 9 years ago

Right! :) https://github.com/marklogic/marklogic-samplestack/issues/401#issuecomment-69442387

grechaw commented 9 years ago

@wooldridge and I have been working toward the right behavior and he provided the logic I needed to get the counts right. The PR containing the Java fixes will come alongside those for the Node.js tier.

gghai commented 9 years ago

Umbrella task, tested on app on 3000 and 8090. Marking this as done.