vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Unclear "namespacing" and YQL in docs #5617

Closed jazoom closed 6 years ago

jazoom commented 6 years ago

I've spent the better part of today playing around with YQL and trying to work out what all the different forms of "namespacing" refer to. I think I've read everything in the docs that could possibly be related to this (multiple times) and I'm still not clear on a few things.

Take this URL:

/search/?sources=VAR1&restrict=VAR2&yql=select+%2A+from+VAR3+where+name+contains+%22text%22%3B

/search/?sources=music&restrict=music&yql=select+%2A+from+music+where+name+contains+%22text%22%3B

I guess, for all the examples in your tutorials, all 3 VARs should be music. From playing around with the queries I was able to determine VAR3 could be the name of the document type. If it refers to a non-existent document type the query returns an error. VAR3 is the "sources" parameter in YQL. But then the cluster ID and the two first lines of the search definition are also "music", so I suppose it could come from either of those too.

FROM specifies which document sources to search, it is handled in a similar way to Vespa's "sources" parameter.

What exactly is a "source"? Is it the cluster ID set in services.xml? Your documentation says it's similar to the "sources" query parameter but does not say how it is different. I initially thought they were two ways of saying the same thing but when I change VAR1 it actually doesn't alter the result at all. By that I mean the correct document is still returned, but if VAR1 refers to an incorrect document type then there is also an extra errors field returned (along with the correct result). What is the point of having both of these?

More confusion:

Every single example I could find in your docs is of the form: http://hostname:8080/document/v1/music/music/docid/Michael-Jackson-Bad

I'm referring specifically to the /music/music/ part. You don't give any example of them being different so I can't figure out what they actually refer to. I assume the second /music/ refers to the document type.

In the tutorial it mentions that in the search definition file the first two lines should contain the same name:

search music {

    document music {

...

Is it here that the first /music/ is chosen or do they always just need to be identical here?

In services.xml you have this:

<content id="music" version="1.0">
    <documents>
      <document type="music" mode="index"/>
    </documents>
    <nodes>
      ...
    </nodes>
</content>

Is it here where the first /music/ comes from? If I change the id of the content element to cluster, will the document URL become http://hostname:8080/document/v1/cluster/music/docid/Michael-Jackson-Bad?

It gets more confusing

Okay, so I could probably eventually figure that out through trial and error of changing the application (and I've already done heaps of that in trying to get my queries working), but I don't even see how that fits into search URLs. Is the first /music/ the same as the first sources in the search URL, the same as the second sources, the same as neither of them, or the same as both of them?

Application namespacing

So what if I have a second application running on the same nodes. How do I specify to search only documents from that application? I've read that the application name is derived from the directory name of the application's files, but haven't seen anywhere in your docs the actual use of that name.

Some of my experimenting gave confusing results

In one of my experiments I sent a POST request to insert a document at http://hostname:8080/document/v1/application/music/docid/Michael-Jackson-Bad. The confusing part is that it was accepted. I then had two different documents, one at http://hostname:8080/document/v1/music/music/docid/Michael-Jackson-Bad and one at http://hostname:8080/document/v1/application/music/docid/Michael-Jackson-Bad. I have to say, I don't really understand what that first /music/ is doing.

I kinda feel stupid not being able to figure all this out, and surely I would eventually figure it all out with heaps more trial and error, but I suspect if this is not clear to me after hours of research and testing it will also not be clear to at least one other person.

kkraune commented 6 years ago

Wow, this is great feedback! Thanks for taking the time.

I agree this is confusing, and the examples should be better - it would help if not everything was music.

I will go through the docs and replace names, indicating namespace and document types to avoid confusion. The sources/restrict part is also vague, and I will try to fix the docs. It will take me a few days, but the next guy will thank us :-)

bratseth commented 6 years ago

Yes, we shouldn't use the same name for different things in examples. (Another thing we should not do is use names which can be mistaken for built-in identifiers, such as "cluster". In some cases I tend to use e.g. "mycluster" instead.)

A few other clarifications:

jobergum commented 6 years ago

Thanks for the great feedback @jazoom 👍 @kkraune It's a lot of doc to go through and sample-apps/feed data etc.

jazoom commented 6 years ago

@bratseth

The "source" parameter can mean either a cluster or a document type. If it means a cluster, the restrict parameter is used to select only some document types of that/those cluster(s).

/search/?sources=VAR1&restrict=VAR2&yql=select+%2A+from+VAR3+where+name+contains+%22text%22%3B

Are you saying that VAR1 is the cluster-id and VAR2 is the document type? How does VAR3 play into this? I'm not sure why VAR1/VAR2 are even a thing when in my testing it seemed that VAR3 is a required field anyway. The docs say VAR3 is a "source" (along with VAR1), but never actually says what that means. I just take it to mean document type. But I guess in the case of VAR1 it means cluster-id.

You probably do not want to deploy multiple applications to the same config server. That mode (multitenant config server) lets you create a Vespa PaaS, but there is a lot to doing that and we have not made any attempt to document it. If you think this is really something you should do let's discuss it further.

I don't think my needs for Vespa will take me far from what you would consider a normal use case. I will not worry about multiple applications then. My impression from the docs was that multiple applications could be a way to namespace different projects. I don't mean the docs said this; it was just me trying to interpret how to namespace different projects.

The other option I thought about was namespacing projects via different content clusters. This is when I really tried to ponder what significance the cluster-id has and how it would play into query URLs. The namespacing would be pointless if it didn't factor into the query URLs.

I also considered that if I used clusters to separate projects then unless I wanted to make a complicated services.xml file I'd need to have all documents in a project have the same search settings (visibility-delay, query-timeout, etc.).

What would you suggest is the way I should be going about this? Some projects just aren't large enough to warrant an entirely different Vespa cluster, which would increase operational complexity.

bratseth commented 6 years ago

The separate parameters "source" and "restrict" predate the YQL query language, and we need them to keep working for those who use those parameters together with a mixture of YQL, other query languages which don't support source internally, and programmatic query building. VAR1 and VAR3 mean the same thing and we take the superset if you use both. VAR2 means something else.
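As a rough illustration of how the three values end up in one request, here is a minimal Python sketch that assembles the search URL from this thread. The function name `build_search_url` and the cluster name `mycluster` are made up for the example; only the `sources`, `restrict` and `yql` parameter names and the YQL shape come from the URL quoted above.

```python
from urllib.parse import urlencode


def build_search_url(sources, restrict, yql_source, field, term):
    """Assemble a /search/ URL like the one discussed in this thread.

    sources     -> the "sources" query parameter (VAR1)
    restrict    -> the "restrict" query parameter (VAR2)
    yql_source  -> the name after FROM in the YQL statement (VAR3)
    Hypothetical helper: values are not escaped for YQL, only URL-encoded.
    """
    yql = f'select * from {yql_source} where {field} contains "{term}";'
    params = {"sources": sources, "restrict": restrict, "yql": yql}
    # urlencode uses quote_plus, so spaces become '+' and '*' becomes %2A
    return "/search/?" + urlencode(params)


if __name__ == "__main__":
    # Assumed names: a content cluster "mycluster" holding document type "music"
    print(build_search_url("mycluster", "music", "mycluster", "name", "text"))
```

Using different names for the cluster and the document type makes it visible which slot each value occupies, which is the point raised earlier about the all-music examples.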

This is when I really tried to ponder what significance the cluster-id has and how it would play into query URLs.

I see. It does not, because another typical use case is to search multiple content clusters (sources) in the same query. If you want URL paths to designate different sources, you can deploy your own handlers and bind them to the paths you want: http://docs.vespa.ai/documentation/jdisc/developing-request-handlers.html (These could subclass or forward to the default handler, com.yahoo.search.handler.SearchHandler.)

Some projects just aren't large enough to warrant an entirely different Vespa cluster, which would increase operational complexity.

Yes, there's no low-effort middle ground here - sorry about that. We're in the process of opening up our own Vespa PaaS on AWS to external users, which would provide just that - let me know if you'd be interested in that.

jazoom commented 6 years ago

@bratseth thank you for the clarification. I guess I'll just not use the sources URL query parameter and stick with the YQL FROM sources parameter.

I see. It does not, because another typical use case is to search multiple content clusters (sources) in the same query.

So what exactly is the first music in /music/music/. It looks like something that can be used for namespacing.

I guess I'll just carefully name things in services.xml to create pseudonamespaces. I like to keep things as simple as possible while still achieving the same outcome.

We're in the process of opening up our own Vespa PaaS on AWS to external users, which would provide just that - let me know if you'd be interested in that.

That actually sounds awesome but it might not be for me. I know your PaaS is a bit different, but I vowed several years ago to run my own databases.

Many years ago I was using a DBaaS (Stackmob). After a while PayPal bought them, then with little notice shut it down. I had to quickly find and migrate to another solution for a live app. I chose Parse. After a while Facebook bought Parse and eventually shut it down. They gave 12 months notice, though, so at least I had more time.

That is when I made the vow.

The next database I set up was RethinkDB. It was easy to administer and an excellent database. After a few years the company behind RethinkDB folded.

But you know what? It doesn't matter, because I'm running my own RethinkDB cluster on a private network. It's humming along perfectly and probably will for as long as I need it.

Like I said, your PaaS is a little different, but I don't think I'll ever go back on that vow.

I've also grown quite fond of doing my own sys admin and having databases co-located with apps (I don't actually use AWS for any of their VM products).

Anyway, this is just an interesting anecdote to explain why I'm not super keen on what sounds like it will be an awesome PaaS.

bratseth commented 6 years ago

So what exactly is the first music in /music/music/. It looks like something that can be used for namespacing.

Yes, indeed - that is a namespace for documents. http://docs.vespa.ai/documentation/document-api.html http://docs.vespa.ai/documentation/documents.html
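To make the two path segments concrete, here is a small Python sketch that builds the /document/v1/ path from a namespace, a document type and a user-chosen id, alongside the corresponding document id string. The helper names are invented for this example, and the `id:<namespace>:<document-type>::<user-specified>` form is written from my understanding of the linked documents page rather than from anything stated in this thread, so treat it as an assumption.

```python
from urllib.parse import quote


def document_v1_path(namespace, doctype, user_id):
    """Path used in this thread: /document/v1/<namespace>/<doctype>/docid/<id>."""
    parts = [quote(p, safe="") for p in (namespace, doctype, user_id)]
    return "/document/v1/{}/{}/docid/{}".format(*parts)


def document_id(namespace, doctype, user_id):
    """Assumed document id form: id:<namespace>:<doctype>::<user id>."""
    return f"id:{namespace}:{doctype}::{user_id}"


if __name__ == "__main__":
    # Same document type and user id, two different namespaces
    for ns in ("music", "application"):
        print(document_v1_path(ns, "music", "Michael-Jackson-Bad"))
        print(document_id(ns, "music", "Michael-Jackson-Bad"))
```

Under that assumption the two POSTs described earlier produce two distinct ids, which would explain why both documents were accepted and stored side by side.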

PaaS: Ok, interesting story, and nothing I can say will make any difference here :-)

jazoom commented 6 years ago

Yes, indeed - that is a namespace for documents.

Based on the fact I was able to insert a document at /someslug/music/ and then insert the same document at /music/music/, despite me never specifying anything called someslug anywhere in the application, I guess I can just call it whatever I like, and have an unlimited number of namespaces?

From the link you shared:

Intended to be used to distinguish data from users who share the same Vespa cluster and/or distinguish between different document types in search. It is hence possible for various applications to use the same Vespa installation, ensuring they do not create document identifier collisions.

Well, distinguishing between document types is the role of the second part of that slug for REST and there is already a way in YQL to distinguish between document types as we were discussing earlier, so I'm not sure how this namespace helps that. Distinguishing between applications is exactly what I want, as we discussed before.

So is a possible solution just to handle namespacing at index time for each document? I can see how that would work for CRUD but where does this namespace fit into YQL?

bratseth commented 6 years ago

I guess I can just call it whatever I like, and have an unlimited number of namespaces?

Yup.

Namespaces are for the case where you want to separate documents even though they use the same document type.

Separating applications on the query side

The first two can be done in config by setting the parameter in the default query profile. The last can be done by a Searcher component.

jazoom commented 6 years ago

@bratseth Thank you for the clarifications and for your patience.

I think I'll go with different content clusters for now since that seems to be the simplest and satisfies my current needs.

I think I've finally got clear in my head how query sources, restrict and YQL sources work. I'll just ignore query sources, always set restrict to the document type I am looking for and always set YQL sources to be the cluster-id assigned to the external application making the call. I experimented just now and indeed errors are returned by Vespa (and no search results) if either of those 2 variables reference a doctype/cluster that doesn't exist, unlike query sources, which still returns results along with the error. But that's okay, since I won't be using that one anyway.

Oh, and I'll just always double up the namespace (like /music/music/) when inserting documents because I feel like that way of namespacing is for unusual use cases and would complicate things considerably. It seems for most things it would be far easier just to add a field to the document to use as a filter. That would prevent 2 documents with the same primary key being added, which this namespacing seems to allow.
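For anyone wanting to try the "add a field and filter on it" idea, here is a minimal Python sketch of what the YQL could look like. The field name `project` is purely hypothetical (it would have to be added to the search definition and fed with each document), and the helper does no escaping of the values.

```python
def yql_with_project_filter(source, field, term, project):
    """Build a YQL statement that also filters on a hypothetical 'project' field.

    Sketch only: 'project' is an assumed field name, and the inputs are
    inserted as-is, so quotes in the values are not handled.
    """
    return (
        f'select * from {source} where {field} contains "{term}" '
        f'and project contains "{project}";'
    )


if __name__ == "__main__":
    print(yql_with_project_filter("music", "name", "text", "projecta"))
```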