Semantic search engine - Githubissues

kiritsuku commented 8 years ago

We need a semantic search engine for all of our tools. The key requirements are:

Index arbitrary semantic data
Provide a query API that allows users to search for the data they are interested in
Provide the search engine as a standalone library and as a web service
The index should be extensible in what can be indexed and how data can be classified with semantic information

In order to limit the scope of such a search engine, at the beginning we should focus at indexing concrete syntax trees (CST) for the programming language Scala. Indexed CSTs would already allow us to implement a lot of useful features including but not limited to:

Find references to arbitrary symbols (identifiers with semantic information)
Build a type hierarchy
Dependency hierarchy of projects
Show implementation classes of interfaces/traits/abstract/classes
Show super implementation of overridden symbol
Sort potential code completion matches

A selection of projects that could need such a search engine:

scala-ide - The IDE already supports some of the features mentioned above but they all rely on the search engine that comes with Eclipse, which is not powerful enough to support all kind of use cases.
scala-search - It was created to supersede the search engine that is shipped with scala-ide but it has severe limitations in so far that its goal is to only index trees. Also, the project is not standalone and therefore its indexer can't be used outside of Eclipse.
intellij-scala - Similar to the scala-ide case but IntelliJ already has its own indexer for CSTs. Their indexer is still not powerful enough to index all kind of information but it is difficult to build such an indexer anyway. Since reusing the indexer of IntelliJ would limit us in prototyping a semantic search engine for everything, we shouldn't rely on it.
scala-refactorings - Refactorings right now depend on scala.reflect trees, which are ASTs and not CSTs. We could depend on scala.meta trees but the project is not there yet. Also, scala.meta provides only a CST for Scala and not an abstraction for all kind of programming languages or let alone for information that is not part of trees. The query API however would be an abstraction that could be used to search for the information that is needed by the refactoring implementations. And it would make some refactorings even programming language independent, since the query API abstracts over languages.
ensime-server - Similar cases as for the IDEs, the Graphpocolypse proposal shows that they also need something better.
scaps - It relies heavily on indexing information that comes out of trees. Right now this project has two goals: Provide a search engine and classify the trees in a way that users can find what they need. The latter is more important for its users, the indexer is just there to speed the whole thing up. Each use case should be handled by a different project to make maintenance easier, where the indexer is what is described by this proposal and should be moved out of scaps.
scala.meta - For scala.meta it would be interesting to feed the indexer with its trees and to get them back again when they are needed. This would be mostly interesting for users who don't want to use the query API but who want to work directly on top of trees that are generated from a cache.

The goal of the semantic search project is to provide an abstraction that can be used by all of the tools mentioned above. It shall supersede all existing indexers used by any of the tools above and provide an API that makes it possible for these tools to rely on this single uber indexer.

I have the chance to work several months full-time on this project and would like to explore the technical challenges that need to be solved in order to get such a uber indexer. Since I need to get something done the main goal is to first get the indexer up and running for the needs of various Scala tools and if further time remains to find out how to move it to the next level.

I would like to get feedback by all maintainer of the tools mentioned above, namely @fommil @mlangc @misto @xeno-by @smarter @dragos @rorygraves @Luegg @wpopielarski

fommil commented 8 years ago

I think it would be very useful to have access to a common indexing scheme like this in ENSIME, but we'd want to embed it just like we plan to do with scaps.

We have endless performance problems on extremely large projects, which has shaped our current design.

This also touches on areas of "when to update the search database", are you planning on tackling that too? If you allow yourself to use Java 7, then things become a lot easier (but still not ideal on Windows)

We'll have a clearer idea of our full API requirements when we move onto the graph. Right now, we're having problems as a result of Symbol / FQN resolution, so I'm working on that in a dev branch. If you could provide even more semantic information, that would be interesting.

Have you got some usecases? e.g. user queries that are not possible today.

kiritsuku commented 8 years ago

On 02/19/2016 02:14 PM, Sam Halliday wrote:

I think it would be very useful to have access to a common indexing scheme like this in ENSIME, but we'd want to embed it just like we plan to do with scaps.

Why would you prefer embedding? It is definitely a use case I want to support but I would like to find out why embedding is is more important to you.

We have endless performance problems on extremely large projects, which has shaped our current design.

This also touches on areas of "when to update the search database", are you planning on tackling that too?

I guess I have to. If there are changes anywhere we don't want to re-index everything and we probably also don't want to index everything that could be indexed. Scaling up to big codebases is a need every tool for semantic information has. During the hackathon on scalasphere, we had the discussion how IntelliJ can be as fast as it is. It probably indexes in multiple steps. The first step should be fast and therefore only index the information that is immediately available (like class names, which can be seen just by looking at a token stream). A later step can be more accurate and therefore a lot slower (like find references, which needs to know types).

If you allow yourself to use Java 7, then things become a lot easier (but still not ideal on Windows)

Why Java 7? I guess as long as I don't use dependencies that need Java 8 and as long as I stay away from Scala 2.12 I should be fine even with Java 6.

We'll have a clearer idea of our full API requirements when we move onto the graph. Right now, we're having problems as a result of Symbol / FQN resolution, so I'm working on that in a dev branch. If you could provide even more semantic information, that would be interesting.

The main problem is that it is difficult for a single project to get all technical challenges right. With this proposal I would like to join our forces towards a single implementation that can be used by everyone. Incremental updates, fast updates, non blocking queries, data compression, data classification and so on are really hard to get right and I would like to have these things in one place.

Have you got some usecases? e.g. user queries that are not possible today.

My thoughts go in mainly two directions:

The first one is to not throw away all of the information our tools have. If we would index snapshots of the trees that are in the compiler the moment when an error happens, we could track down how users solve an error. If this information would be available in a globally wide accessible database, every time a user gets an error the compiler could do an additional lookup at the database in order to find out if someone else already had such an error. In this case the compiler could give in the best case very precisely on how to resolve the error. This feature is kind of complicated because it requires compiler support. Compilers especially are not allowed to just report errors as string but as real semantic objects. Unfortunately neither scalac nor dotc are doing that right now. But once I have the semantic indexer, I hope to have some good arguments to get support for that in the compiler.

The second one is to index information as semantic data, which we see right now only as syntactic data. If for example someone writes a blog post about how to use a library it is difficult to associate this information with the use case of another user. Traditional search engines allow us only to search for keywords but they usually can't tell us if the blog post mentions an outdated API or how to download the dependency for the mentioned library. At the beginning, it would be nice if users could tag arbitrary web resources with such information. For example if there is a code example they could tell the indexer how to compile it and later search queries would be able to tell users more precisely what they want to know. At one point it may even be possible to automatically associate this syntactic data with semantic data but that lies too far in future to tell how such a mechanism could be implemented.

Both use cases are not possible today and aren't definitely anything I can work on in the next months. But I hope once I have access to a better semantic search engine I can explore these directions. I can think of a few more use cases, but they are even more crazy and therefore make it difficult for me to even write them down. ;)

fommil commented 8 years ago

Embedding because I don't want another JVM process to deploy and manage, that costs heap and complicates the install for users.

Java 7 means NIO and performance and async file watchers.

FYI, ENSIME indexing is way faster than IntelliJ and Scala IDE on my work project. We've really optimised the hell out of it, so you've got a lot of catching up to do :smile:

kiritsuku commented 8 years ago

@fommil Regarding graph databases: Can you explain why you think they are better than relational databases? Sure, the model we have to work with are trees and at some points even graphs but I wonder what bad experiences you made by mapping that down to a relational structure.

fommil commented 8 years ago

Performance, memory size and extensibility. With relational databases, a table needs to be created for each feature, there is huge duplication of data (even just the extra keys). With a graph database, I'm hoping we can create the schema and write new features later.

kiritsuku commented 8 years ago

Ok, thanks. That is also what I was thinking about relational databases. Now I have some questions about the current design of Graphpocolypse:

Do you track where an implicit conversion is used?
Inferred types, how do you know about them. Do you typecheck everything to get the information or do you take some shortcomings here?
How do you handle updates to the database? Say, someone changes a method signature, you would also have to update it in all subtypes of a class. Do you rely on sbt's incremental compilation feature to find out which files need to be recompiled and then re-index all of these files?

fommil commented 8 years ago

no, we don't look at source code at all. That'd be nice. We would only track an implicit conversion usage like any other method call or object reference.

kiritsuku commented 8 years ago

Alright, so you only index the stuff that is part of the bytecode? How do you handle updates then? Do you re-index all class files that changed?

kiritsuku commented 8 years ago

@Luegg I guess you didn't have to think a lot about updating your indexer yet since you are only indexing libraries. Are you already done with your thesis and is it publicly available?

fommil commented 8 years ago

yeah we have file watchers for that. It's not just the bytecode though, we also have the scala pickles for type info. File handling is extremely "interesting" on Windows, a file handle will mean that nobody can move/modify/delete that file... which is very inconvenient if you are reading a classfile and the compiler decides that file is stale and needs to be rewritten.

It would be incredibly expensive to run the presentation compiler over all source code. The very design of ENSIME is to avoid doing this as it is insanely expensive and prohibitive for large projects. Therefore we try our best to work out semantic information lazily by working out which source files to load to get their typed trees.

Luegg commented 8 years ago

@sschaef You're right, up to now this was not necessary. And, in fact, efficiently updating the index is quite hard. See also the discussion at https://github.com/ensime/ensime-server/issues/472. The main issue are frequency statistics that have to be more or less accurate to get precise search results.

Altogether, having type informations readily available would be really convenient for Scaps. Though, I use some custom data structures to speed up query analysis and therefore need "low level" access to the index.

I will upload the thesis tomorrow.

kiritsuku commented 8 years ago

@MasseGuillaume You may also be interested in this project. Would you need such an indexer in metadoc? Did you already work on an indexer by yourself?

MasseGuillaume commented 8 years ago

@sschaef Every Scala project is interesting :-).

My goal with https://github.com/metadoc/metadoc was to approach the indexing problem top down. First get a list of all Scala project publicly available. Fork the project to add a sbt-plugin to index it. For the indexing part I was a bit stuck with the serialization of scalameta trees.

I'm working in a startup now, so I have really limited extra coding time. In the next month, I will put more emphasis on ScalaKata. I'm adding collaborative editing and printing the type of an expression for 1.0.8.

kiritsuku commented 8 years ago

Any special features you plan to implement once you have the index available? My plan is to provide an index+query engine not just for syntactic but also for semantic data. Therefore I would like to know beforehand which use cases people have.

Luegg commented 8 years ago

@sschaef My thesis is now available at http://about.scala-search.org/thesis.pdf. Though, the description of the model is not very formal and some aspects are already outdated. But the overall idea remains the same. I hope it is of any help to you. A more formal paper of the approach should follow.

kiritsuku commented 8 years ago

Cool stuff, thanks. Be prepared that I'll come back to you with questions the paper can't answer. ;)

smarter commented 8 years ago

Compilers especially are not allowed to just report errors as string but as real semantic objects. Unfortunately neither scalac nor dotc are doing that right now.

I encourage you to open an issue on https://github.com/lampepfl/dotty/issues/ which explains in details what these semantic error messages should be like and what the use cases are. This looks like something that could be done by new contributors since it shouldn't require any deep knowledge of the compiler's internals.

kiritsuku commented 8 years ago

I may do some one day but not today. As long as the indexer doesn't work for me, I don't have a use case by myself (well, actually I would have one in the IDE but I don't have the time to implement it) and since no one showed up so far to implement such a feature in the compiler, I guess no one will show up within the next weeks either.

kiritsuku commented 8 years ago

I had a look at different implementations and their thesis of the projects mentioned above.

Ensime/Graphpocolypse: Can only search bytecode and is therefore severely limited by design. Bytecode searching is extremely fast (at least compared to scalac typechecking) and is therefore still useful for a multi level indexer (a indexer that indexes some information first and more information afterwards).
scala-search: It can't index bytecode but only Scala sources and even here it can only index identifiers. The typechecker is still required since a query only searches potential identifiers and then asks the compiler if the identifier match the searched symbol. This makes it quite fast but every interaction with the compiler slows it down unnecessarily nevertheless. Also, since only identifiers are indexed, it is for example not possible to search for inferred types.
scaps: It is more powerful than scala-search, since it indexes a lot more information. It still doesn't index the entire tree, which means that there is still a lot of information thrown away. The largest limitation right now is that it can't handle constant updates of the index.

The Ensime approach is interesting because of its bytecode indexing, the approach can be reused. scala-search is mostly interesting for its scala-ide integration, the indexing backend however is too limited and can't address my interests. scaps is most interesting to me because of its need to ask the indexer multiple times for information for only a single query. It may be a design challenge to get an indexer that is reasonable fast while allowing users to merge or filter information from multiple queries on top of abstract or even concrete syntax trees. It is also questionable if we want to index a concrete syntax tree at all or if we just want to store a snapshot of a tree to give users the possibility to retrieve it and process it further.

kiritsuku commented 8 years ago

I started working on this. A summary of the plan where I want to go:

Since I want to index arbitrary data, I decided to use semantic web technologies for the indexing engine. They were created for this use case and semantic web is the technology that is supposed to have the most success in the coming years. This means technologies like RDF and OWL will be used to build up ontologies and reference data.
SPARQL and Datalog will be used to create queries against the index. Once they can be used to query data, high level programming language dependent APIs can be used for querying. In Scala one could use a library like quill to create such queries.
Apache Jena already supports all mentioned technologies above and therefore I will use it. With banana-rdf there even exists a Scala library for RDF.
I will start to index typed scalac trees. Provided as a compiler/sbt plugin this means we get incremental updates for free. Once integrated into IDEs, we can get even better incremental updates for free. Once scala.meta, dotc or tasty trees become a decent alternative, we can start indexing them as well. Since there is an indirection through SPARQL/Datalog between the user API and the indexed information this shouldn't affect user queries at all.
Once scalac trees are indexed, I can start thinking where to integrate the. IDE integration is the main motivation of course but Github and StackOverflow integration (or better: enrichment through semantic features) would probably get more attention.

xeno-by commented 8 years ago

@sschaef Sounds great! Looking forward to learning more as things unfold.

dragos commented 8 years ago

The plan sounds good, but I would suggest you think really hard about using RDF/OWL. They are dead. I would be very careful about any technology that appeared during the XML bubble. :) Even if it weren't... the goals of OWL are sufficiently different to make it awkward to use for representing programs.

kiritsuku commented 8 years ago

Can you expand on this? Any specific issues you have in mind that could make problems?

dragos commented 8 years ago

It's mainly about the cost of using it, for unclear gains.

it's too abstract/generic. It can model anything. So you'd need to define a model in their meta-model for Scala programs.
additional complexity. It's a W3C standard that has way more than you'd need
what are the gains? Any tools out there that would make it worth it for people to cope with the additional complexity?
Scala models are static, while OWL aims at modelling information that changes frequently
it's XML based. Maybe there are other syntaxes now, but that's what people associate it with
nobody is using it, and those who did, gave up.

I simply think there's an impedance match.

kiritsuku commented 8 years ago

Alright, thanks for the clarifications. There are other technologies, like http://schema.org/, Microdata or JSON-LD, which seem to be simpler to create knowledge graphs or linked data. But I first have to get familiar with all that stuff and find out how to combine that with querying technologies like SPARQL or Datalog. As long as I don't leave prototype area, switching to a new technology is a no-brainer anyway.

MasseGuillaume commented 8 years ago

RDF is also used to link against another knowledge graph.

An example of linking knowledge between statically defined Scala and another kind of knowledge (http documentation) would be rho

harveywi commented 8 years ago

Are you familiar with David Spivak's "olog" (https://en.wikipedia.org/wiki/Olog)? I was doing some reading in category theory a while back and ran across it. The basic idea is that category theory can be used to encode knowledge. From Wikipedia:

The motivation behind introducing ologs is to provide a rigorous mathematical framework for knowledge representation, construction of scientific models and data storage using linguistic (we use the English language as an example in this article) and graphical tools. We will refer to the olog above in the remainder of the article.

This may not meet your needs, but you may find a kernel of something interesting in there. For a comprehensive overview of the subject, take a look at Spivak's "Category Theory for the Sciences" (https://mitpress.mit.edu/books/category-theory-sciences). Plus you would get a bonus point for Conway's Law.

kiritsuku commented 8 years ago

On 03/16/2016 08:15 PM, William Harvey wrote:

Are you familiar with David Spivak's "olog" (https://en.wikipedia.org/wiki/Olog)?

No, I'm not. I'm not even familiar with category theory in the first place. Therefore, there is probably not much I could get out of this.

kiritsuku commented 8 years ago

Small update about what I managed to do during the last week:

I created a converter for a small subset of Scala, which can extract information out of scalac trees
I indexed the extracted data by using the data language JSON-LD
I can query the data through SPARQL

Here is a test that shows the whole thing in action:

  @Test
  def find_all_methods_of_single_class() = {
    val modelName = "http://test.model/"
    ask(modelName, convertToHierarchy(
      "<memory>" → """
        package a.b.c
        class C1 {
          def m11 = 0
          def m12 = 0
        }
        class C2 {
          def m2 = 0
        }
        class C3 {
          def m3 = 0
        }
      """), s"""
        PREFIX c:<$modelName>
        PREFIX s:<http://schema.org/>
        SELECT ?member WHERE {
          ?class c:tpe "class" .
          ?class s:name ?className .
          FILTER (str(?className) = "C1") .
          ?member c:parent ?class .
        }
      """) === Seq(
        Data("member", s"${modelName}_root_/a/b/c/C1/m11"),
        Data("member", s"${modelName}_root_/a/b/c/C1/m12"))
  }

The next thing to do is to improve the converter and the indexer in a way that hey can handle more use cases.

xeno-by commented 8 years ago

Are you doing semantic analysis of the code or just parsing it for now?

kiritsuku commented 8 years ago

Semantic analysis. I operate on the typechecked trees and get additional info out of symbols and types.

kiritsuku commented 8 years ago

A quick update: I'm still working on converting scalac trees to a data model that can be easily queried. It takes way more time than I would like to invest but I can't do anything against it - Scala as a language is ridiculously difficult to get right by a tool. And scalac trees don't make life easier. Anyway, that is why I'm doing all this - any other approach is simply not enough to handle Scala well.

My plan right now is to invest two further weeks in hardening the converter - if I'm still not able to handle 100% of the language by then (which will be likely but I would already be happy with 99%), I will simply ignore the remaining use cases and start implementing some features that use the new query API. At least, find declarations, find references, find identifier regions and variations of these features work already very well.

xeno-by commented 8 years ago

Do you happen to also undo desugarings, or you convert the code as is?

kiritsuku commented 8 years ago

Yes, I care about desugarings but I don't preserve syntactical details. I'm interested in any semantic information I can get, which means that Option(1) would give me the information that an apply method is used and that it is parameterized by an Int. Here is a test that shows it in action:

  @Test
  def implicit_apply_method() = {
    ask(modelName, s"""
        PREFIX c:<?MODEL?>
        PREFIX s:<http://schema.org/>
        SELECT * WHERE {
          [c:tpe "ref"] s:name ?name ; c:start ?start ; c:end ?end .
        }
      """,
      "<memory>" → """
        class X {
          val [[!Option]]a = [[!apply]][[!Int]][[Option]](1)
        }
      """)
  }

It is not perfect yet as one can see at the [[!Option]]. There is a [[!Int]] missing, which would mean that the type is Option[Int] and not just Option.

Actually, I'm mostly doing the same what the semantic model of scala.meta is doing but I look at it from a different angle. Instead of using a CST, I use linked data (i.e. a triple store) and I index less information.

This has the disadvantage that I can't provide foundations for tools that need knowledge about syntactical details but this is not a hard limitation. I could index syntactical information and I'm probably going to do so in future but not for now. Another disadvantage is that matching on overly complex pattern is more difficult since I index only a simplified model of the world (of the trees). But once again that is not a hard limitation since I can always index more information.

Advantages to my approach are that I have an easy model to reason about (a triple store) and that I have an index to search for semantic information. It is a simpler model than a CST and also a more powerful model since it abstracts over semantic data of different contexts. I can index and represent arbitrary languages and not just Scala. My goal is to find a minimal model for Scala and for other programming languages (and not just for programming languages). Right now I'm jumping between "I have to index more information to preserve semantics" and "I just found a simplification that allows me to index less data".

xeno-by commented 8 years ago

Could you elaborate on the data model that you're using?

kiritsuku commented 8 years ago

Sure, it is actually quite easy. The data model is named "Linked Data" but I'm not sure if this term is well defined. In the Semantic Web, Linked Data just describes triples with the following proporties:

The triple has three fields (well, that is obvious): (s, p, o). The fields are referred as subject, predicate and object.
The subject is a unique identifier, a URI.
The predicate and object form the a key-value relationship, where the object field can reference to other subject fields. This means that the data can be linked together through the subject fields.

The Scala source class X { val a = Option(1) }, which I showed earlier could be represented in a tabular way as something like this:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| s                                                                                                             | p                                                 | o                                                                                             |
=====================================================================================================================================================================================================================================================================
| <http://test.model/_root_/X>                                                                                  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://schema.org/Text>                                                                      |
| <http://test.model/_root_/X>                                                                                  | <http://schema.org/name>                          | "X"                                                                                           |
| <http://test.model/_root_/X>                                                                                  | <http://test.model/attachment>                    | "class"                                                                                       |
| <http://test.model/_root_/X>                                                                                  | <http://test.model/end>                           | 16                                                                                            |
| <http://test.model/_root_/X>                                                                                  | <http://test.model/file>                          | "<memory>"                                                                                    |
| <http://test.model/_root_/X>                                                                                  | <http://test.model/owner>                         | <http://test.model/_root_>                                                                    |
| <http://test.model/_root_/X>                                                                                  | <http://test.model/start>                         | 15                                                                                            |
| <http://test.model/_root_/X>                                                                                  | <http://test.model/tpe>                           | "decl"                                                                                        |
| <http://test.model/_root_/X/a>                                                                                | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://schema.org/Text>                                                                      |
| <http://test.model/_root_/X/a>                                                                                | <http://schema.org/name>                          | "a"                                                                                           |
| <http://test.model/_root_/X/a>                                                                                | <http://test.model/attachment>                    | "val"                                                                                         |
| <http://test.model/_root_/X/a>                                                                                | <http://test.model/end>                           | 34                                                                                            |
| <http://test.model/_root_/X/a>                                                                                | <http://test.model/file>                          | "<memory>"                                                                                    |
| <http://test.model/_root_/X/a>                                                                                | <http://test.model/owner>                         | <http://test.model/_root_/X>                                                                  |
| <http://test.model/_root_/X/a>                                                                                | <http://test.model/start>                         | 33                                                                                            |
| <http://test.model/_root_/X/a>                                                                                | <http://test.model/tpe>                           | "decl"                                                                                        |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/33>                                                       | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://schema.org/Text>                                                                      |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/33>                                                       | <http://schema.org/name>                          | "Option"                                                                                      |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/33>                                                       | <http://test.model/attachment>                    | "ref"                                                                                         |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/33>                                                       | <http://test.model/end>                           | 33                                                                                            |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/33>                                                       | <http://test.model/file>                          | "<memory>"                                                                                    |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/33>                                                       | <http://test.model/owner>                         | <http://test.model/_root_/X/a>                                                                |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/33>                                                       | <http://test.model/start>                         | 33                                                                                            |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/33>                                                       | <http://test.model/tpe>                           | "ref"                                                                                         |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/33>                                                       | <http://test.model/reference>                     | <http://test.model/_root_/scala/Option>                                                       |
| <http://test.model/_root_/scala/Int/%3Cmemory%3E/37>                                                          | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://schema.org/Text>                                                                      |
| <http://test.model/_root_/scala/Int/%3Cmemory%3E/37>                                                          | <http://schema.org/name>                          | "Int"                                                                                         |
| <http://test.model/_root_/scala/Int/%3Cmemory%3E/37>                                                          | <http://test.model/attachment>                    | "ref"                                                                                         |
| <http://test.model/_root_/scala/Int/%3Cmemory%3E/37>                                                          | <http://test.model/end>                           | 37                                                                                            |
| <http://test.model/_root_/scala/Int/%3Cmemory%3E/37>                                                          | <http://test.model/file>                          | "<memory>"                                                                                    |
| <http://test.model/_root_/scala/Int/%3Cmemory%3E/37>                                                          | <http://test.model/owner>                         | <http://test.model/_root_/X/a>                                                                |
| <http://test.model/_root_/scala/Int/%3Cmemory%3E/37>                                                          | <http://test.model/start>                         | 37                                                                                            |
| <http://test.model/_root_/scala/Int/%3Cmemory%3E/37>                                                          | <http://test.model/tpe>                           | "ref"                                                                                         |
| <http://test.model/_root_/scala/Int/%3Cmemory%3E/37>                                                          | <http://test.model/reference>                     | <http://test.model/_root_/scala/Int>                                                          |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/37>                                                       | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://schema.org/Text>                                                                      |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/37>                                                       | <http://schema.org/name>                          | "Option"                                                                                      |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/37>                                                       | <http://test.model/attachment>                    | "ref"                                                                                         |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/37>                                                       | <http://test.model/end>                           | 43                                                                                            |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/37>                                                       | <http://test.model/file>                          | "<memory>"                                                                                    |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/37>                                                       | <http://test.model/owner>                         | <http://test.model/_root_/X/a>                                                                |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/37>                                                       | <http://test.model/start>                         | 37                                                                                            |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/37>                                                       | <http://test.model/tpe>                           | "ref"                                                                                         |
| <http://test.model/_root_/scala/Option/%3Cmemory%3E/37>                                                       | <http://test.model/reference>                     | <http://test.model/_root_/scala/Option>                                                       |
| <http://test.model/_root_/scala/Option/apply%28Ljava%2Flang%2FObject%3B%29Lscala%2FOption%3B/%3Cmemory%3E/37> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://schema.org/Text>                                                                      |
| <http://test.model/_root_/scala/Option/apply%28Ljava%2Flang%2FObject%3B%29Lscala%2FOption%3B/%3Cmemory%3E/37> | <http://schema.org/name>                          | "apply"                                                                                       |
| <http://test.model/_root_/scala/Option/apply%28Ljava%2Flang%2FObject%3B%29Lscala%2FOption%3B/%3Cmemory%3E/37> | <http://test.model/attachment>                    | "ref"                                                                                         |
| <http://test.model/_root_/scala/Option/apply%28Ljava%2Flang%2FObject%3B%29Lscala%2FOption%3B/%3Cmemory%3E/37> | <http://test.model/end>                           | 37                                                                                            |
| <http://test.model/_root_/scala/Option/apply%28Ljava%2Flang%2FObject%3B%29Lscala%2FOption%3B/%3Cmemory%3E/37> | <http://test.model/file>                          | "<memory>"                                                                                    |
| <http://test.model/_root_/scala/Option/apply%28Ljava%2Flang%2FObject%3B%29Lscala%2FOption%3B/%3Cmemory%3E/37> | <http://test.model/owner>                         | <http://test.model/_root_/X/a>                                                                |
| <http://test.model/_root_/scala/Option/apply%28Ljava%2Flang%2FObject%3B%29Lscala%2FOption%3B/%3Cmemory%3E/37> | <http://test.model/start>                         | 37                                                                                            |
| <http://test.model/_root_/scala/Option/apply%28Ljava%2Flang%2FObject%3B%29Lscala%2FOption%3B/%3Cmemory%3E/37> | <http://test.model/tpe>                           | "ref"                                                                                         |
| <http://test.model/_root_/scala/Option/apply%28Ljava%2Flang%2FObject%3B%29Lscala%2FOption%3B/%3Cmemory%3E/37> | <http://test.model/reference>                     | <http://test.model/_root_/scala/Option/apply%28Ljava%2Flang%2FObject%3B%29Lscala%2FOption%3B> |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This is not a perfect way to describe it but the one which I use right now. As an example, the triple

s: <http://test.model/_root_/scala/Option/%3Cmemory%3E/37>
p: <http://test.model/reference>
o: <http://test.model/_root_/scala/Option>

represents a reference to the type scala.Option, which has the identifier/URI http://www.test.model/_root_/scala/Option/<memory>/37, where <memory> is the file where this reference is used and 37 the offset in the file. Other triples tell us that this reference is implicit, i.e. it doesn't appear in source and that it is the type of a val. Such information can be inferred based on the semantic model (i.e. the schema definition also know as ontology) behind the data.

For Linked Data it is not a requirement that the subjects and predicates are URIs, this is a requirement of the Semantic Web. I nevertheless chose this requirements because they will give me a technical advantage in future. This way, I can make the content trivially accessible through web technologies and even better I can index anything that can show up in the web and not just what the Scala compiler produces.

xeno-by commented 8 years ago

Somewhat reminds me of Kythe. What kind of predicates do you currently support?

Also, what do you plan to do when a position alone isn't enough to identify a subject? E.g. what if we have foo.bar(baz) that gets typechecked into implicitConv[Foo](foo)(implicitArg1, implicitArgs2).bar[Baz, Quux](baz)? In this example, multiple synthetic trees end up at the same positions.

fommil commented 8 years ago

what do you do if you don't have access to the source code?

kiritsuku commented 8 years ago

On 04/07/2016 03:09 PM, Eugene Burmako wrote:

Somewhat reminds me of Kythe. What kind of predicates do you currently support?

Only a few and only predicates that were introduced because of the needs of Scala. I haven't figured out yet how to scale that up to arbitrary languages even though, in the essentials most turing complete languages can be described through a very simple model (at least this is what I'm thinking about it so far, see below).

Kythe is an interesting project but I think their foundations are screwed. Their entire platform is just too limited - a problem which I hope not to have since it is "webscale". Nevertheless, I first need to find out how well that really works in practice.

Also, what do you plan to do when a position alone isn't enough to identify a subject? E.g. what if we have |foo.bar(baz)| that gets typechecked into |implicitConvFoo(implicitArg1, implicitArgs2).barBaz, Quux|? In this example, multiple synthetic trees end up at the same positions.

Multiple objects ending up at the same positions is not a problem. However, what is a problem is the fact that right now I do not yet index a qualifier, i.e. a mechanism that would allow me to find out how implicit references build upon each other. The Scala code for my model looks something like this (simplified):

sealed trait Hierarchy
case class Decl(name: String, owner: Hierarchy) extends Hierarchy
case class Ref(name: String, refToDecl: Hierarchy, owner: Hierarchy, 
qualifier: Hierarchy) extends Hierarchy

The idea is that everything is either a declaration or a reference, which is true not for Scala but more or less for every turing complete language. The qualifier at the Ref is something I ignore for now but I would need it to correctly disambiguate the use case above.

xeno-by commented 8 years ago

Multiple objects ending up at the same positions is not a problem.

How's that not a problem? What URLs would you give such subjects?

kiritsuku commented 8 years ago

On 04/07/2016 03:19 PM, Sam Halliday wrote:

what do you do if you don't have access to the source code?

Nothing. At least for now. Since I can index anything, I'm planning to index bytecode. Thankfully, this is kind of an easy task, since bytecode is way simpler than scalac trees. On the other side, it can only provide a poor semantic model for Scala code. For Java code it works better. It would nevertheless help to answer interesting questions like "which library defines type X or method y".

xeno-by commented 8 years ago

If we do AST persistence (something that I plan to have in scala.meta 0.2), then you won't have to index bytecode - you'll be able to deserialize the trees and then index them instead. Would that be something of interest to you?

kiritsuku commented 8 years ago

On 04/07/2016 04:02 PM, Eugene Burmako wrote:

Multiple objects ending up at the same positions is not a problem.
How's that not a problem? What URLs would you give such subjects?
We would have to include more information to disambiguate the situation. For parameters we could include the position of the parameters, which would always lead to a unique identifier. For now I only disambiguate based on the position, which means that I can't represent multiple implicit parameters in the index yet. But that shouldn't be too painful to fix.

kiritsuku commented 8 years ago

On 04/07/2016 04:25 PM, Eugene Burmako wrote:

If we do AST persistence (something that I plan to have in scala.meta 0.2), then you won't have to index bytecode - you'll be able to deserialize the trees and then index them instead. Would that be something of interest to you?

Yes, that would be very welcome. But it would only work for newly published artifacts not for already published ones. Therefore, I can't get around to also index bytecode.

kiritsuku commented 8 years ago

Update: In the last weeks I was busy adding a web frontend. It is now possible to

run SPARQL queries
upload data
observe the system

through websockets and HTTP requests. Nothing spectacular but makes the whole system usable outside of unit tests. Until ScalaDays Berlin I want to do some polishing of everything. While the code is far away from being production ready I should be able to give a small introduction into the functionality at ScalaDays for everyone who is interested.

xeno-by commented 8 years ago

Great news! Looking forward to meeting you in Berlin!

kiritsuku commented 8 years ago

Nearly a month has been past since my last update and not a lot of changes did happen. I started to integrate semantic web stronger into the platform and as it turned out I had no idea what I were doing. I basically still have no idea what I'm doing but at least I figured out a way to move forward. I hoped to be able to show a cool demo on ScalaDays but it won't fly. There are still too many theoretic problems that need to be solved. I first have to solve them before I can continue to build cool stuff upon them.

Martin said, it took them more than 8 years to figure out how to make dotty to what it is today and they still don't have a public release yet. That gives me some comfort given that I'm already half a year in but still can't show anything great. Nevertheless, on ScalaDays I would be happy to talk about what I have achieved so far. Meet me there whoever is interested in finding out more!

kiritsuku commented 8 years ago

A long time has passed since my last update, this time I can say: it is done! My original idea of implementing a semantic search engine turned out to be a success, see this short demo for a first result: https://twitter.com/sschaef_/status/775118331351359488

Sure, there is still a long way to go. But I moved a step further to make my dream of my very own IDE reality. Therefore, I close this issue. Further progress can be seen by watching this repo.

fommil commented 8 years ago

woohoo!

xeno-by commented 8 years ago

\o/ Did you manage to get the converter running for all pieces of Scala syntax? Or did you limit yourself to a selected subset?

kiritsuku / amora

Semantic search engine #13