src-d / engine-deprecated

[DISCONTINUED] Go to https://github.com/src-d/sourced-ce/
https://docs.sourced.tech/engine
Apache License 2.0

parse lang command returns blank results #277

Open mcarmonaa opened 5 years ago

mcarmonaa commented 5 years ago

Related to the empathy session.

Running this query:

/* Top languages by repository count */
SELECT *
FROM (SELECT language, COUNT(repository_id) AS repository_count
      FROM   (SELECT DISTINCT
                r.repository_id,
                LANGUAGE(t.tree_entry_name, b.blob_content) AS language
              FROM   refs r
                      JOIN commits c ON r.commit_hash = c.commit_hash
                      JOIN commit_trees ct ON c.commit_hash = ct.commit_hash
                      JOIN tree_entries t ON ct.tree_hash = t.tree_hash
                      JOIN blobs b ON t.blob_hash = b.blob_hash
              WHERE  r.ref_name = 'HEAD') AS q1
      GROUP  BY language) AS q2
ORDER  BY repository_count DESC

I noticed that the result contains a blank language in the second row:

+-------------------+------------------+
|     LANGUAGE      | REPOSITORY COUNT |
+-------------------+------------------+
| Ignore List       |                6 |
|                   |                6 |
| Text              |                6 |
| Markdown          |                6 |
| JSON              |                6 |
| YAML              |                5 |
| Dockerfile        |                5 |
| INI               |                5 |
| Shell             |                5 |
| HTML              |                5 |
| Java              |                5 |
| Makefile          |                4 |
| JavaScript        |                4 |
| Python            |                4 |
| C                 |                4 |
| XML               |                4 |
| TOML              |                3 |
| Go                |                3 |
| Protocol Buffer   |                3 |
| SVG               |                3 |
| Groovy            |                3 |
| Unix Assembly     |                3 |
| Gradle            |                3 |
| Batchfile         |                3 |
| Java Properties   |                3 |
| Ruby              |                3 |
| CSS               |                3 |
| SQL               |                3 |
| Smarty            |                3 |
| Vim script        |                2 |
| CSV               |                2 |
| Git Config        |                2 |
| reStructuredText  |                2 |
| Git Attributes    |                2 |
| Perl              |                2 |
| Maven POM         |                2 |
| AsciiDoc          |                2 |
| XSLT              |                2 |
| PLSQL             |                2 |
| FreeMarker        |                2 |
| Java Server Pages |                2 |
| Kotlin            |                2 |
| PLpgSQL           |                2 |
| Less              |                1 |
| HAProxy           |                1 |
| PowerShell        |                1 |
| R                 |                1 |
| Ant Build System  |                1 |
| Scala             |                1 |
| Roff              |                1 |
| Yacc              |                1 |
| RMarkdown         |                1 |
| HTML+Django       |                1 |
| Thrift            |                1 |
| AspectJ           |                1 |
| Csound            |                1 |
| GAP               |                1 |
| SQLPL             |                1 |
| HTML+ERB          |                1 |
| HiveQL            |                1 |
| q                 |                1 |
| ANTLR             |                1 |
+-------------------+------------------+

This is the list of repositories I'm using:

I found that running srcd parse lang on this file and this file returns nothing.

Not sure if this is a bug or not.

carlosms commented 5 years ago

I think it makes sense to return an empty string in LANGUAGE() when it cannot be detected. You can always add a WHERE language <> '' if you need to filter them. What do you think @ajnavarro?

ajnavarro commented 5 years ago

Yep, no lang detected is an empty string for enry, so we are returning that.

Edit:

We return NULL if no lang is detected by enry:

    lang := enry.GetLanguage(path, blob)
    if lang == "" {
        return nil, nil
    }

So that empty result might be a NULL.
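To make the NULL behaviour concrete, here is a minimal, self-contained sketch. It assumes a hypothetical detect function standing in for enry.GetLanguage (it only looks at file extensions) and a hypothetical render helper that prints a nil value as a blank cell, which is one plausible way the empty row ends up in the table. Neither helper is taken from gitbase or enry.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// detect is a hypothetical stand-in for enry.GetLanguage. Like enry, it
// returns "" when no language can be identified. It only checks extensions.
func detect(path string, content []byte) string {
	switch filepath.Ext(path) {
	case ".go":
		return "Go"
	case ".py":
		return "Python"
	}
	return ""
}

// languageOrNil mirrors the snippet above: an undetected language is
// returned as nil (SQL NULL) instead of an empty string.
func languageOrNil(path string, content []byte) (interface{}, error) {
	lang := detect(path, content)
	if lang == "" {
		return nil, nil
	}
	return lang, nil
}

// render is a hypothetical client-side formatter that shows NULL as a
// blank cell, so NULL and "" look identical in a result table.
func render(v interface{}) string {
	if v == nil {
		return ""
	}
	return fmt.Sprint(v)
}

func main() {
	for _, p := range []string{"main.go", "LICENSE"} {
		v, _ := languageOrNil(p, nil)
		fmt.Printf("%-8s -> %q (nil: %t)\n", p, render(v), v == nil)
	}
}
```

Since a NULL and an empty string can look the same once printed, a filter meant to hide the blank row should exclude both.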

dpordomingo commented 5 years ago

@carlosms is this an issue for Engine or for Gitbase?

carlosms commented 5 years ago

I think it's not a bug. If anything, we could edit the query example in gitbase-web to filter out empty languages... What do you think @mcarmonaa?

mcarmonaa commented 5 years ago

I think adding a filter for empty languages is a good idea. It would also serve as an example/documentation for this specific case, which might not seem obvious at first glance.