src-d / engine-deprecated

[DISCONTINUED] Go to https://github.com/src-d/sourced-ce/
https://docs.sourced.tech/engine
Apache License 2.0
217 stars 26 forks source link

Parsing csharp files does not recognize the language #297

Open dpordomingo opened 5 years ago

dpordomingo commented 5 years ago

discovered at https://github.com/src-d/empathy-sessions/issues/37#issuecomment-470045137

Running parse over a .cs file is not working

$ srcd parse uast examples/example.cs
INFO[0000] detected language: c#                        
Error: language c# is not supported
Usage:
  srcd parse uast [file-path] [flags]
...

Workaround

Until this is fixed, csharp needs to be passed explicitly to bblfshd. For example:

srcd parse uast csharp.cs --lang csharp
smacker commented 5 years ago

This is a known problem in bblfsh:

We use supported languages API to provide meaningful error to the user.

So the question is: do we want to implement workaround inside engine specially for C#?

//cc @creachadair maybe you have some ETA for the proper fix on bblfsh side?

creachadair commented 5 years ago

My understanding was that the fix from https://github.com/bblfsh/bblfshd/pull/251 was supposed to address this without requiring a change to Engine. To really fix this we will need a solution for aliasing, for which we don't have a timeline yet (@bzz has started an RFC for the question, but we have had higher-priority work so far).

Is the above workaround not working for you?

smacker commented 5 years ago

it does not. We use enry + bblfsh supported languages api to check that bblfsh is able to parse a file. This api currently doesn't return c#, only csharp.

creachadair commented 5 years ago

it does not. We use enry + bblfsh supported languages api to check that bblfsh is able to parse a file. This api currently doesn't return c#, only csharp.

I see. So is the issue not that you're unable to get a parse for the file, but that the list of languages doesn't match the string on the Engine side so that Engine doesn't even ask?

smacker commented 5 years ago

@creachadair yes

jfyi workaround from above also doesn't solve the issue for srcd sql and for gemini (which isn't part of engine yet).

carlosms commented 5 years ago

I added this as a known issue to the release notes.

smacker commented 5 years ago

I would propose to return this issue to backlog and wait for the fix from LA team instead of implementing work around on our side due to:

@carlosms wdyt?

carlosms commented 5 years ago

Makes sense, let's wait for a proper fix, I don't think there is any rush.

carlosms commented 5 years ago

It also fails for cpp (c++ is not supported) and bash (shell is not supported).

campoy commented 5 years ago

It seems like C# is then not supported in any way other than language detection for the Engine. I'd consider simply dropping mentions to C# on the release notes for 0.11.

smacker commented 5 years ago

just for the record:

  1. it works for such query:
    SELECT UAST(f.blob_content, 'csharp') AS uast
    FROM refs AS r
    NATURAL JOIN commit_files
    NATURAL JOIN files AS f
    WHERE r.ref_name = 'HEAD' AND f.file_path REGEXP('.*.cs')
  2. it also works in bblfsh-web.
creachadair commented 5 years ago

It seems like C# is then not supported in any way other than language detection for the Engine. I'd consider simply dropping mentions to C# on the release notes for 0.11.

Dropping mentions is a safe choice. But note that the issue is in the language detection, not the functionality: Specifically the detection labels a C# file as "C#" which doesn't match Babelfish "csharp" and thus the file never makes it to the parsing step.

So this is a real bug, and needs a real fix, but the core functionality is solid. So I think we can demo the feature safely, modulo the need to work around that issue. Unfortunately, while the solution isn't difficult, it touches a bunch of plumbing, so it isn't as easy as patching one layer.

carlosms commented 5 years ago

I've tested it a bit more and this works with 0.11.0:

srcd parse uast csharp.cs --lang csharp

So the real limitation is that you can't rely on automatic language detection, but you can still get UAST if you set the language manually (like @smacker said for the SQL queries). I will add this information to the known issues section of the release notes.

I've also tried a quick workaround, in case we think it's worth to release engine again to fix this issue on our side until it is handled by bblfshd.

If we remove this check, these 3 cases would work:

srcd parse uast csharp.cs
srcd parse uast csharp.cs --lang c#
srcd parse uast csharp.cs --lang csharp
dpordomingo commented 5 years ago

why can't we check if one of the aliases of the required language is one of the exposed by bblfsh?

I mean:

Doing so, the filter will be still there, but working properly.

carlosms commented 5 years ago

As a workaround, yes that makes sense. Now the question is if we want to do a workaround, or just wait for the proper handling of aliases by bblfshd.

dpordomingo commented 5 years ago

Sorry, I missed something... I don't see why is it a problem on bblfsh... I ran bblfsh container with images:

$ docker run --rm -d --name bblfshd --privileged -p 9999:9432 bblfsh/bblfshd:v2.11.8-drivers

And then I asked for the uast of a c# file in all possible ways:

$ bblfsh-cli example.cs 
$ bblfsh-cli -l csharp example.cs
$ bblfsh-cli -l c# example.cs 

all of them worked.

It also worked for c++

$ bblfsh-cli example.cpp 
$ bblfsh-cli -l c++ example.cpp 
$ bblfsh-cli -l cpp example.cpp 

:thinking: not for bash

When I tried asking about a non existent driver language, it properly failed:

$ bblfsh-cli -l cobol example.cobol
couldn't parse example.cobol: rpc error: code = Unknown desc = unexpected error: runtime failure: missing driver for language "cobol"
dpordomingo commented 5 years ago

So my question is:

Does the filter makes sense in source{d} Engine?

bzz commented 5 years ago

for the proper handling of aliases by bblfshd.

sorry for butting in, but just for the context - as noted in https://github.com/bblfsh/bblfshd/issues/263 so far ETA for proper aliases in bblfsh is Q22019.

Meanwhile @dpordomingo suggestion https://github.com/src-d/engine/issues/297#issuecomment-473219312 on applying public enry API LanguageByAliasMap to the results of protocol.SupportedVersion before comparing it to enry.GetLanguage output inside the engine product sounds very safe and reasonable feature.

I would argue that this is different from what has been discussed in this thread before - it's not a workaround specific for C# and it should be working the same way and not require a change later, after introduction of proper aliases support on bblfsh side.

carlosms commented 5 years ago

So my question is:

Does the filter makes sense in source{d} Engine?

  • If we want to avoid sending files to bblfsh if we already know that there is no driver for it -> then yes: it makes sense, and we could enhance it as I explained.
  • If we don't care about sending files to bblfsh, we can drop it, and let bblfsh say that it could not parse it.

We didn't have this check on the installed drivers, then we added it after the discussion on https://github.com/src-d/engine/pull/162, creating https://github.com/src-d/engine/issues/163 & https://github.com/src-d/engine/pull/186. So yes, the only reason it's there is to show a nicer error message in case the user requests to parse a file for which we don't have a driver. Another possibility discussed in #186 was to send the parse request and then check the error returned, but that required support from https://github.com/bblfsh/client-go/issues/106.

creachadair commented 5 years ago

From the user's perspective, it'd be nice if we can avoid doing extra work for queries we know can't succeed. Once we have sorted out the mapping correctly, I think that's a reasonable tactic. But early filtering does force a coupling between the two APIs that—in retrospect—I suppose we didn't think through clearly enough.

In the meantime: I'd like to make sure the user gets an "understandable" result even if we have to do a little extra work (e.g., attempt to parse a file that a human might know won't work). The question for the UI, though, is how to tell the difference between "this failed because I couldn't find a driver for it" (which is what the pre-filter is a proxy for) and "this failed because the driver reported an error".

I should know but do not: Do we already use canonical error codes to distinguish cases like this, which could inform the UI how to give feedback to the user when a parse fails?

marnovo commented 5 years ago

Workaround from v0.12 release notes:

Known Issues

297: srcd parse does not detect the language automatically for C#, C++, or bash files. For these languages you will need to set --lang manually. For example:

$ srcd parse uast file.cs --lang csharp
$ srcd parse uast file.cpp --lang cpp
$ srcd parse uast file.bash --lang bash