microsoft / multilspy

multilspy is an LSP client library in Python intended to be used to build applications around language servers.
https://www.microsoft.com/en-us/research/publication/guiding-language-models-of-code-with-global-context-using-monitors/
MIT License
39 stars 7 forks

Some tips for adding new languages ? #5

Open mrT23 opened 1 week ago

mrT23 commented 1 week ago

Hi, and thanks for the excellent repo.

Can you share some tips on the process of adding a new language?

For example, if I want to add support for JavaScript, where should I start?

LakshyAAAgrawal commented 1 week ago

Steps to add a new language to multilspy:

  1. Identify a target language server. Several language servers are available for most languages, so determine which one is most appropriate for your use case (system/architecture support, features supported, performance, etc.). You could use https://langserver.org/ or https://microsoft.github.io/language-server-protocol/implementors/servers/ to find the target.
  2. Create a copy of one of the current language server modules in multilspy (under https://github.com/microsoft/multilspy/tree/main/src/multilspy/language_servers) and rename it to "". I would recommend starting with a copy of https://github.com/microsoft/multilspy/tree/main/src/multilspy/language_servers/rust_analyzer as that should be easier to read.
  3. Update /.py file: Specifically
    1. Update the "setup_runtime_dependencies" method to ensure that the environment is set up for executing the language server binary. You can download/install the binaries in this method. Please store all URLs in the file /runtime_dependencies.json.
    2. Update the init method to set the path of the runtime binary that will be launched.
  4. Update src/multilspy/language_server.py:init method to import your language server when the language is
  5. At this point, you can try to launch the server and perform a request. The currently supported set of lsp requests are implemented in https://github.com/microsoft/multilspy/blob/main/src/multilspy/language_server.py. You can add support for a specific feature if not already present.
  6. Write a few unit tests for the newly added language in https://github.com/microsoft/multilspy/tree/main/tests/multilspy.
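
Step 3.1 could look roughly like the sketch below: pick the entry in runtime_dependencies.json that matches the current platform before downloading anything. The JSON schema (`runtimeDependencies`, `platformId`, `url`, `binaryName`), the example URLs, and the helper names are all hypothetical and are not taken from multilspy; check an existing module such as rust_analyzer for the real layout.

```python
import json
import platform

# Hypothetical contents of a runtime_dependencies.json file. The keys and
# URLs are illustrative assumptions, not multilspy's real schema.
EXAMPLE_DEPENDENCIES = json.loads("""
{
  "runtimeDependencies": [
    {"platformId": "linux-x64", "url": "https://example.invalid/server-linux-x64.zip", "binaryName": "server"},
    {"platformId": "osx-arm64", "url": "https://example.invalid/server-osx-arm64.zip", "binaryName": "server"},
    {"platformId": "win-x64", "url": "https://example.invalid/server-win-x64.zip", "binaryName": "server.exe"}
  ]
}
""")


def current_platform_id(system=None, machine=None):
    """Map platform.system()/platform.machine() to an id such as 'linux-x64'."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    os_part = {"linux": "linux", "darwin": "osx", "windows": "win"}.get(system)
    arch_part = {"x86_64": "x64", "amd64": "x64", "arm64": "arm64", "aarch64": "arm64"}.get(machine)
    if os_part is None or arch_part is None:
        raise RuntimeError(f"Unsupported platform: {system}/{machine}")
    return f"{os_part}-{arch_part}"


def select_dependency(dependencies, platform_id):
    """Return the single dependency entry for platform_id, or raise."""
    matches = [d for d in dependencies["runtimeDependencies"] if d["platformId"] == platform_id]
    if len(matches) != 1:
        raise RuntimeError(f"Expected exactly one dependency for {platform_id}, found {len(matches)}")
    return matches[0]
```

Failing early on an unknown platform keeps the error close to its cause, instead of surfacing later as a language server binary that cannot be launched.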

At step 5, you may need the complete server logs to debug. You can pass a logger to the init method for this purpose.
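
If the logger you pass ends up writing through Python's standard logging module (an assumption worth checking against multilspy_logger.py), a file handler at DEBUG level is one way to capture everything. The function name and the default logger name "multilspy" below are hypothetical.

```python
import logging


def make_debug_file_logger(log_path, name="multilspy"):
    """Return a stdlib logger that writes DEBUG-and-above records to log_path.

    The default logger name "multilspy" is an assumption for illustration.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Writing to a file rather than the console keeps long request/response traces out of the way while still letting you search them afterwards.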

Please feel free to reach out to me if you face any issues.

mrT23 commented 1 week ago

Thanks a lot for the feedback! Will try.

imanewman commented 1 week ago

@mrT23 If you end up getting JavaScript support working on your own fork, I'd love to see it, as I will be needing the same in the coming weeks!

@LakshyAAAgrawal And thank you for the excellent repo, it has saved me a lot of time with a system that requires accurately parsing code references.

LakshyAAAgrawal commented 1 week ago

One of my primary intentions with this repository was for it to be a reference for the community on implementing and using clients for various language servers. While most language server clients target an IDE like VS Code, multilspy is intended to be a repository for programmatic uses of language servers, as opposed to user-facing use cases like those in VS Code.

I would be really glad if you could contribute your implementation of a js client!

themichaelusa commented 1 week ago

@LakshyAAAgrawal I ended up implementing a JS/TS client with the typescript-language-server package about a month ago. It's actually being used in production now, but I definitely didn't follow your procedures haha. Will follow your guide when we implement Golang and PHP next!

@mrT23 @imanewman This is our implementation, please lmk if you find this useful for your implementation(s) in any way. Should work out of the box, but you'll need NPM + Typescript + the language server installed locally.

LakshyAAAgrawal commented 1 week ago

@themichaelusa I checked out the repository, and it looks well implemented!

Would it be okay with you and your team to create a PR and add it to multilspy so that the wider community could make use of it?

themichaelusa commented 1 week ago

Yes absolutely @LakshyAAAgrawal! Our intention was always to merge our work into multilspy. It has served us very well so far and we'd like to give back to the community. I will open it later today once I fill out the Microsoft CLA.

themichaelusa commented 1 week ago

6 Here's my branch, will work on tidying it up today. Should be ready very soon.

mrT23 commented 1 week ago

@themichaelusa Thanks for sharing the code.

I cloned the draft PR and did some QA. On some repos it worked; on others it got stuck. I saw this happen with our internal code, but also with public repos. If I am doing something wrong, do let me know.

Here is a reproducible example:

1) Clone https://github.com/gvergnaud/ts-pattern

2) Run:

import asyncio

from multilspy import SyncLanguageServer, LanguageServer
from multilspy.multilspy_config import MultilspyConfig
from multilspy.multilspy_logger import MultilspyLogger

async def run():
    config = MultilspyConfig.from_dict({"code_language": "typescript"}) # Also supports "python", "rust", "csharp"
    logger = MultilspyLogger()

    repo = "local_path_to_replace/ts-pattern"
    rel_file =  "src/match.ts"

    print("Starting server...")
    lsp = LanguageServer.create(config, logger, repo)
    print("Server started!")

    async with lsp.start_server():
        print("request_document_symbols...")
        result = await lsp.request_document_symbols(
            rel_file, # Filename of location where request is being made
        )
        print("request_document_symbols done!")

        print(result)

if __name__ == '__main__':
    asyncio.run(run())

It gets stuck here:

[image]

Interestingly, when I run it on only the src directory:

    repo = "local_path/ts-pattern/src"
    rel_file =  "match.ts"

it works, so I think the language server might have problems with repos that are too large, or with some specific files or extensions present somewhere in the repo.

themichaelusa commented 1 week ago

@mrT23 Yeah, if node_modules isn't being included when you index ts-pattern, it may well be related to that event loop fix that I reverted. Or you aren't filtering out non-JS/TS files.

But I'm not sure of the exact solution yet, it's pretty late for me right now. Will investigate tomorrow morning. I think I have some code that can help you with that filtering step too.

themichaelusa commented 6 days ago

@mrT23 Haven't had time to test your code out yet - mainly because I've actually been using the synchronous LSP client in multilspy, so kinda unfamiliar with the async behavior.

But here are some helper functions I wrote that essentially prune out all irrelevant file types for a given repo. They also work on multilingual monorepos, splitting them into separate trees per language group (e.g. Python) that still preserve their structure so multilspy can still work over them.

EXT_TO_LANGUAGE_DATA is essentially just a dictionary with this form.

{
    ".py": {
        "is_code": true,
        "language_mode": "python"
    },
    ...
}
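
import os
import shutil
import uuid

# EXT_TO_LANGUAGE_DATA is assumed to be loaded elsewhere (e.g. parsed from
# JSON of the form shown above).

# Parent directory for the per-language copies, as described in the comment
# inside copy_and_split_root_by_language_group.
TMP_DIR_PARENT = "/tmp/callgraph_root_copies"

# Maps a detected language to the language name used for the LSP client.
LANGUAGE_TO_LSP_LANGUAGE_MAP = {
    "python": "python",
    "javascript": "typescript",
    "typescript": "typescript",
    "java": "java",
    "rust": "rust",
    "csharp": "csharp"
}
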
def get_all_paths_from_root_relative(root_path):
    abs_paths, rel_paths = [], []
    for root, dirs, files in os.walk(root_path):
        for file in files:
            abs_path = os.path.join(root, file)
            relpath = os.path.relpath(abs_path, root_path)
            abs_paths.append(abs_path)
            rel_paths.append(relpath)
    return abs_paths, rel_paths

def get_language_from_ext(path):
    root, ext = os.path.splitext(path)
    language_info = EXT_TO_LANGUAGE_DATA.get(ext, {})
    is_code = language_info.get("is_code", False)
    language = language_info.get("language_mode", None)
    lsp_language = LANGUAGE_TO_LSP_LANGUAGE_MAP.get(language, None)
    return lsp_language, language, is_code

def copy_and_split_root_by_language_group(abs_root_path):
    abs_paths, _ = get_all_paths_from_root_relative(abs_root_path)
    languages = set()

    for p in abs_paths:
        lsp_language, language, is_code = get_language_from_ext(p)
        if is_code:
            languages.add(lsp_language)
    languages = [l for l in languages if l]
    num_root_copies = len(languages)

    copy_paths = []
    # copy the root directory num_root_copies times into /tmp/callgraph_root_copies/{random_hash}
    for _ in range(num_root_copies):
        random_hash = str(uuid.uuid4()).split('-')[0]
        root_copy_path = os.path.join(TMP_DIR_PARENT, random_hash)
        shutil.copytree(abs_root_path, root_copy_path)
        copy_paths.append(root_copy_path)

    # In each copy, delete every file that is not code in that copy's language.
    for copy_path, language in zip(copy_paths, languages):
        for root, dirs, files in os.walk(copy_path):
            for file in files:
                file_language, _, is_code = get_language_from_ext(file)
                if not (is_code and file_language == language):
                    os.remove(os.path.join(root, file))

    # Drop copies that contain only directories and no files.
    nonempty_copy_paths = []
    for copy_path, language in zip(copy_paths, languages):
        has_files = any(files for _, _, files in os.walk(copy_path))
        if not has_files:
            print(f"copy_path: {copy_path} is empty")
            shutil.rmtree(copy_path)
            continue
        nonempty_copy_paths.append((copy_path, language))

    return nonempty_copy_paths

Hope this helps you for now. I'll look at your example tonight. I'll also discuss with @LakshyAAAgrawal whether this automated approach for filtering out language-specific subtrees is appropriate for inclusion in multilspy in a follow-up PR.
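
The helpers above copy the whole tree and then delete what is not needed. A lighter variant, sketched here under my own assumptions, filters during the copy using the `ignore` callback of `shutil.copytree`, so unwanted files and directories such as node_modules are never copied in the first place. `copy_only_extensions` and its `skip_dirs` default are hypothetical names, not part of multilspy.

```python
import os
import shutil


def copy_only_extensions(src_root, dst_root, extensions, skip_dirs=("node_modules", ".git")):
    """Copy src_root to dst_root, keeping only files whose extension is in `extensions`."""
    extensions = {e.lower() for e in extensions}

    def ignore(directory, names):
        # Called by copytree once per directory; returns the names to skip.
        ignored = []
        for name in names:
            full_path = os.path.join(directory, name)
            if os.path.isdir(full_path):
                if name in skip_dirs:
                    ignored.append(name)
            elif os.path.splitext(name)[1].lower() not in extensions:
                ignored.append(name)
        return ignored

    shutil.copytree(src_root, dst_root, ignore=ignore)
    return dst_root
```

Directories that end up with no matching files are still created as empty directories, so the original structure is preserved in the copy.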

mrT23 commented 5 days ago

The problem occurs with both sync and async.

For each repo, copying all the relevant files to a cloned repo seems suboptimal to me. I think we should be able to tell the LSP which file types to take, instead of cloning by default.

themichaelusa commented 5 days ago

I generally agree that cloning isn't perfect. Luckily this is code that runs outside of the package.

"I think we should be able to tell the LSP which file types to take, instead of cloning by default."

I do wonder if this is possible in initialize_params.json, e.g. by programmatically excluding certain subpaths or file extensions.

Will look into it after this PR is merged.