oisee / zvdb

ABAP Vector Search Library
MIT License

class `zcl_oai_01_dotenv` ? #2

Closed larshp closed 10 months ago

larshp commented 11 months ago

Hi, I'm trying to find the class zcl_oai_01_dotenv, which is referenced in the source code. I also cannot find it on GitHub: https://github.com/search?q=zcl_oai_01_dotenv&type=code

Are there any additional dependencies besides https://github.com/microsoft/aisdkforsapabap?

[screenshot]

looks like similar problems with the following, but I didn't investigate:

larshp commented 11 months ago

https://github.com/bizhuka/eui ?

oisee commented 11 months ago

These classes were referenced in the zcl_vdb_001_embedding class (which depends on the "lite" azure-open-ai-sdk); zcl_vdb_002_embedding_full is the one used in the demo programs and in the rest of the code.

zcl_eui_alv by @bizhuka has been replaced with the standard SALV.

Unnecessary dependencies have now been dropped from the package.

thank you, Lars.

larshp commented 11 months ago

cool

looks like there is a syntax error in zvdb_002_demo_03_reindex

was zcl_vdb_002_embedding renamed to zcl_vdb_002_embedding_full?

[screenshot]

larshp commented 11 months ago

you can consider installing https://github.com/marketplace/abaplint, it's free,

plus adding a configuration file, abaplint.jsonc:

{
  "global": {
    "files": "/src/**/*.*"
  },
  "dependencies": [
    {
      "url": "https://github.com/microsoft/aisdkforsapabap",
      "folder": "/deps",
      "files": "/src/**/*.*"
    }
  ],
  "syntax": {
    "version": "v755",
    "errorNamespace": "^(Z|Y|LCL_|TY_|LIF_)"
  },
  "rules": {
    "begin_end_names": true,
    "check_ddic": true,
    "check_include": true,
    "check_syntax": true,
    "global_class": true,
    "implement_methods": true,
    "method_implemented_twice": true,
    "parser_error": true,
    "superclass_final": true,
    "unknown_types": true,
    "xml_consistency": true
  }
}

then there will be feedback on each commit or pull request in the repository
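As a side note on the config above: a quick way to sanity-check the `errorNamespace` setting is to run a few object names through the regex. If I understand it correctly, abaplint only reports unresolved references for objects matching this pattern and treats everything else as standard SAP objects. The Python sketch below is illustrative only (abaplint itself is TypeScript), using object names from this thread:

```python
import re

# The errorNamespace from the abaplint.jsonc above. As far as I understand,
# abaplint only reports unresolved references for objects matching this
# pattern; other names are assumed to be standard SAP objects.
ERROR_NAMESPACE = re.compile(r"^(Z|Y|LCL_|TY_|LIF_)")

names = [
    "ZCL_VDB_002_EMBEDDING_FULL",  # own class -> checked
    "ZVDB_002_DEMO_03_REINDEX",    # own report -> checked
    "CL_SALV_TABLE",               # standard SALV class -> ignored
    "LCL_HELPER",                  # local class -> checked
]

for name in names:
    status = "checked" if ERROR_NAMESPACE.match(name) else "ignored"
    print(f"{name}: {status}")
```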

oisee commented 11 months ago

Thank you, Lars. I came here to comment that utilizing your CI/CD for an automated build of the package would be great, and all of those issues could have been caught before even reaching the GitHub repo =) And you have already suggested that, tnx!

Will dig deeper.

oisee commented 11 months ago

@larshp yes, zcl_vdb_002_embedding_full has the same interface as zcl_vdb_002_embedding (but utilizes the "full" https://github.com/microsoft/aisdkforsapabap SDK, and not the "lite" one).

In this code fragment, zcl_vdb_002_embedding can be renamed to zcl_vdb_002_embedding_full.

oisee commented 11 months ago

> In this code fragment, zcl_vdb_002_embedding can be renamed to zcl_vdb_002_embedding_full.

Updated in the latest commit.

larshp commented 11 months ago

zero issues, thanks

closing

oisee commented 11 months ago

> you can consider installing https://github.com/marketplace/abaplint, it's free, plus adding a configuration file, abaplint.jsonc

Done.

> then there will be feedback on each commit or pull request in the repository

I'm not sure I got it right; where do I find this feedback?

oisee commented 11 months ago

> you can consider installing https://github.com/marketplace/abaplint, it's free

oh, it works locally =) perfect =)

all passed =)

oisee commented 11 months ago

@larshp I wonder if it would be simple to build a ctags generator on top of abaplint, producing tags files in the https://github.com/universal-ctags/ctags format, to utilize here: https://github.com/oisee/autodoc (to inject AST-related data for doc generation)
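To make the idea concrete, here is a rough Python sketch of what such a bridge might emit: take symbol info (names, files, lines, kinds), which abaplint's parser could supply, and write it out in the universal-ctags tags-file format. The symbol dicts below are invented for illustration; abaplint's actual API differs.

```python
# Hypothetical symbol list, as an abaplint-based extractor might provide it.
# Names, files, and line numbers are invented for illustration.
symbols = [
    {"name": "zcl_vdb_002_embedding_full",
     "file": "src/zcl_vdb_002_embedding_full.clas.abap",
     "line": 1, "kind": "c"},   # c = class
    {"name": "get_embedding",
     "file": "src/zcl_vdb_002_embedding_full.clas.abap",
     "line": 42, "kind": "m"},  # m = method
]

def to_tags(symbols):
    # ctags line format: {tagname}\t{file}\t{ex command};"\t{extension fields}
    # Entries must be sorted by tag name for binary search in editors.
    lines = ["!_TAG_FILE_FORMAT\t2\t/extended format/"]
    for s in sorted(symbols, key=lambda s: s["name"]):
        lines.append(f'{s["name"]}\t{s["file"]}\t{s["line"]};"\tkind:{s["kind"]}')
    return "\n".join(lines)

print(to_tags(symbols))
```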

larshp commented 11 months ago

yea, guess it's possible, suggest opening an issue in the abaplint repository

Technically I think it is more CSTs than ASTs, anyhow. Also, some LSP implementation is done in abaplint to support the vscode extension.

oisee commented 10 months ago

> yea, guess it's possible, suggest opening an issue in the abaplint repository
>
> Technically I think it is more CSTs than ASTs, anyhow. Also, some LSP implementation is done in abaplint to support the vscode extension.

Hi Lars, sorry for the late response (was/still am on medical leave ^_^).

TL;DR: To have an LLM fine-tuned for ABAP, we need a high-quality dataset of pairs: prompt -> ABAP code unit. Building that manually is almost impossible. So the simple idea is to just reverse the process: generate a high-quality prompt for each unit of existing "good-enough" open-source (or owned proprietary) ABAP code. With this ABAP Code Corpus we can fine-tune any model.
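The reversal described in the TL;DR can be sketched in a few lines. Everything below is a toy stand-in: describe_unit would in reality combine AST/CST key points with an LLM-written description, and the corpus entry is invented.

```python
def describe_unit(name: str, kind: str, signature: str) -> str:
    # Stand-in for the real step, which would combine AST/CST key points
    # with an LLM-generated natural-language description of the code unit.
    return f"Write an ABAP {kind} named {name} with signature: {signature}"

# Invented corpus entry for illustration.
corpus = [
    {"name": "zvdb_002_demo_03_reindex", "kind": "report",
     "signature": "no parameters",
     "code": "REPORT zvdb_002_demo_03_reindex.\n\" ...\n"},
]

# Pair each generated prompt with its code unit: the fine-tuning dataset.
pairs = [(describe_unit(u["name"], u["kind"], u["signature"]), u["code"])
         for u in corpus]

for prompt, code in pairs:
    print(prompt)
```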

Following the above discussion on integrating ctags with abaplint for improved doc generation, I wanted to elaborate on how such an integration could serve as a crucial component of the "ABAP Code Corpus" project (which I started earlier this August).

This project is dedicated to enhancing LLMs' comprehension and generation of ABAP code by creating a rich dataset from segmented open-source ABAP (or from your own proprietary code base).

A significant part of this involves generating descriptive prompts based on AST analysis of the code, which are then used to fine-tune the LLM.

The missing piece might be building a ctags generator on top of abaplint. It would enable us to inject detailed AST/CST-related data into our prompts, leading to a much more nuanced and accurate dataset. This, in turn, would substantially elevate the LLM's capability to produce context-aware ABAP code, aiding developers significantly.

Your insight on abaplint's usage of CSTs over ASTs and its LSP implementation has been valuable. It suggests that while this is all on the right path, the addition of ctags could be the key to unlocking even more precise documentation and code generation features.

Incorporating ctags could potentially streamline our workflow as illustrated in the attached Mermaid diagram, ensuring each stage from code segmentation to LLM fine-tuning is informed by the most detailed code structure information possible.

Would you be open to discussing this further, possibly opening an issue in the abaplint repository to explore this integration?

Best regards, Alice

P.S. diagram:

graph TD
    A[ABAP Code Source] -->|Retrieve ABAP Code| B[Code Segmentation]
    B -->|Segment Code into Units| C[Prompt Generation]
    B -->|Extract AST Key Points| D[AST Analysis]
    C -->|Generate Descriptive Prompts| E[Data Pairing]
    D -->|Inject AST Key Points into Prompts| E
    E -->|Pair Prompts with Code Units| F[LLM Fine-Tuning]
    F -->|Fine-Tune LLM with Dataset| G[Improved Prompt Responses]

https://www.youtube.com/watch?v=slgKCBjy2Hg - a short video with a brief overview.

P.P.S. The PoC was completed in August: a vanilla llama2 model was able to produce decent ABAP code after fine-tuning on a relatively small and quite low-quality dataset. I wonder what can be achieved with mistral/zephyr (and of course GPT-3.5/GPT-4 on top of the OpenAI API).

larshp commented 10 months ago

> Would you be open to discussing this further

sure, let's talk when you are back from leave, do you have my email address?

oisee commented 10 months ago

> sure, let's talk when you are back from leave, do you have my email address?

great, tnx! =) I have found one on LinkedIn ^_^