sum244 / APICAD-artifact

MIT License

Document Parsing - Add semantic parsing to the language processing. #3

Open jconn0 opened 6 months ago

jconn0 commented 6 months ago

Add semantic parsing to provide more context to the document parsing.

This could be done using HanLP, the existing NLP library, or by converting the project to a different library.

jconn0 commented 6 months ago

APICAD currently uses the pretrained model "UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE", which supports both semantic role labeling (SRL) and semantic dependency parsing (SDP).

Of the two, semantic role labeling is better at identifying actions, along with the conditional statements and causal relations attached to them. This fits our use case better, since we are trying to identify an API's actions and outputs in order to detect errors.
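To make that concrete, here is a minimal sketch of how SRL-style output could be turned into an action/output summary for an API sentence. The sample frame is hand-written for illustration and was not produced by the model. The (text, role) pair layout, the `summarize_frame` helper, and the choice of PropBank-style labels (ARG0, ARG1, ARGM-*) are assumptions, not APICAD's actual code.

```python
# Hypothetical SRL frame for the sentence:
#   "If the pointer is NULL, the function returns -1."
# One frame = one predicate plus its arguments, as (text, role) pairs.
# Hand-written for illustration; not real model output.
SAMPLE_FRAMES = [
    [
        ("If the pointer is NULL", "ARGM-ADV"),
        ("the function", "ARG0"),
        ("returns", "PRED"),
        ("-1", "ARG1"),
    ],
]


def summarize_frame(frame):
    """Group one predicate-argument frame into action, agent, output, modifiers."""
    summary = {"action": None, "agent": None, "output": None, "modifiers": []}
    for text, role in frame:
        if role == "PRED":
            summary["action"] = text
        elif role == "ARG0":
            summary["agent"] = text
        elif role == "ARG1":
            summary["output"] = text
        elif role.startswith("ARGM-"):
            # Modifier roles carry conditions, causes, and similar context.
            summary["modifiers"].append((role, text))
    return summary


if __name__ == "__main__":
    for frame in SAMPLE_FRAMES:
        print(summarize_frame(frame))
```

The point of the sketch is that each predicate comes back with its arguments already labeled, so the action, its output, and the condition under which it applies can be read off directly.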

jconn0 commented 6 months ago

The model supports two different types of SRL: BIO (beginning, inside, outside) and ranking.

BIO SRL is better at extracting role-containing phrases from structured texts with clearly distinguished roles.

Rank SRL is better at handling ambiguity since it considers the broader context within a sentence.

Assuming we are using well-written, clearly structured API documentation, BIO SRL should be more effective.
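For reference, here is a rough sketch of what the BIO scheme means in practice: each token is tagged as beginning (B-), inside (I-), or outside (O) a role, and the tags are then collapsed into role-labeled phrases. The `bio_to_spans` helper, the example sentence, and its tags are all hypothetical and hand-written; this only illustrates the tagging scheme and is not how HanLP decodes its output internally.

```python
def bio_to_spans(tokens, tags):
    """Collapse per-token BIO tags into (role, phrase) spans."""
    spans = []
    current_role, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span begins; close any span that is still open.
            if current_role is not None:
                spans.append((current_role, " ".join(current_tokens)))
            current_role, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_role == tag[2:]:
            # Continue the open span with the same role.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open span;
            # the token itself is not added to a span.
            if current_role is not None:
                spans.append((current_role, " ".join(current_tokens)))
            current_role, current_tokens = None, []
    if current_role is not None:
        spans.append((current_role, " ".join(current_tokens)))
    return spans


# Hand-written example tags, not model output.
tokens = ["The", "function", "returns", "NULL", "on", "failure", "."]
tags = ["B-ARG0", "I-ARG0", "B-PRED", "B-ARG1", "B-ARGM-ADV", "I-ARGM-ADV", "O"]

if __name__ == "__main__":
    print(bio_to_spans(tokens, tags))
```

Because every token gets exactly one tag, this scheme works best when roles do not overlap and have clear boundaries, which is the situation well-structured API documentation should give us.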

jconn0 commented 5 months ago

Attempting to build the model on the provided file for Curl produces errors indicating problems with the bitcode, which prevent the evaluation from functioning correctly. The evaluation runs fine for the project case files, which are provided pre-built.

jconn0 commented 5 months ago

We'll use the basic example containing the pre-built bitcode files to get around this.

To reanalyze the documents with the changes to the HanLP model, run the following (specifying the target prevents it from running on all three projects):

    apicad doc-analyze --target glibc

After this, to perform the analysis, run:

    apicad analyze --target SSL_get_ex_data EVP_MD_CTX_new

To run the detection with document specifications, run:

    apicad detect --enable-doc

jconn0 commented 5 months ago

My computer was unable to handle running the document analysis. The process would just output "killed" to the terminal, and watching top in a second terminal confirmed that the machine was running out of resources.

I've reinstalled everything on a GitHub Codespace, which is able to run all of it.