manisnesan / fastchai

Repository capturing deep learning & nlp experiments using fastai & pytorch
Apache License 2.0

Build an LLM/AI summarizer using a selected body of content (documentation and solutions, for example) #47

Open manisnesan opened 1 year ago

manisnesan commented 1 year ago

Expected Outcomes

Outline Steps

manisnesan commented 1 year ago

From https://til.simonwillison.net/gpt3/chatgpt-api

```python
import openai


class ChatBot:
    def __init__(self, system=""):
        self.system = system
        self.messages = []
        if self.system:
            self.messages.append({"role": "system", "content": self.system})

    def __call__(self, message):
        self.messages.append({"role": "user", "content": message})
        result = self.execute()
        self.messages.append({"role": "assistant", "content": result})
        return result

    def execute(self):
        completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=self.messages)
        # Uncomment this to print out token usage each time, e.g.
        # {"completion_tokens": 86, "prompt_tokens": 26, "total_tokens": 112}
        # print(completion.usage)
        return completion.choices[0].message.content


simon = ChatBot("You are a chatbot imitating Simon Willison. Pretend to be Simon.")
simon("Tell me about yourself")
```

manisnesan commented 1 year ago

https://medium.com/@agrofail/summarization-with-hugging-face-and-blurr-1b613265d155

BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings
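For a quick reference-based check along these lines, the bert-score package computes exactly this; a minimal sketch with made-up candidate and reference strings:

```python
from bert_score import score

# Hypothetical candidate/reference pair, purely for illustration.
candidates = ["Uninstall VMware tools to restore GUI login on RHEL 6.2."]
references = ["Removing the VMware tools package fixes the GUI login hang on RHEL 6.2."]

# Precision, recall, and F1 come from token similarities over contextual embeddings.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```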

manisnesan commented 1 year ago

https://towardsdatascience.com/make-a-text-summarizer-with-gpt-3-f0917a07189e

manisnesan commented 1 year ago

screenshot-miro com-2023 04 27-10_29_31

manisnesan commented 1 year ago

https://twitter.com/syedmuzamilm/status/1654051264295055361?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ

Article and YouTube summarizer using LangChain:

```python
import os

from dotenv import load_dotenv
from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import YoutubeLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)


def document_splitter(docs):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=100,
    )
    split_docs = text_splitter.split_documents(docs)
    return split_docs


def get_summary(split_docs):
    # Map-reduce: summarize each chunk, then combine the chunk summaries.
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    summary = chain.run(split_docs)
    return summary


def youtube_video_summariser(url):
    loader = YoutubeLoader.from_youtube_url(url)
    docs = loader.load()
    split_docs = document_splitter(docs)
    summary = get_summary(split_docs)
    return summary


def article_summariser(url):
    loader = WebBaseLoader(url)
    docs = loader.load()
    split_docs = document_splitter(docs)
    summary = get_summary(split_docs)
    return summary
```

manisnesan commented 1 year ago

Example Prompts

manisnesan commented 1 year ago

Prompt engineering may also help you unlock summarization capabilities of GPT-3. The idea here is pretty simple: in our prompt we put in a passage of text that we want to summarize and then add something like "To summarize: " or "TL;DR".

You could try out other prompts besides "TL;DR" like "in a couple words," "to summarize", "to simplify" and similar.

Source: https://wandb.ai/ivangoncharov/GPT-3/reports/Summary-Sentiment-Question-Answering-More-5-Creative-Tips-for-GPT-3-Prompt-Engineering--VmlldzoxODY0Nzky
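A minimal sketch of the "TL;DR" pattern, reusing the openai.ChatCompletion call from the ChatBot snippet above; the passage here is a placeholder:

```python
import openai

passage = "<text to summarize>"  # placeholder

# Append a "TL;DR:" cue so the model continues with a summary of the passage.
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"{passage}\n\nTL;DR:"}],
)
print(completion.choices[0].message.content)
```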

manisnesan commented 1 year ago

Text summarization by fine-tuning T5 or BART: https://wandb.ai/biased-ai/huggingface/reports/Text-Summarization-on-HuggingFace--Vmlldzo3ODA5MjI

manisnesan commented 1 year ago

Discusses the challenges of summarization:

https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper

One of the biggest challenges with summarization, however, is factuality: does the summary reflect accurately what the original document said? There are other characteristics, such as fluency and relevance that are also important, but LLMs are actually pretty good at both of those. Factuality (or its evil twin: hallucination) on the other hand is a known issue with LLMs. And it’s no use being fluent if you’re wrong.

manisnesan commented 1 year ago

Example Solution

✗ head -n 1 ./data/raw/public_solutions_req_fields.jsonl

{"title":"Unable to login to GUI in Redhat Enterprise Linux 6.2 after updating vmware tool","teaser":["- Users are able to login via ssh or console, but not able to login to GUI.- After accepting user name and password, GUI getting hung."],"issue":["- Users are able to login via ssh or console, but not able to login to GUI.- After accepting user name and password, GUI getting hung."],"environment":[" Red Hat Enterprise Linux 6.2 vmware vmware tools"],"rootcause":[" There is some problem with the interaction between vmware tools and RHEL6. "],"resolution":["* Uninstall the vmware tools and try again."],"diagnostic":["check ~/.xsession-errors logs ~ "Xlib: extension "VMWARE_CTRL" missing on display ":0.0"." ~"]}

Different response variations from the prompt

✗ llm "Summarize the provided json in 4 sentences or less concatenating the title, rootcause and issue. This will be used as a snippet for search results. $(head -n 1 ./data/raw/public_solutions_req_fields.jsonl)" 

Unable to login to GUI in Red Hat Enterprise Linux 6.2 after updating VMware tool. Users can login via ssh or console, but not able to login to GUI. GUI hangs after accepting username and password. The interaction between VMWare tools and RHEL6 causes the problem. To resolve, uninstall the VMWare tools and try again.

✗ llm "Summarize the provided json in 4 sentences or less. This will be used as a snippet for search results. $(head -n 1 ./data/raw/public_solutions_req_fields.jsonl)" 

Some users are unable to login to the GUI in Red Hat Enterprise Linux 6.2 after updating VMware tools. They can login via SSH or console, but the GUI hangs after accepting their username and password. This issue seems to be caused by a problem with the interaction between VMware tools and RHEL6. To resolve it, users should uninstall the VMware tools and try again. One possible diagnostic step is to check the ~/.xsession-errors logs for the error message "Xlib: extension ' VMWARE_CTRL' missing on display ':0.0'."

google

screenshot-www bing com-2023 09 13-11_44_31

manisnesan commented 1 year ago

Added an instruction to ensure the summary is consistent with the original content, based on the paper ChatGPT as a Factual Inconsistency Evaluator for Text Summarization.

✗ llm "You are a summary generator. Summarize the json in 200 characters or less. The summary should be consistent with the json. Note that consistency means all information in summary is supported by json. This will be used as a snippet for search results. $(head -n 1 ./data/raw/public_solutions_req_fields.jsonl)" 

Unable to login to GUI in Redhat Enterprise Linux 6.2 after updating vmware tool. Users can login via ssh or console, but GUI hangs after entering username and password. The issue may be due to a problem with the interaction between vmware tools and RHEL6. To resolve, uninstall vmware tools and try again. Check ~/.xsession-errors logs for "Xlib: extension "VMWARE_CTRL" missing on display ":0.0"."

manisnesan commented 1 year ago

SummEval - Summarization Evaluation

SummEval is a project that aims to provide resources for the evaluation of summarization systems. It includes summaries generated by different models, human annotations of focus and coverage, and a toolkit for computing various metrics and correlations. The project is a collaboration between Yale LILY Lab and Salesforce Research, and the paper was published in the Findings of ACL 2021 ¹. A multilingual follow-up, Multi_SummEval, covers eight languages: English, Indonesian, French, Turkish, Chinese, Russian, German, and Spanish. The GitHub repository contains the data, code, and instructions for using SummEval.

Source: Conversation with Bing, 2023-09-18 (1) GitHub - Yale-LILY/SummEval: Resources for the "SummEval: Re-evaluating .... https://github.com/Yale-LILY/SummEval. (2) GitHub - fajri91/Multi_SummEval: Evaluating the Efficacy of .... https://github.com/fajri91/Multi_SummEval. (3) GitHub - YizhuLiu/summeval. https://github.com/YizhuLiu/summeval.

manisnesan commented 1 year ago

Similar Problem on News Summarization

https://www.arxiv-vanity.com/papers/2301.13848/

This page is a research paper that investigates the summarization capabilities of large language models (LLMs). The main points are:

Source: Conversation with Bing, 11/29/2023.

manisnesan commented 1 year ago

Another way to evaluate is to treat it as entailment inference.

From https://arxiv.org/pdf/2303.15621.pdf

We provide ChatGPT with the question including the source document and the corresponding generated summary and ask it to answer yes or no to infer the consistency between the source document and the corresponding generated summary, and then we collect the decisions from the outputs and aggregate the results
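A rough sketch of that setup; the prompt wording and the gpt-3.5-turbo model choice are assumptions, and the yes/no decisions would be collected and aggregated over many (document, summary) pairs:

```python
import openai

document = "<source document>"   # placeholder
summary = "<generated summary>"  # placeholder

prompt = (
    "Decide if the following summary is consistent with the corresponding article. "
    "Note that consistency means all information in the summary is supported by the article.\n\n"
    f"Article: {document}\n\nSummary: {summary}\n\nAnswer (yes or no):"
)
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
# Collect this boolean across the dataset and aggregate, e.g., fraction judged consistent.
is_consistent = completion.choices[0].message.content.strip().lower().startswith("yes")
```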

manisnesan commented 11 months ago

LLM-based summarization: A case study of human, Llama 2 70b and GPT-4 summarization quality

Summary: The search results provide a detailed case study of the quality of summarization by Llama 2 70b and GPT-4 compared to human legislative interns. The study involved a blind test of a legal expert who scored the summaries generated by the different authors. The results showed that GPT-4 outperformed both Llama 2 70b and human interns in summarizing legislative bills. The study also discussed the insights gained from GPT-4's superior performance, which were used to improve the Llama 2 70b prompt to enhance its quality. The study highlighted the challenges and potential biases in determining the authorship of the summaries, as well as the specific issues with the Llama 2 70b prompts. It also presented the modified prompt and the positive impact it had on the quality of the Llama 2 70b summaries. The study concluded that while GPT-4 outperformed Llama 2 70b and human interns, using GPT-4 to direct the improvement of Llama 2 70b's prompts led to an enhancement in the latter's quality.

The study provides valuable insights into the comparative performance of AI models and human interns in summarizing legislative bills, as well as the potential for using superior models to improve the performance of others.

The last question is: "Given that Llama 2 70b costs approximately 3% of what GPT-4 costs, is there a way we can improve the performance of Llama 2 70b?"

Citations: [1] https://www.anyscale.com/blog/llm-based-summarization-a-case-study-of-human-llama-2-70b-and-gpt-4-summarization-quality#:~:text=TL%3BDR%3A%20In%20a%20blind%20test%20of%20a%20le

manisnesan commented 10 months ago

image

Pragmatic approach for iteratively adding sophistication

manisnesan commented 9 months ago

From rasbt's post - Flan-T5 is a great go-to model for text classification.

Tiny Titans - can smaller LLMs punch above their weight for meeting summarization?

manisnesan commented 9 months ago

Example prompt: "Generate a concise, accurate, and relevant summary of the following KCS solution, maintaining fluency and consistency with the original content."

manisnesan commented 9 months ago

https://octo.ai/reduce-llm-costs-for-text-summarization-by-over-50-percent-with-mixtral-on-octoai

manisnesan commented 9 months ago

How to evaluate a summarization task - OpenAI Cookbook: https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization

REVISED SUMMARY

This article¹ explains how to evaluate a summarization task using different methods, such as:

Source: Conversation with Bing, 2/14/2024 (1) How to evaluate a summarization task | OpenAI Cookbook. https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization. (2) A Step-By-Step Guide to Evaluating an LLM Text Summarization Task .... https://www.confident-ai.com/blog/a-step-by-step-guide-to-evaluating-an-llm-text-summarization-task. (3) Summarizing Worksheets & Activities | Reading Comprehension. https://www.ereadingworksheets.com/free-reading-worksheets/reading-comprehension-worksheets/summarizing-worksheets-and-activities/.

User Summary

Automated evaluation using ROUGE and BERTScore requires a reference summary. Limitation: it misses nuanced aspects such as fluency and coherence, and human labeling of references is a bottleneck.
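A minimal sketch of the reference-based route using the Hugging Face evaluate library (BERTScore can be computed the same way, as in the earlier snippet); the prediction and reference strings are placeholders:

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["<model generated summary>"]          # placeholder
references = ["<human written reference summary>"]  # placeholder

# Returns rouge1/rouge2/rougeL/rougeLsum F-measures.
print(rouge.compute(predictions=predictions, references=references))
```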

LLM-based evaluation, inspired by G-Eval, takes a reference-free approach. The bottlenecks are the context window and a possible bias toward LLM-generated snippets, but it can be done at scale.
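A sketch of the reference-free, G-Eval-style variant: ask an LLM to grade one dimension on a fixed scale; the rubric wording and the gpt-4 model name are assumptions:

```python
import openai

document = "<source document>"   # placeholder
summary = "<generated summary>"  # placeholder

prompt = (
    "You will be given one summary written for a document. "
    "Rate the summary on consistency from 1 to 5: are all statements in the "
    "summary supported by the source document?\n\n"
    f"Source document:\n{document}\n\nSummary:\n{summary}\n\n"
    "Reply with a single number from 1 to 5."
)
completion = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
# Parse the leading digit as the consistency score for this (document, summary) pair.
consistency = int(completion.choices[0].message.content.strip()[0])
```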

manisnesan commented 9 months ago

https://github.com/ibm-ecosystem-engineering/SuperKnowa - Build enterprise RAG (Retrieval Augmented Generation) pipelines to tackle various generative AI use cases with LLMs by simply plugging components together like Lego pieces. This repo is intended for IBM Ecosystem partners.

Guide to fine-tune Flan-T5 - https://www.datacamp.com/tutorial/flan-t5-tutorial
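Before going the fine-tuning route, a quick way to try Flan-T5 for summarization with transformers; the flan-t5-base checkpoint and the prompt wording here are assumptions, not the tutorial's exact setup:

```python
from transformers import pipeline

# Flan-T5 is a seq2seq model, so use the text2text-generation pipeline.
summarizer = pipeline("text2text-generation", model="google/flan-t5-base")

document = "<KCS solution or article text>"  # placeholder
result = summarizer(f"summarize: {document}", max_new_tokens=50)
print(result[0]["generated_text"])
```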

https://github.com/ibm-ecosystem-engineering/SuperKnowa/blob/main/Enterprise%20LLM%20Use%20Cases/3.%20Summarization/payload/summary-payload.json

User Prompt

```json
{
  "model_id": "google/flan-t5-xxl",
  "inputs": [],
  "parameters": {
    "decoding_method": "greedy",
    "temperature": 0.7,
    "top_p": 1,
    "top_k": 50,
    "min_new_tokens": 10,
    "max_new_tokens": 50
  }
}
```

System Generated


manisnesan commented 9 months ago

https://wandb.ai/mostafaibrahim17/ml-articles/reports/Compressing-the-Story-The-Magic-of-Text-Summarization--VmlldzozNTYxMjc2

manisnesan commented 9 months ago

Critical Dimensions for Snippet Generation's Data Annotation & Evaluation

manisnesan commented 9 months ago

FActScore

image

manisnesan commented 9 months ago

Uptrain - LLM evaluation tool from this post

image

manisnesan commented 9 months ago

My benchmark for LLMs post, started by Andrej Karpathy - Post

image

manisnesan commented 9 months ago

Chain of Density produces denser, more human-preferred summaries than vanilla GPT-4.

See Latent Space - LLM Paper Club.

manisnesan commented 9 months ago

https://huggingface.co/ibm/labradorite-13b

Citations: [1] https://huggingface.co/ibm/labradorite-13b [2] https://huggingface.co/PygmalionAI/metharme-13b [3] https://huggingface.co/dfurman/LLaMA-13B

image image

manisnesan commented 8 months ago

https://x.com/eugeneyan/status/1764066697454182592?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ

What are some good resources on LM evals for downstream tasks (classification, summarization, translation)? Some I found:

• HELM: arxiv.org/abs/2211.09110
• NLG Systems: arxiv.org/abs/2008.12009
• LLMs Evals: arxiv.org/abs/2307.03109
• SummEval: arxiv.org/abs/2007.12626
• Benchmarking LLMs for summarization: arxiv.org/abs/2301.13848
• MachineTranslate: machinetranslate.org/metrics
• Evaluating ChatGPT extraction: arxiv.org/abs/2304.11633
• LLMs for Evals: arxiv.org/abs/2401.07103

Especially interested in classification, extraction, summarization, translation, copyright regurgitation, toxicity, etc.

https://mlflow.org/docs/latest/llms/llm-evaluate/index.html

Components of LLM evaluation:

- Model to evaluate: an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable representing your model (e.g., a HuggingFace text summarization pipeline).
- Metrics: LLM evaluation uses specific LLM metrics.
- Evaluation data: a pandas DataFrame, a Python list, a NumPy array, or an mlflow.data.dataset.Dataset() instance.
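A minimal sketch of wiring those three pieces together with mlflow.evaluate; the data, column names, and the stub summarizer are placeholders:

```python
import mlflow
import pandas as pd

eval_data = pd.DataFrame({
    "inputs": ["<long solution text>"],        # placeholder documents
    "ground_truth": ["<reference summary>"],   # placeholder references
})

def summarize(df: pd.DataFrame) -> list:
    # Stand-in for any summarizer (e.g., a HuggingFace pipeline or an LLM call).
    return ["<model generated summary>" for _ in df["inputs"]]

with mlflow.start_run():
    results = mlflow.evaluate(
        summarize,                        # model: here, a plain Python callable
        eval_data,                        # evaluation data as a pandas DataFrame
        targets="ground_truth",
        model_type="text-summarization",  # selects summarization default metrics
    )
    print(results.metrics)
```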

manisnesan commented 7 months ago

Chain of Density Prompting technique

image

GPT 4 Summarization with Chain of Density Prompting

The following CoD prompt is taken directly from the paper.

Article: {{ ARTICLE }}

You will generate increasingly concise, entity-dense summaries of the above Article.

Repeat the following 2 steps 5 times.

Step 1. Identify 1-3 informative Entities ("; " delimited) from the Article which are missing from the previously generated summary.
Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

A Missing Entity is:

Guidelines:

Remember, use the exact same number of words for each summary.

Answer in JSON. The JSON should be a list (length 5) of dictionaries whose keys are "Missing_Entities" and "Denser_Summary".
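A rough sketch of driving the prompt above and keeping the last (densest) summary; the gpt-4 model name and the straight json.loads parse are assumptions:

```python
import json
import openai

article = "<article text>"  # placeholder

# cod_prompt holds the full CoD prompt text above, with the article spliced in
# where the template says {{ ARTICLE }}.
cod_prompt = f"Article: {article}\n\n" + "<rest of the CoD prompt text above>"

completion = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": cod_prompt}],
)

# The prompt asks for a JSON list of 5 dicts with "Missing_Entities" and "Denser_Summary".
steps = json.loads(completion.choices[0].message.content)
print(steps[-1]["Denser_Summary"])  # the densest of the five summaries
```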

manisnesan commented 7 months ago

Prompt engineering strategy - split complex tasks into simpler tasks.

Summarize long documents piecewise and construct the full summary recursively:
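A sketch of that recursion; the chunk size and the stub summarize() are placeholders for a real LLM call (e.g., the ChatBot class above):

```python
def summarize(text: str) -> str:
    # Placeholder single-call summarizer; swap in an actual LLM call.
    return text[:200]

def summarize_long(document: str, chunk_size: int = 8000) -> str:
    # Base case: the document already fits in one call.
    if len(document) <= chunk_size:
        return summarize(document)
    # Summarize each chunk, then recursively summarize the concatenated piece summaries.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    partial = "\n".join(summarize(chunk) for chunk in chunks)
    return summarize_long(partial, chunk_size)
```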

manisnesan commented 4 months ago

No statistically significant impact was found on LLM performance for problem-solving tasks when changing the sampling temperature from 0.0 to 1.0.

Recommended to set temperature to 0.0 for maximal reproducibility without an accuracy penalty.

https://x.com/sergeykarayev/status/1810535376295243888?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ