Open manisnesan opened 1 year ago
From https://til.simonwillison.net/gpt3/chatgpt-api
```python
import openai


class ChatBot:
    def __init__(self, system=""):
        self.system = system
        self.messages = []
        if self.system:
            self.messages.append({"role": "system", "content": self.system})

    def __call__(self, message):
        self.messages.append({"role": "user", "content": message})
        result = self.execute()
        self.messages.append({"role": "assistant", "content": result})
        return result

    def execute(self):
        completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=self.messages)
        # Uncomment this to print out token usage each time, e.g.
        # {"completion_tokens": 86, "prompt_tokens": 26, "total_tokens": 112}
        # print(completion.usage)
        return completion.choices[0].message.content


simon = ChatBot("You are a chatbot imitating Simon Willison. Pretend to be Simon.")
simon("Tell me about yourself")
```
https://medium.com/@agrofail/summarization-with-hugging-face-and-blurr-1b613265d155
BERTScore is an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, token similarity is computed using contextual embeddings.
https://twitter.com/syedmuzamilm/status/1654051264295055361?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ
Article and YouTube video summarizer using LangChain.
```python
import os
from langchain import OpenAI
from langchain.document_loaders import YoutubeLoader, WebBaseLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)


def document_splitter(docs):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=100,
    )
    splitted_docs = text_splitter.split_documents(docs)
    return splitted_docs


def get_summary(splitted_docs):
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    summary = chain.run(splitted_docs)
    return summary


def youtube_video_summariser(url):
    loader = YoutubeLoader.from_youtube_url(url)
    docs = loader.load()
    splitted_docs = document_splitter(docs)
    summary = get_summary(splitted_docs)
    return summary


def article_summariser(url):
    loader = WebBaseLoader(url)
    docs = loader.load()
    splitted_docs = document_splitter(docs)
    summary = get_summary(splitted_docs)
    return summary
```
Example Prompts
Prompt engineering may also help you unlock summarization capabilities of GPT-3. The idea here is pretty simple: in our prompt we put in a passage of text that we want to summarize and then add something like "To summarize: " or "TL;DR".
You could try out other prompts besides "TL;DR" like "in a couple words," "to summarize", "to simplify" and similar.
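A minimal sketch of the idea, reusing the pre-1.0 `openai.ChatCompletion` interface from the snippet above (the model choice and the passage are placeholders):

```python
import openai


def tldr_summary(passage):
    # Append a "TL;DR:" cue so the model completes with a summary.
    prompt = f"{passage}\n\nTL;DR:"
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip()


# Variants: swap "TL;DR:" for "To summarize:", "In a couple of words:", or "To simplify:".
```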
Text summarization by fine-tuning T5 or BART: https://wandb.ai/biased-ai/huggingface/reports/Text-Summarization-on-HuggingFace--Vmlldzo3ODA5MjI
Talks about the Challenges
One of the biggest challenges with summarization, however, is factuality: does the summary reflect accurately what the original document said? There are other characteristics, such as fluency and relevance that are also important, but LLMs are actually pretty good at both of those. Factuality (or its evil twin: hallucination) on the other hand is a known issue with LLMs. And it’s no use being fluent if you’re wrong.
Example Solution
```
✗ head -n 1 ./data/raw/public_solutions_req_fields.jsonl
{"title":"Unable to login to GUI in Redhat Enterprise Linux 6.2 after updating vmware tool","teaser":["- Users are able to login via ssh or console, but not able to login to GUI.- After accepting user name and password, GUI getting hung."],"issue":["- Users are able to login via ssh or console, but not able to login to GUI.- After accepting user name and password, GUI getting hung."],"environment":[" Red Hat Enterprise Linux 6.2 vmware vmware tools"],"rootcause":[" There is some problem with the interaction between vmware tools and RHEL6. "],"resolution":["* Uninstall the vmware tools and try again."],"diagnostic":["check ~/.xsession-errors logs ~ "Xlib: extension "VMWARE_CTRL" missing on display ":0.0"." ~"]}
```
Different response variations from the prompt
✗ llm "Summarize the provided json in 4 sentences or less concatenating the title, rootcause and issue. This will be used as a snippet for search results. $(head -n 1 ./data/raw/public_solutions_req_fields.jsonl)"
Unable to login to GUI in Red Hat Enterprise Linux 6.2 after updating VMware tool. Users can login via ssh or console, but not able to login to GUI. GUI hangs after accepting username and password. The interaction between VMWare tools and RHEL6 causes the problem. To resolve, uninstall the VMWare tools and try again.
✗ llm "Summarize the provided json in 4 sentences or less. This will be used as a snippet for search results. $(head -n 1 ./data/raw/public_solutions_req_fields.jsonl)"
Some users are unable to login to the GUI in Red Hat Enterprise Linux 6.2 after updating VMware tools. They can login via SSH or console, but the GUI hangs after accepting their username and password. This issue seems to be caused by a problem with the interaction between VMware tools and RHEL6. To resolve it, users should uninstall the VMware tools and try again. One possible diagnostic step is to check the ~/.xsession-errors logs for the error message "Xlib: extension ' VMWARE_CTRL' missing on display ':0.0'."
Added instruction to ensure summary consistency with the original content. Based on Paper: ChatGPT as a Factual Inconsistency Evaluator for Text Summarization
✗ llm "You are a summary generator. Summarize the json in 200 characters or less. The summary should be consistent with the json. Note that consistency means all information in summary is supported by json. This will be used as a snippet for search results. $(head -n 1 ./data/raw/public_solutions_req_fields.jsonl)"
Unable to login to GUI in Redhat Enterprise Linux 6.2 after updating vmware tool. Users can login via ssh or console, but GUI hangs after entering username and password. The issue may be due to a problem with the interaction between vmware tools and RHEL6. To resolve, uninstall vmware tools and try again. Check ~/.xsession-errors logs for "Xlib: extension "VMWARE_CTRL" missing on display ":0.0"."
SummEval - Summarization Evaluation
SummEval is a project that aims to provide resources for the evaluation of summarization systems. It includes summaries generated by different models, human annotations of focus and coverage, and a toolkit for computing various metrics and correlations. The project is a collaboration between the Yale LILY Lab and Salesforce Research, and the paper was published in TACL 2021. A multilingual follow-up, Multi_SummEval, covers eight languages: English, Indonesian, French, Turkish, Chinese, Russian, German, and Spanish. The GitHub repository contains the data, code, and instructions for using SummEval.
References: (1) Yale-LILY/SummEval: https://github.com/Yale-LILY/SummEval (2) fajri91/Multi_SummEval: https://github.com/fajri91/Multi_SummEval (3) YizhuLiu/summeval: https://github.com/YizhuLiu/summeval
Similar Problem on News Summarization
https://www.arxiv-vanity.com/papers/2301.13848/
This page is a research paper that investigates the summarization capabilities of large language models (LLMs).
Another way to evaluate is to treat it as entailment inference.
From https://arxiv.org/pdf/2303.15621.pdf
We provide ChatGPT with the source document and the corresponding generated summary, ask it to answer yes or no to infer the consistency between the two, and then collect the decisions from the outputs and aggregate the results.
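A rough sketch of that entailment-style check with the `openai` client; the exact prompt wording and aggregation below are illustrative, not the paper's:

```python
import openai

ENTAILMENT_PROMPT = (
    "Decide if the following summary is consistent with the corresponding article. "
    "Note that consistency means all information in the summary is supported by the article.\n"
    "Article: {article}\nSummary: {summary}\nAnswer (yes or no):"
)


def is_consistent(article, summary):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": ENTAILMENT_PROMPT.format(article=article, summary=summary)}],
    )
    # Collect the yes/no decision from the output; aggregate over a dataset
    # by averaging these booleans into a consistency rate.
    return completion.choices[0].message.content.strip().lower().startswith("yes")
```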
LLM-based summarization: A case study of human, Llama 2 70b and GPT-4 summarization quality
Summary: The article presents a detailed case study comparing the summarization quality of Llama 2 70b, GPT-4, and human legislative interns. The study involved a blind test in which a legal expert scored the summaries produced by the different authors. The results showed that GPT-4 outperformed both Llama 2 70b and the human interns in summarizing legislative bills. The study also discussed the insights gained from GPT-4's superior performance, which were used to improve the Llama 2 70b prompt and enhance its quality. It highlighted the challenges and potential biases in determining the authorship of the summaries, as well as specific issues with the Llama 2 70b prompts, and presented the modified prompt and the positive impact it had on the quality of the Llama 2 70b summaries. The study concluded that while GPT-4 outperformed Llama 2 70b and the human interns, using GPT-4 to direct the improvement of Llama 2 70b's prompts led to an enhancement in the latter's quality.
The study provides valuable insights into the comparative performance of AI models and human interns in summarizing legislative bills, as well as the potential for using superior models to improve the performance of others.
The last question is: "Given that Llama 2 70b costs approximately 3% of what GPT-4 costs, is there a way we can improve the performance of Llama 2 70b?"
[x] - Evaluation and hallucination detection for abstractive summaries - https://eugeneyan.com/writing/abstractive/
https://github.com/eugeneyan/visualizing-finetunes - bootstrap on wiki summaries to detect news article hallucinations
Out-of-Domain Finetuning to Bootstrap Hallucination Detection - https://eugeneyan.com/writing/finetuning/
From rasbt post - Flan T5 is a great go-to model for text classification.
Tiny Titans - Can smaller LLMs punch above their weight for meeting summarization?
Example prompt: "Generate a concise, accurate, and relevant summary of the following KCS solution, maintaining fluency and consistency with the original content."
How to evaluate a summarization task - Openai cookbook https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization
This article explains how to evaluate a summarization task using different methods, such as:
Automated evaluation using ROUGE and BERTScore. Requires a reference summary for the evaluation. Limitation: misses nuanced aspects such as fluency and coherence, and human labeling of references is a bottleneck.
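A small sketch of these reference-based metrics using the Hugging Face `evaluate` package (assumes `rouge_score` and `bert_score` are installed; the example pair is made up):

```python
import evaluate

predictions = ["Users cannot log in to the GUI after updating VMware tools; uninstalling the tools resolves it."]
references = ["Unable to login to GUI in RHEL 6.2 after updating VMware tools; uninstall the VMware tools to resolve."]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))  # rouge1 / rouge2 / rougeL scores

bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en"))  # precision / recall / f1 per pair
```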
LLM-based evaluation inspired by G-Eval uses a reference-free approach and can be done at scale. Bottlenecks are the context window and a potential bias toward LLM-generated snippets.
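A reference-free, G-Eval-inspired sketch in the same spirit, where an LLM scores each summary against its source on a 1-5 scale (the criteria list and prompt wording are my own illustrative choices):

```python
import openai

CRITERIA = ["relevance", "consistency", "fluency", "coherence"]


def llm_eval(document, summary):
    scores = {}
    for criterion in CRITERIA:
        prompt = (
            f"Rate the {criterion} of the summary with respect to the source document "
            "on a scale of 1 to 5. Reply with a single integer.\n\n"
            f"Source document:\n{document}\n\nSummary:\n{summary}\n\nScore:"
        )
        completion = openai.ChatCompletion.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        scores[criterion] = int(completion.choices[0].message.content.strip())
    return scores
```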
https://github.com/ibm-ecosystem-engineering/SuperKnowa - Build enterprise RAG (Retrieval Augmented Generation) pipelines to tackle various generative AI use cases with LLMs by simply plugging components together like Lego pieces. This repo is intended for IBM Ecosystem partners.
Guide to fine tune Flan T5 - https://www.datacamp.com/tutorial/flan-t5-tutorial
```json
{
"model_id": "google/flan-t5-xxl",
"inputs": [],
"parameters": {
"decoding_method": "greedy",
"temperature": 0.7,
"top_p": 1,
"top_k": 50,
"min_new_tokens": 10,
"max_new_tokens": 50
}
}
```
- `model_id`: the name of the LLM that will be used for the summarization task. Here it is `google/flan-t5-xxl`, a large pre-trained model based on the T5 architecture.
- `inputs`: an array containing the input texts to be summarized. It is empty here, meaning no input texts have been provided yet.
- `parameters`: an object containing the hyperparameters that control the decoding process:
  - `decoding_method`: the algorithm used to generate the summary. `greedy` means the model chooses the most probable token at each step.
  - `temperature`: the degree of randomness applied to the token probabilities. A higher temperature means more diversity and creativity, while a lower temperature means more coherence and consistency. 0.7 is a moderate value.
  - `top_p`: the probability mass considered when sampling the next token; only tokens within this cumulative probability are sampled. 1 means all tokens are considered.
  - `top_k`: the number of most probable tokens considered when sampling the next token. 50 means the model chooses from the top 50 tokens.
  - `min_new_tokens`: the minimum number of new tokens to generate for the summary, which prevents overly short or repetitive summaries. Here it is 10.
  - `max_new_tokens`: the maximum number of new tokens to generate for the summary, which prevents overly long or verbose summaries. Here it is 50.
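To make these decoding settings concrete, here is a hedged local approximation with Hugging Face `transformers`; `flan-t5-small` stands in for the xxl checkpoint, and the managed endpoint that accepts the JSON above is not reproduced here:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# flan-t5-small stands in for google/flan-t5-xxl so the example runs locally.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

text = "summarize: Users cannot log in to the GUI after updating VMware tools on RHEL 6.2 ..."
inputs = tokenizer(text, return_tensors="pt")

# do_sample=False is greedy decoding; temperature / top_p / top_k only take
# effect when sampling, so they are omitted here.
outputs = model.generate(
    **inputs,
    do_sample=False,
    min_new_tokens=10,
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```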
Uptrain - LLM evaluation tool from this post
My benchmark for LLMs - post started by Andrej Karpathy - Post
Chain of Density produces denser, more human-preferred summaries than vanilla GPT-4.
https://huggingface.co/ibm/labradorite-13b
https://x.com/eugeneyan/status/1764066697454182592?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ
What are some good resources on LM evals for downstream tasks (classification, summarization, translation)? Some I found:
• HELM: arxiv.org/abs/2211.09110
• NLG Systems: arxiv.org/abs/2008.12009
• LLMs Evals: arxiv.org/abs/2307.03109
• SummEval: arxiv.org/abs/2007.12626
• Benching LLMs for summarization: arxiv.org/abs/2301.13848
• MachineTranslate: machinetranslate.org/metrics
• Evaluating ChatGPT extraction: arxiv.org/abs/2304.11633
• LLMs for Evals: arxiv.org/abs/2401.07103
Especially interested in classification, extraction, summarization, translation, copyright regurgitation, toxicity, etc
https://mlflow.org/docs/latest/llms/llm-evaluate/index.html
Components of LLM Evaluation:
- Model to Evaluate: an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable representing your model (e.g., a HuggingFace text summarization pipeline).
- Metrics: LLM evaluation uses specific LLM metrics.
- Evaluation Data: the data your model is evaluated on can be a pandas DataFrame, a Python list, a NumPy array, or an mlflow.data.dataset.Dataset instance.
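A minimal sketch of that setup, assuming mlflow >= 2.8 (which accepts a plain Python callable as the model) plus `transformers`, `evaluate`, `rouge_score`, and `nltk` for the built-in summarization metrics; the checkpoint and column names are illustrative:

```python
import mlflow
import pandas as pd
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")


def summarize(df):
    # The callable receives the evaluation data and returns one prediction per row.
    return [out["summary_text"] for out in summarizer(df["text"].tolist(), truncation=True)]


eval_data = pd.DataFrame({
    "text": ["Red Hat Enterprise Linux 9 supports several file systems, and the guide covers creating, mounting, and resizing them ..."],
    "reference": ["The RHEL 9 guide explains how to create, mount, resize, and manage file systems."],
})

results = mlflow.evaluate(
    model=summarize,
    data=eval_data,
    targets="reference",
    model_type="text-summarization",  # enables the built-in ROUGE-based metrics
)
print(results.metrics)
```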
GPT 4 Summarization with Chain of Density Prompting
The following CoD prompt is taken directly from the paper.
Article: {{ ARTICLE }}
You will generate increasingly concise, entity-dense summaries of the above Article.
Repeat the following 2 steps 5 times.
Step 1. Identify 1-3 informative Entities ("; " delimited) from the Article which are missing from the previously generated summary.
Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.
A Missing Entity is:
Guidelines:
Remember, use the exact same number of words for each summary.
Answer in JSON. The JSON should be a list (length 5) of dictionaries whose keys are "Missing_Entities" and "Denser_Summary".
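A sketch of driving this prompt programmatically with the same `openai.ChatCompletion` interface used earlier and parsing the five passes; the GPT-4 model name and zero temperature are assumptions:

```python
import json
import openai


def chain_of_density(article, cod_prompt):
    # cod_prompt is the full CoD prompt text above, verbatim, including the
    # "{{ ARTICLE }}" placeholder, which is substituted with the article text.
    filled = cod_prompt.replace("{{ ARTICLE }}", article)
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": filled}],
    )
    # The prompt asks for a JSON list of 5 dicts with "Missing_Entities" and
    # "Denser_Summary" keys; the last entry is the densest summary.
    return json.loads(completion.choices[0].message.content)


# densest_summary = chain_of_density(article_text, COD_PROMPT)[-1]["Denser_Summary"]
```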
No statistically significant impact on LLM performance for problem-solving tasks when changing the sampling temperature from 0.0 to 1.0.
Recommendation: set temperature to 0.0 for maximal reproducibility without an accuracy penalty.
https://x.com/sergeykarayev/status/1810535376295243888?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ
Expected Outcomes
Prompt: Summarize the content from the url (do not emit the url back) https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/managing_file_systems/index into a short summary.
Output: The documentation guide from Red Hat explains how to manage file systems in Red Hat Enterprise Linux 9. It covers various file systems, their benefits, and how to create, mount, resize, and manage them. The guide also provides practical examples and covers other topics such as partitioning, formatting disks, quotas, compression, encryption, and consistency checking and repair.
Outline Steps
1. Filter the KCS Solutions to only non-entitled / public solutions using `jq`. We are going to use managed LLM providers.
2. Filter to only these fields from each record.
3. Craft a prompt; sample code calls OpenAI GPT-3.5 (see the sketch below).
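A hedged sketch of steps 2-3, assuming the jq step has already produced `./data/raw/public_solutions_req_fields.jsonl` as shown earlier; the field list and the prompt reuse ones that appear above in this issue:

```python
import json
import openai

FIELDS = ["title", "issue", "rootcause", "resolution"]


def summarize_solution(record):
    # Keep only the required fields from the KCS record.
    slim = {k: record.get(k) for k in FIELDS}
    prompt = (
        "You are a summary generator. Summarize the json in 200 characters or less. "
        "The summary should be consistent with the json. "
        f"This will be used as a snippet for search results. {json.dumps(slim)}"
    )
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip()


with open("./data/raw/public_solutions_req_fields.jsonl") as f:
    first_record = json.loads(f.readline())
print(summarize_solution(first_record))
```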