Open manisnesan opened 1 year ago
From https://til.simonwillison.net/gpt3/chatgpt-api
```python
import openai


class ChatBot:
    def __init__(self, system=""):
        self.system = system
        self.messages = []
        if self.system:
            self.messages.append({"role": "system", "content": self.system})

    def __call__(self, message):
        self.messages.append({"role": "user", "content": message})
        result = self.execute()
        self.messages.append({"role": "assistant", "content": result})
        return result

    def execute(self):
        completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=self.messages)
        # Uncomment this to print out token usage each time, e.g.
        # {"completion_tokens": 86, "prompt_tokens": 26, "total_tokens": 112}
        # print(completion.usage)
        return completion.choices[0].message.content


simon = ChatBot("You are a chatbot imitating Simon Willison. Pretend to be Simon.")
simon("Tell me about yourself")
```
https://medium.com/@agrofail/summarization-with-hugging-face-and-blurr-1b613265d155
BERTScore is an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, token similarity is computed using contextual embeddings.
https://twitter.com/syedmuzamilm/status/1654051264295055361?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ
Article and YouTube video summarizer using LangChain.
```python
import os
from langchain import OpenAI
from langchain.document_loaders import YoutubeLoader, WebBaseLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)


def document_splitter(docs):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=100,
    )
    splitted_docs = text_splitter.split_documents(docs)
    return splitted_docs


def get_summary(splitted_docs):
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    summary = chain.run(splitted_docs)
    return summary


def youtube_video_summariser(url):
    loader = YoutubeLoader.from_youtube_url(url)
    docs = loader.load()
    splitted_docs = document_splitter(docs)
    summary = get_summary(splitted_docs)
    return summary


def article_summariser(url):
    loader = WebBaseLoader(url)
    docs = loader.load()
    splitted_docs = document_splitter(docs)
    summary = get_summary(splitted_docs)
    return summary
```
Example Prompts
Prompt engineering may also help you unlock summarization capabilities of GPT-3. The idea here is pretty simple: in our prompt we put in a passage of text that we want to summarize and then add something like "To summarize: " or "TL;DR".
You could try out other prompts besides "TL;DR" like "in a couple words," "to summarize", "to simplify" and similar.
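A minimal sketch of the idea, reusing the pre-1.0 `openai.ChatCompletion` interface from the snippet above (the model choice and the passage are placeholders):

```python
import openai


def tldr_summary(passage):
    # Append a "TL;DR:" cue so the model completes with a summary.
    prompt = f"{passage}\n\nTL;DR:"
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip()


# Variants: swap "TL;DR:" for "To summarize:", "In a couple of words:", or "To simplify:".
```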
Text summarization by fine-tuning T5 or BART: https://wandb.ai/biased-ai/huggingface/reports/Text-Summarization-on-HuggingFace--Vmlldzo3ODA5MjI
Talks about the Challenges
One of the biggest challenges with summarization, however, is factuality: does the summary reflect accurately what the original document said? There are other characteristics, such as fluency and relevance that are also important, but LLMs are actually pretty good at both of those. Factuality (or its evil twin: hallucination) on the other hand is a known issue with LLMs. And it’s no use being fluent if you’re wrong.
Example Solution
```
✗ head -n 1 ./data/raw/public_solutions_req_fields.jsonl
{"title":"Unable to login to GUI in Redhat Enterprise Linux 6.2 after updating vmware tool","teaser":["- Users are able to login via ssh or console, but not able to login to GUI.- After accepting user name and password, GUI getting hung."],"issue":["- Users are able to login via ssh or console, but not able to login to GUI.- After accepting user name and password, GUI getting hung."],"environment":[" Red Hat Enterprise Linux 6.2 vmware vmware tools"],"rootcause":[" There is some problem with the interaction between vmware tools and RHEL6. "],"resolution":["* Uninstall the vmware tools and try again."],"diagnostic":["check ~/.xsession-errors logs ~ "Xlib: extension "VMWARE_CTRL" missing on display ":0.0"." ~"]}
```
Different response variations from the prompt
✗ llm "Summarize the provided json in 4 sentences or less concatenating the title, rootcause and issue. This will be used as a snippet for search results. $(head -n 1 ./data/raw/public_solutions_req_fields.jsonl)"
Unable to login to GUI in Red Hat Enterprise Linux 6.2 after updating VMware tool. Users can login via ssh or console, but not able to login to GUI. GUI hangs after accepting username and password. The interaction between VMWare tools and RHEL6 causes the problem. To resolve, uninstall the VMWare tools and try again.
✗ llm "Summarize the provided json in 4 sentences or less. This will be used as a snippet for search results. $(head -n 1 ./data/raw/public_solutions_req_fields.jsonl)"
Some users are unable to login to the GUI in Red Hat Enterprise Linux 6.2 after updating VMware tools. They can login via SSH or console, but the GUI hangs after accepting their username and password. This issue seems to be caused by a problem with the interaction between VMware tools and RHEL6. To resolve it, users should uninstall the VMware tools and try again. One possible diagnostic step is to check the ~/.xsession-errors logs for the error message "Xlib: extension ' VMWARE_CTRL' missing on display ':0.0'."
Added instruction to ensure summary consistency with the original content. Based on Paper: ChatGPT as a Factual Inconsistency Evaluator for Text Summarization
✗ llm "You are a summary generator. Summarize the json in 200 characters or less. The summary should be consistent with the json. Note that consistency means all information in summary is supported by json. This will be used as a snippet for search results. $(head -n 1 ./data/raw/public_solutions_req_fields.jsonl)"
Unable to login to GUI in Redhat Enterprise Linux 6.2 after updating vmware tool. Users can login via ssh or console, but GUI hangs after entering username and password. The issue may be due to a problem with the interaction between vmware tools and RHEL6. To resolve, uninstall vmware tools and try again. Check ~/.xsession-errors logs for "Xlib: extension "VMWARE_CTRL" missing on display ":0.0"."
SummEval - Summarization Evaluation
SummEval is a project that aims to provide resources for the evaluation of summarization systems. It includes summaries generated by different models, human annotations of focus and coverage, and a toolkit for computing various metrics and correlations. The project is a collaboration between the Yale LILY Lab and Salesforce Research, and the paper was published in TACL 2021. A multilingual follow-up, Multi_SummEval, covers eight languages: English, Indonesian, French, Turkish, Chinese, Russian, German, and Spanish. The GitHub repository contains the data, code, and instructions for using SummEval.
References: (1) Yale-LILY/SummEval: https://github.com/Yale-LILY/SummEval (2) fajri91/Multi_SummEval: https://github.com/fajri91/Multi_SummEval (3) YizhuLiu/summeval: https://github.com/YizhuLiu/summeval
Similar Problem on News Summarization
https://www.arxiv-vanity.com/papers/2301.13848/
This page is a research paper that investigates the summarization capabilities of large language models (LLMs).
Another way to evaluate is to treat it as entailment inference.
From https://arxiv.org/pdf/2303.15621.pdf
We provide ChatGPT with the source document and the corresponding generated summary, ask it to answer yes or no to infer the consistency between the two, and then collect the decisions from the outputs and aggregate the results.
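A rough sketch of that entailment-style check with the `openai` client; the exact prompt wording and aggregation below are illustrative, not the paper's:

```python
import openai

ENTAILMENT_PROMPT = (
    "Decide if the following summary is consistent with the corresponding article. "
    "Note that consistency means all information in the summary is supported by the article.\n"
    "Article: {article}\nSummary: {summary}\nAnswer (yes or no):"
)


def is_consistent(article, summary):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": ENTAILMENT_PROMPT.format(article=article, summary=summary)}],
    )
    # Collect the yes/no decision from the output; aggregate over a dataset
    # by averaging these booleans into a consistency rate.
    return completion.choices[0].message.content.strip().lower().startswith("yes")
```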
LLM-based summarization: A case study of human, Llama 2 70b and GPT-4 summarization quality
Summary: The article presents a detailed case study comparing the summarization quality of Llama 2 70b, GPT-4, and human legislative interns. The study involved a blind test in which a legal expert scored the summaries produced by the different authors. The results showed that GPT-4 outperformed both Llama 2 70b and the human interns in summarizing legislative bills. The study also discussed the insights gained from GPT-4's superior performance, which were used to improve the Llama 2 70b prompt and enhance its quality. It highlighted the challenges and potential biases in determining the authorship of the summaries, as well as specific issues with the Llama 2 70b prompts, and presented the modified prompt and the positive impact it had on the quality of the Llama 2 70b summaries. The study concluded that while GPT-4 outperformed Llama 2 70b and the human interns, using GPT-4 to direct the improvement of Llama 2 70b's prompts led to an enhancement in the latter's quality.
The study provides valuable insights into the comparative performance of AI models and human interns in summarizing legislative bills, as well as the potential for using superior models to improve the performance of others.
The last question is: "Given that Llama 2 70b costs approximately 3% of what GPT-4 costs, is there a way we can improve the performance of Llama 2 70b?"
[x] - Evaluation and hallucination detection for abstractive summaries - https://eugeneyan.com/writing/abstractive/
https://github.com/eugeneyan/visualizing-finetunes - bootstrap on wiki summaries to detect news article hallucinations
Out-of-Domain Finetuning to Bootstrap Hallucination Detection - https://eugeneyan.com/writing/finetuning/
From rasbt post - Flan T5 is a great go-to model for text classification.
Tiny Titans - Can smaller LLMs punch above their weight for meeting summarization?
Example prompt: "Generate a concise, accurate, and relevant summary of the following KCS solution, maintaining fluency and consistency with the original content."
How to evaluate a summarization task - Openai cookbook https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization
This article explains how to evaluate a summarization task using different methods, such as:
Automated evaluation using ROUGE and BERTScore. Requires a reference summary for the evaluation. Limitation: misses nuanced aspects such as fluency and coherence, and human labeling of references is a bottleneck.
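A small sketch of these reference-based metrics using the Hugging Face `evaluate` package (assumes `rouge_score` and `bert_score` are installed; the example pair is made up):

```python
import evaluate

predictions = ["Users cannot log in to the GUI after updating VMware tools; uninstalling the tools resolves it."]
references = ["Unable to login to GUI in RHEL 6.2 after updating VMware tools; uninstall the VMware tools to resolve."]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))  # rouge1 / rouge2 / rougeL scores

bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en"))  # precision / recall / f1 per pair
```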
LLM-based evaluation inspired by G-Eval uses a reference-free approach and can be done at scale. Bottlenecks are the context window and a potential bias toward LLM-generated snippets.
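A reference-free, G-Eval-inspired sketch in the same spirit, where an LLM scores each summary against its source on a 1-5 scale (the criteria list and prompt wording are my own illustrative choices):

```python
import openai

CRITERIA = ["relevance", "consistency", "fluency", "coherence"]


def llm_eval(document, summary):
    scores = {}
    for criterion in CRITERIA:
        prompt = (
            f"Rate the {criterion} of the summary with respect to the source document "
            "on a scale of 1 to 5. Reply with a single integer.\n\n"
            f"Source document:\n{document}\n\nSummary:\n{summary}\n\nScore:"
        )
        completion = openai.ChatCompletion.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        scores[criterion] = int(completion.choices[0].message.content.strip())
    return scores
```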
https://github.com/ibm-ecosystem-engineering/SuperKnowa - Build enterprise RAG (Retrieval Augmented Generation) pipelines to tackle various generative AI use cases with LLMs by simply plugging components together like Lego pieces. This repo is intended for IBM Ecosystem partners.
Guide to fine tune Flan T5 - https://www.datacamp.com/tutorial/flan-t5-tutorial
```json
{
"model_id": "google/flan-t5-xxl",
"inputs": [],
"parameters": {
"decoding_method": "greedy",
"temperature": 0.7,
"top_p": 1,
"top_k": 50,
"min_new_tokens": 10,
"max_new_tokens": 50
}
}
```
- `model_id`: the name of the LLM that will be used for the summarization task. Here it is `google/flan-t5-xxl`, a large pre-trained model based on the T5 architecture.
- `inputs`: an array containing the input texts to be summarized. It is empty here, meaning no input texts have been provided yet.
- `parameters`: an object containing the hyperparameters that control the decoding process:
  - `decoding_method`: the algorithm used to generate the summary. `greedy` means the model chooses the most probable token at each step.
  - `temperature`: the degree of randomness applied to the token probabilities. A higher temperature means more diversity and creativity, while a lower temperature means more coherence and consistency. 0.7 is a moderate value.
  - `top_p`: the probability mass considered when sampling the next token; only tokens within this cumulative probability are sampled. 1 means all tokens are considered.
  - `top_k`: the number of most probable tokens considered when sampling the next token. 50 means the model chooses from the top 50 tokens.
  - `min_new_tokens`: the minimum number of new tokens to generate for the summary, which prevents overly short or repetitive summaries. Here it is 10.
  - `max_new_tokens`: the maximum number of new tokens to generate for the summary, which prevents overly long or verbose summaries. Here it is 50.
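To make these decoding settings concrete, here is a hedged local approximation with Hugging Face `transformers`; `flan-t5-small` stands in for the xxl checkpoint, and the managed endpoint that accepts the JSON above is not reproduced here:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# flan-t5-small stands in for google/flan-t5-xxl so the example runs locally.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

text = "summarize: Users cannot log in to the GUI after updating VMware tools on RHEL 6.2 ..."
inputs = tokenizer(text, return_tensors="pt")

# do_sample=False is greedy decoding; temperature / top_p / top_k only take
# effect when sampling, so they are omitted here.
outputs = model.generate(
    **inputs,
    do_sample=False,
    min_new_tokens=10,
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```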
Uptrain - LLM evaluation tool from this post
My benchmark for LLMs - post started by Andrej Karpathy - Post
Chain of Density produces denser, more human-preferred summaries than vanilla GPT-4.
https://huggingface.co/ibm/labradorite-13b
https://x.com/eugeneyan/status/1764066697454182592?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ
What are some good resources on LM evals for downstream tasks (classification, summarization, translation)? Some I found:
• HELM: arxiv.org/abs/2211.09110
• NLG Systems: arxiv.org/abs/2008.12009
• LLMs Evals: arxiv.org/abs/2307.03109
• SummEval: arxiv.org/abs/2007.12626
• Benching LLMs for summarization: arxiv.org/abs/2301.13848
• MachineTranslate: machinetranslate.org/metrics
• Evaluating ChatGPT extraction: arxiv.org/abs/2304.11633
• LLMs for Evals: arxiv.org/abs/2401.07103
Especially interested in classification, extraction, summarization, translation, copyright regurgitation, toxicity, etc
https://mlflow.org/docs/latest/llms/llm-evaluate/index.html
Components of LLM Evaluation:
- Model to Evaluate: an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable representing your model (e.g., a HuggingFace text summarization pipeline).
- Metrics: LLM evaluation uses specific LLM metrics.
- Evaluation Data: the data your model is evaluated on can be a pandas DataFrame, a Python list, a NumPy array, or an mlflow.data.dataset.Dataset instance.
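A minimal sketch of that setup, assuming mlflow >= 2.8 (which accepts a plain Python callable as the model) plus `transformers`, `evaluate`, `rouge_score`, and `nltk` for the built-in summarization metrics; the checkpoint and column names are illustrative:

```python
import mlflow
import pandas as pd
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")


def summarize(df):
    # The callable receives the evaluation data and returns one prediction per row.
    return [out["summary_text"] for out in summarizer(df["text"].tolist(), truncation=True)]


eval_data = pd.DataFrame({
    "text": ["Red Hat Enterprise Linux 9 supports several file systems, and the guide covers creating, mounting, and resizing them ..."],
    "reference": ["The RHEL 9 guide explains how to create, mount, resize, and manage file systems."],
})

results = mlflow.evaluate(
    model=summarize,
    data=eval_data,
    targets="reference",
    model_type="text-summarization",  # enables the built-in ROUGE-based metrics
)
print(results.metrics)
```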
GPT 4 Summarization with Chain of Density Prompting
The following CoD prompt is taken directly from the paper.
Article: {{ ARTICLE }}
You will generate increasingly concise, entity-dense summaries of the above Article.
Repeat the following 2 steps 5 times.
Step 1. Identify 1-3 informative Entities ("; " delimited) from the Article which are missing from the previously generated summary.
Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.
A Missing Entity is:
Guidelines:
Remember, use the exact same number of words for each summary.
Answer in JSON. The JSON should be a list (length 5) of dictionaries whose keys are "Missing_Entities" and "Denser_Summary".
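A sketch of driving this prompt programmatically with the same `openai.ChatCompletion` interface used earlier and parsing the five passes; the GPT-4 model name and zero temperature are assumptions:

```python
import json
import openai


def chain_of_density(article, cod_prompt):
    # cod_prompt is the full CoD prompt text above, verbatim, including the
    # "{{ ARTICLE }}" placeholder, which is substituted with the article text.
    filled = cod_prompt.replace("{{ ARTICLE }}", article)
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": filled}],
    )
    # The prompt asks for a JSON list of 5 dicts with "Missing_Entities" and
    # "Denser_Summary" keys; the last entry is the densest summary.
    return json.loads(completion.choices[0].message.content)


# densest_summary = chain_of_density(article_text, COD_PROMPT)[-1]["Denser_Summary"]
```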
No statistically significant impact on LLM performance for problem-solving tasks when changing the sampling temperature from 0.0 to 1.0.
Recommendation: set temperature to 0.0 for maximal reproducibility without an accuracy penalty.
https://x.com/sergeykarayev/status/1810535376295243888?s=46&t=aOEVGBVv9ICQLUYL4fQHlQ
Expected Outcomes
Prompt: Summarize the content from the url (do not emit the url back) https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/managing_file_systems/index into a short summary.
Output: The documentation guide from Red Hat explains how to manage file systems in Red Hat Enterprise Linux 9. It covers various file systems, their benefits, and how to create, mount, resize, and manage them. The guide also provides practical examples and covers other topics such as partitioning, formatting disks, quotas, compression, encryption, and consistency checking and repair.
Outline Steps
1. Filter the KCS Solutions to only non-entitled / public solutions using `jq`. We are going to use managed LLM providers.
2. Filter to only these fields from each record.
3. Craft a prompt; sample code calls OpenAI GPT-3.5 (see the sketch below).
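A hedged sketch of steps 2-3, assuming the jq step has already produced `./data/raw/public_solutions_req_fields.jsonl` as shown earlier; the field list and the prompt reuse ones that appear above in this issue:

```python
import json
import openai

FIELDS = ["title", "issue", "rootcause", "resolution"]


def summarize_solution(record):
    # Keep only the required fields from the KCS record.
    slim = {k: record.get(k) for k in FIELDS}
    prompt = (
        "You are a summary generator. Summarize the json in 200 characters or less. "
        "The summary should be consistent with the json. "
        f"This will be used as a snippet for search results. {json.dumps(slim)}"
    )
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip()


with open("./data/raw/public_solutions_req_fields.jsonl") as f:
    first_record = json.loads(f.readline())
print(summarize_solution(first_record))
```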