lilakk / BooookScore

A package to generate summaries of long-form text and evaluate the coherence of these summaries. Official package for our ICLR 2024 paper, "BooookScore: A systematic exploration of book-length summarization in the era of LLMs".
https://arxiv.org/abs/2310.00785
MIT License

Pickle file generation #1

Closed koala73 closed 9 months ago

koala73 commented 1 year ago

Hello,

Great work, thank you for sharing!

One item: you mention, "Before running the data pre-processing script, you need to have a pickle file with a dictionary, where keys are book names and values are full texts of the books. Refer to data/example_all_books.pkl for an example."
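
For clarity, I'm assuming the pickle should just hold a dict shaped like this (made-up titles and text, only to illustrate the structure):

book_data = {
    "A Tale of Two Cities": "It was the best of times, it was the worst of times, ...",
    "Moby-Dick": "Call me Ishmael. ...",
}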

I tried to generate the pickle file as below, but it's not exactly working out as an input to get_summaries_hier.py, and I feel the script chunk_data.py might need to be used.

Do you have a script that produces a proper pickle file?

from PyPDF2 import PdfReader
from transformers import GPT2Tokenizer
import os
import pickle

def read_pdf(file_path, start_page=0, end_page=None, token_limit=1024):
    # Greedily pack the words of the selected pages into chunks of at most
    # token_limit GPT-2 tokens each.
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    with open(file_path, 'rb') as f:
        pdf_reader = PdfReader(f)
        total_pages = len(pdf_reader.pages)
        end_page = end_page if end_page is not None else total_pages
        text_chunks = []
        current_chunk = ""

        for page_num in range(start_page, end_page):
            if page_num < total_pages:
                page = pdf_reader.pages[page_num]
                full_text = page.extract_text()
                for word in full_text.split():
                    temp_chunk = current_chunk + " " + word
                    if len(tokenizer.encode(temp_chunk)) < token_limit:
                        current_chunk = temp_chunk
                    else:
                        # The current chunk is full; save it and start a new one with this word.
                        text_chunks.append(current_chunk.strip())
                        current_chunk = word

        if current_chunk:
            text_chunks.append(current_chunk.strip())

    return text_chunks

def create_pickle_from_dir(input_dir, output_file, start_page=0, end_page=None, token_limit=1024):
    # Build {book name: text} for every PDF in the directory, joining the
    # chunks returned by read_pdf with "|||" as a separator.
    book_data = {}
    for file_name in os.listdir(input_dir):
        if file_name.endswith('.pdf'):
            file_path = os.path.join(input_dir, file_name)
            book_name = os.path.splitext(file_name)[0]
            book_data[book_name] = "|||".join(read_pdf(file_path, start_page, end_page, token_limit))

    with open(output_file, 'wb') as f:
        pickle.dump(book_data, f)

# Replace with actual paths
input_dir = 'data/booksPDF'  # Replace with your directory
output_file = 'data/all_books.pkl'  # Replace with your output file

# Optional: specify start and end page numbers, token limit
start_page = 8  # Replace with your start page
end_page = 423  # Replace with your end page
token_limit = 1024  # GPT-2's maximum token limit

create_pickle_from_dir(input_dir, output_file, start_page, end_page, token_limit)
lilakk commented 1 year ago

Here you should obtain a pickle file with the book name as key and the full text of the book as value, then run chunk_data.py as stated in the README. Please read the "Pre-process data" section more carefully, and let me know if I misunderstood your question.
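
For reference, here is a minimal sketch of what building that pickle could look like. It is only an illustration, not a script from this repo; it assumes PyPDF2 (or pypdf) extracts the text cleanly, and the paths are placeholders. The important part is that each value is the raw full text of a book, with no chunking and no "|||" separators; chunk_data.py takes care of chunking.

import os
import pickle
from PyPDF2 import PdfReader  # pypdf's PdfReader works the same way

def build_book_pickle(input_dir, output_file):
    # Map each book name to the raw, unchunked full text of the book.
    book_data = {}
    for file_name in os.listdir(input_dir):
        if not file_name.endswith('.pdf'):
            continue
        book_name = os.path.splitext(file_name)[0]
        reader = PdfReader(os.path.join(input_dir, file_name))
        # Concatenate the extracted text of every page into one string.
        book_data[book_name] = "\n".join(page.extract_text() or "" for page in reader.pages)
    with open(output_file, 'wb') as f:
        pickle.dump(book_data, f)

build_book_pickle('data/booksPDF', 'data/all_books.pkl')  # placeholder paths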

koala73 commented 1 year ago

Sorry, missed your comment.

The statement "obtain a pickle file with book name as key and full text of the book as value" is not exactly complete, because if I just do that and run chunk_data.py, I get the error below:

0%| | 0/1 [00:00<?, ?it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (210368 > 1024). Running this sequence through the model will result in indexing errors

It looks like I have to create a pickle file that is within the 1024-token limit.

If I don't, and I ignore the previous error, then when running get_summaries_hier.py I get:

Level 0 has 854757 chunks
Token indices sequence length is longer than the specified maximum sequence length for this model (2414872 > 1024). Running this sequence through the model will result in indexing errors
Level 0 chunk 0
Summary limit: 11
Token Limit: 11
Word Limit: 7
Prompt Size: 174 tokens
PROMPT:

---

Below is a part of a story:

---

s

---

So my question is: isn't there a proper script that takes a PDF and generates a pickle that matches all the scripts' requirements?
lilakk commented 1 year ago

Hi, apologies for the late reply; my email notification didn't get through. I looked into the code, and there is no issue with chunk_data.py. As for "Token indices sequence length is longer than the specified maximum sequence length for this model (210368 > 1024). Running this sequence through the model will result in indexing errors", it's totally fine to ignore this message. The problem didn't come from this line, but from some small bugs within the summarization scripts. I have fixed them and pushed the changes, so please check again and let me know if there is an issue! I will make sure to get back to you quickly.
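
As a quick sanity check before running chunk_data.py (just an illustrative snippet, not part of the repo; the path is a placeholder), you can load the pickle and confirm that every value is one long raw string rather than a pre-chunked list:

import pickle

with open('data/all_books.pkl', 'rb') as f:  # placeholder path
    books = pickle.load(f)

for name, text in books.items():
    # Each value should be the untouched full text of the book.
    assert isinstance(text, str)
    print(f"{name}: {len(text.split())} words")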

lilakk commented 9 months ago

If there are no further questions, I'm closing this issue. But feel free to reach out if there is any problem.