Here you should obtain a pickle file with book name as key and full text of the book as value, then run chunk_data.py as stated in the README. Please read the "Pre-process data" section more carefully, and let me know if I misunderstood your question.
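For reference, a minimal sketch of building such a pickle file (this is not an official script from the repo; the books_txt/ folder and the output path are assumptions, and the texts are assumed to already exist as plain-text files):

```python
# Rough sketch: build a dict of {book name: full text} and pickle it.
# The books_txt/ folder and output file name are placeholders.
import os
import pickle

books = {}  # keys: book names, values: full texts of the books
for fname in os.listdir("books_txt"):
    if fname.endswith(".txt"):
        with open(os.path.join("books_txt", fname), encoding="utf-8") as f:
            books[os.path.splitext(fname)[0]] = f.read()

with open("all_books.pkl", "wb") as f:
    pickle.dump(books, f)
```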
Sorry, missed your comment.
The statement "obtain a pickle file with book name as key and full text of the book as value" is not exactly complete, because if I just do that and run chunk_data.py I get the error below:
0%| | 0/1 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (210368 > 1024). Running this sequence through the model will result in indexing errors
It looks like I have to create a pickle file that is within the 1024 token limit.
If I don't, and I just ignore the previous error, then when running get_summaries_hier.py I get:
Level 0 has 854757 chunks
Token indices sequence length is longer than the specified maximum sequence length for this model (2414872 > 1024). Running this sequence through the model will result in indexing errors
Level 0 chunk 0
Summary limit: 11
Token Limit: 11
Word Limit: 7
Prompt Size: 174 tokens
PROMPT:
---
Below is a part of a story:
---
s
---
So my question is: isn't there a proper script that takes a PDF and generates a pickle file that matches all the scripts' requirements?
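For context, the kind of helper I have in mind would be something along these lines (just a sketch, not from the repo; it assumes the pypdf package is installed, and the file names, book name, and output path are placeholders):

```python
# Rough sketch: extract text from a PDF and save it in the expected
# {book name: full text} pickle format. Extraction quality depends on the PDF.
import pickle
from pypdf import PdfReader

reader = PdfReader("my_book.pdf")
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)

books = {"my_book": full_text}  # book name -> full text of the book
with open("all_books.pkl", "wb") as f:
    pickle.dump(books, f)
```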
Hi, apologies for the late reply, my email notification didn't come through. I looked into the code, and there is no issue with chunk_data.py. As for "Token indices sequence length is longer than the specified maximum sequence length for this model (210368 > 1024). Running this sequence through the model will result in indexing errors", it's totally fine to ignore this message. The problem didn't come from this line, but from some small bugs within the summarization scripts. I have fixed them now and pushed the changes, so please check again and let me know if there is still an issue! I will make sure to get back to you quickly.
If there are no further questions, I'm closing this issue. But feel free to reach out if there is any problem.
Hello,
Great work, thank you for sharing.
One item, you mention : "Before running the data pre-processing script, you need to have a pickle file with a dictionary, where keys are book names and values are full texts of the books. Refer to data/example_all_books.pkl for an example."
I tried to generate the pickle file as below, but it's not exactly working out as an input to get_summaries_hier.py, and I feel the script chunk_data.py might need to be used.
Do you have any script that produces a proper pickle file?