gosha70 / document-assistant

RAG (Retrieval-Augmented Generation) framework for combining private vector databases that store unstructured documents with an LLM, providing a Chat/QA application that can be run locally.
Creative Commons Attribution Share Alike 4.0 International

pdf_converter.py doesn't work #1

Open Zhongxu-Wang opened 5 months ago

Zhongxu-Wang commented 5 months ago

Many functions in pdf_converter.py have parameters that do not match up, which causes a lot of confusion.

gosha70 commented 5 months ago

@malu01 Thank you for reporting the issue in pdf_converter.py.

I hope I have fixed all the bugs in this commit (at least the ones that prevented creating a vector store from PDFs and running the D.O.T. application).

Here is the test run I did by creating RAG from PDF for the Machine Learning A to Z course:

document-assistant % python3 -m embeddings.embedding_database --dir_path .../Machine_Learning_A_Z/ --file_types pdf --persist_directory ml_a_z_db
2024-04-27 19:57:33 - INFO - Creating the vectorsstore the arguments: Namespace(dir_path='.../Machine_Learning_A_Z/', zip_file=None, splits_directory=None, file_types=['pdf'], file_patterns=[], persist_directory='ml_a_z_db', model_name=None, collection_name=None, test_question=None)
2024-04-27 19:57:33 - INFO - Loading .pdf with names confirming the name pattern: '**/*'
2024-04-27 19:58:01 - INFO - PDF splits: 658
2024-04-27 19:58:01 - INFO - Loaded 658 .pdf documents
2024-04-27 19:58:01 - INFO - Total number of unstructured document splits: 659
2024-04-27 19:58:04 - INFO - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2024-04-27 19:58:08 - INFO - Processing the batch 1/3: 300 documents
2024-04-27 19:58:08 - INFO - Creating the embedding vectorstore with client=INSTRUCTOR(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: T5EncoderModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Dense({'in_features': 1024, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Normalize()
) model_name='hkunlp/instructor-large' cache_folder=None model_kwargs={'device': 'cpu'} encode_kwargs={'normalize_embeddings': True} embed_instruction='Represent the document for retrieval: ' query_instruction='Represent the question for retrieving supporting documents: ' for 300 document splits ...
2024-04-27 19:58:08 - INFO - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2024-04-27 20:01:25 - INFO - Finished the creation of embedding vectorstore.
2024-04-27 20:01:25 - INFO - Processing the batch 2/3: 300 documents
2024-04-27 20:01:25 - INFO - The async task #300 is updating the embedding vectorstore with 300 document splits ...
2024-04-27 20:01:25 - INFO - Processing the batch 3/3: 59 documents
2024-04-27 20:01:25 - INFO - The async task #600 is updating the embedding vectorstore with 59 document splits ...
2024-04-27 20:01:25 - INFO - Waiting for 2 async tasks to finish ...
The async task #600 just finished. Total finished tasks: 1
The async task #300 just finished. Total finished tasks: 2
2024-04-27 20:04:34 - INFO - All 2 async tasks finished.
2024-04-27 20:04:34 - INFO - Saving the vectorstore ...
Created MANIFEST.MF at: ml_a_z_db/META-INF/MANIFEST.MF
2024-04-27 20:04:34 - INFO - Finished the creation of vectorstore creation in 7.02 minutes.
2024-04-27 20:04:34 - INFO - The vectorstore stores 959 documents
2024-04-27 20:04:34 - INFO - The vectorstore: <langchain.vectorstores.chroma.Chroma object at 0x122233150>
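
For anyone reading the log above: the 659 splits are processed in three batches of 300, 300, and 59, and the async task numbers (#300, #600) appear to be the starting offsets of the second and third batches. Below is a minimal sketch of that chunking arithmetic only. It is not the repository's actual code; `iter_batches` and the batch size of 300 are assumptions inferred from the log.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Tuple[int, Sequence[T]]]:
    """Yield (start_offset, batch) pairs covering `items` in order.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe; the last batch is simply shorter.
        yield start, items[start:start + batch_size]


# Stand-in for the 659 document splits reported in the log.
splits = list(range(659))

batches = list(iter_batches(splits, batch_size=300))
batch_sizes = [len(batch) for _, batch in batches]
batch_offsets = [start for start, _ in batches]

print(batch_sizes)    # [300, 300, 59]
print(batch_offsets)  # [0, 300, 600]
```

The first batch creates the vectorstore synchronously in the log, while the remaining two are handed to async tasks; the sketch does not model that part.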